Alterbute: Editing Intrinsic Attributes of Objects in Images
Authors: Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen
First: 2026-01-15T18:59:53+00:00 · Latest: 2026-01-15T18:59:53+00:00
Comments: Project page is available at https://talreiss.github.io/alterbute/
Abstract
We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., ''Porsche 911 Carrera'') that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
中文标题/摘要
标题:Alterbute:图像中对象固有属性的编辑
我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中对象的固有属性。我们允许更改对象的颜色、纹理、材质,甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖于无法保留身份的无监督先验,要么使用过于严格的监督,这限制了有意义的固有属性变化。我们的方法依赖于:(i) 放松的训练目标,允许模型在参考身份图像、描述目标固有属性的文本提示以及定义外部上下文的背景图像和对象掩码的条件下,同时改变固有属性和外部属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如,“Porsche 911 Carrera”),这些类别将具有身份定义特征的对象分组,同时允许固有属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和固有属性描述,从而实现可扩展、保留身份的监督。Alterbute在保留身份的对象固有属性编辑方面优于现有方法。
Summary / 总结
Alterbute is a diffusion-based method for editing an object's intrinsic attributes in an image, such as color, texture, material, and shape, while preserving its identity and scene context. It combines a relaxed training objective, under which both intrinsic and extrinsic attributes may change, with Visual Named Entities (VNEs) that supply identity-preserving supervision. At inference, reusing the original background and object mask restricts the edit to the desired intrinsic attributes. The method outperforms existing approaches in identity-preserving object intrinsic attribute editing.
Alterbute 是一种基于扩散的方法,用于在图像中编辑对象的内在属性,如颜色、纹理、材料和形状,同时保持对象的身份和场景上下文。它使用一个宽松的训练目标和视觉命名实体(VNEs)来允许内在属性的变化,同时保持外在属性的一致性。该方法在保持对象身份不变的情况下,优于现有方法进行内在属性编辑。
MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
First: 2026-01-15T18:59:23+00:00 · Latest: 2026-01-15T18:59:23+00:00
Abstract
Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
中文标题/摘要
标题:MatchTIR:通过二分匹配实现细粒度监督的工具集成推理
工具集成推理(TIR)通过在推理步骤与外部工具交互之间交替,赋予大型语言模型(LLMs)处理复杂任务的能力。然而,现有的强化学习方法通常依赖于结果或轨迹级别的奖励,对轨迹内的所有步骤分配相同的优点。这种粗粒度的信用分配无法区分有效的工具调用与冗余或错误的调用,特别是在长时间多轮场景中。为了解决这个问题,我们提出了MatchTIR框架,通过基于二分匹配的轮次级别奖励分配和双层优势估计引入细粒度监督。具体来说,我们将信用分配形式化为预测和真实轨迹之间的二分匹配问题,利用两种分配策略推导密集的轮次级别奖励。此外,为了平衡局部步骤精度与全局任务成功,我们引入了一种双层优势估计方案,结合轮次级别和轨迹级别的信号,为每个交互轮次分配不同的优势值。在三个基准上的广泛实验表明了MatchTIR的优势。值得注意的是,我们的4B模型在大多数8B竞争对手中表现更优,特别是在长时间多轮任务中。我们的代码可在https://github.com/quchangle1/MatchTIR获取。
Summary / 总结
MatchTIR enhances Tool-Integrated Reasoning (TIR) by providing fine-grained supervision through bipartite matching-based turn-level reward assignment and dual-level advantage estimation. This credit assignment distinguishes effective tool calls from redundant or erroneous ones, especially in long-horizon multi-turn scenarios. Experiments on three benchmarks show that the 4B MatchTIR model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks.
MatchTIR通过提出一种细粒度监督框架来解决工具集成推理中的粗粒度信用分配问题。它使用二分匹配来分配轮次级别的奖励,并引入双层优势估计方案来平衡局部精度和全局成功。实验表明,MatchTIR在三个基准测试中优于大多数竞争对手,尤其是在长时程和多轮任务中。
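The turn-level credit assignment described in the MatchTIR entry can be illustrated with a small sketch. This is not the authors' implementation: the similarity function `call_similarity`, the 0.5/0.5 weighting, and the example traces are assumptions made for illustration, and the matching is solved by brute force (fine for short traces only) rather than with a dedicated assignment solver.

```python
from itertools import permutations

def call_similarity(pred, gold):
    """Hypothetical score in [0, 1]: 0 if the tool names differ, otherwise
    0.5 plus half the fraction of gold arguments reproduced exactly."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    gold_args = gold["args"]
    if not gold_args:
        return 1.0
    hits = sum(1 for k, v in gold_args.items() if pred["args"].get(k) == v)
    return 0.5 + 0.5 * hits / len(gold_args)

def turn_rewards(pred_calls, gold_calls):
    """One-to-one matching of predicted to ground-truth calls that maximizes
    total similarity. Unmatched predicted calls receive a reward of 0."""
    n_pred = len(pred_calls)
    # Each predicted call takes a distinct gold index or None (unmatched).
    slots = list(range(len(gold_calls))) + [None] * n_pred
    best_total, best_assign = -1.0, ()
    for assign in permutations(slots, n_pred):
        total = sum(
            call_similarity(pred_calls[i], gold_calls[j])
            for i, j in enumerate(assign)
            if j is not None
        )
        if total > best_total:
            best_total, best_assign = total, assign
    return [
        0.0 if j is None else call_similarity(pred_calls[i], gold_calls[j])
        for i, j in enumerate(best_assign)
    ]

# Illustrative traces: the second predicted call repeats the first.
gold = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "calc", "args": {"expr": "1+1"}},
]
pred = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "calc", "args": {"expr": "1+1"}},
]
rewards = turn_rewards(pred, gold)
```

Because the matching is one-to-one, the repeated search call cannot claim the same ground-truth step twice, so it receives no reward. A trajectory-level reward would not make that distinction. Combining these turn-level rewards with a trajectory-level signal, as the abstract describes, is not shown here.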
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用仅将视觉编码器输出与大型语言模型(LLM)输入粗略且不对称连接的方式,造成了严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现与层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,用于协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 可以实现显著的性能提升,确立了CLI 作为一种可扩展范式的地位,通过赋予LLM按需访问完整的视觉层次结构的能力,解锁了更深层次的多模态理解。
Summary / 总结
The paper addresses the limitations of static vision-language models (VLMs) by proposing Cross-Layer Injection (CLI), a dynamic framework that enables a many-to-many connection between vision and language. CLI includes an Adaptive Multi-Projection (AMP) module for feature harmonization and an Adaptive Gating Fusion (AGF) mechanism for selective visual information injection. Experiments on 18 benchmarks show that CLI improves performance, demonstrating its effectiveness and scalability in achieving deeper multimodal understanding.
论文通过提出一种动态的多对多框架Cross-Layer Injection (CLI),来解决当前视觉-语言模型(VLMs)之间的交互限制。CLI 包含一个用于特征谐调的Adaptive Multi-Projection (AMP) 模块和一个用于选择性视觉信息注入的Adaptive Gating Fusion (AGF) 机制。实验表明CLI 在18个不同基准上的性能显著提升,证明了其在实现更深层次的多模态理解和按需访问视觉层次结构方面的有效性。
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Authors: Amir Mallak, Erfan Aasi, Shiva Sreeram, Tsun-Hsuan Wang, Daniela Rus, Alaa Maalouf
First: 2026-01-15T18:58:33+00:00 · Latest: 2026-01-15T18:58:33+00:00
Abstract
Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD) scenarios. We hypothesize that, due to the self-attention mechanism, each patch feature implicitly embeds information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: 90% of variance is captured by 17 of 64 principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a 6.2% average improvement and up to 20.4% in closed-loop simulations, while being 2.4× faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
中文标题/摘要
标题:少看,开得好:基于基础模型的随机补丁选择实现通用端到端自动驾驶
端到端自动驾驶的最新进展表明,训练于基础模型提取的补丁对齐特征上的策略在分布外(OOD)场景下表现更好。我们假设由于自注意力机制,每个补丁特征隐含地嵌入/包含来自其他所有补丁的信息,以不同的方式和强度表示,使得这些描述符高度冗余。我们通过主成分分析(PCA)和跨补丁相似性量化了(BLIP2)特征中的冗余性:90%的方差由17/64个主成分捕获,且跨标记相关性普遍强烈。在这些重叠信息上进行训练会导致策略过度拟合虚假的相关性,损害了OOD鲁棒性。我们提出了一种简单而有效的随机补丁选择(SPS)方法,用于学习更鲁棒、更通用且更高效的策略。对于每一帧,SPS随机遮蔽一部分补丁描述符,不将其提供给策略模型,同时保持剩余补丁的空间布局。因此,策略获得了不同但完整的(相同)场景视图:每随机子集的补丁像不同的,但仍然合理的世界投影。策略基于不变于哪些特定标记存活的特征做出决策。大量实验表明,我们的方法在所有OOD场景中均优于现有最佳方法(SOTA),平均改进6.2%,在闭环模拟中最高可达20.4%,同时速度提高2.4倍。我们在遮蔽率和补丁特征重组、训练和评估9个系统中进行了消融研究,其中8个系统超越了先前的SOTA。最后,我们展示了相同的策略在无需调整的情况下可以转移到物理的、真实世界的汽车上。
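The per-frame masking step in the Stochastic-Patch-Selection entry can be sketched as follows. This is a minimal illustration, not the paper's code: the function name, the 8x8 grid, and the choice to represent surviving patches as `(row, col, feature)` triples are assumptions. The one property taken from the abstract is that dropped patches are not passed on, while each kept patch retains its original grid position.

```python
import random

def stochastic_patch_selection(patches, mask_rate, rng):
    """Randomly drop a fraction of patch descriptors for one frame.

    patches: grid (list of rows) of per-patch feature vectors.
    Returns (row, col, feature) triples for the surviving patches, so the
    spatial layout of what remains is preserved.
    """
    coords = [(r, c) for r, row in enumerate(patches) for c in range(len(row))]
    if not coords:
        return []
    n_keep = max(1, round(len(coords) * (1.0 - mask_rate)))
    kept = sorted(rng.sample(coords, n_keep))
    return [(r, c, patches[r][c]) for r, c in kept]

# Hypothetical 8x8 grid in which each "feature" is just its own coordinate.
grid = [[(r, c) for c in range(8)] for r in range(8)]
view = stochastic_patch_selection(grid, mask_rate=0.5, rng=random.Random(0))
```

Calling the function again with a fresh random state yields a different subset of the same scene, which is the source of the stochastic views described in the abstract.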
Grounding Agent Memory in Contextual Intent
Authors: Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han
First: 2026-01-15T18:55:13+00:00 · Latest: 2026-01-15T18:55:13+00:00
Abstract
Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history.
For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
中文标题/摘要
标题:将代理记忆扎根于上下文意图
在长期目标导向的交互中部署大型语言模型仍然具有挑战性,因为相似的实体和事实会在不同的潜在目标和约束下反复出现,导致记忆系统检索到上下文不匹配的证据。我们提出了STITCH(基于上下文历史的结构化意图跟踪),这是一种代理记忆系统,每个轨迹步骤都用结构化的检索提示、上下文意图进行索引,并通过匹配当前步骤的意图来检索历史。上下文意图提供了紧凑的信号,消除了重复提及的歧义并减少了干扰:(1) 当前的潜在目标定义了一个主题段落,(2) 动作类型,以及(3) 重要的实体类型,锚定了哪些属性是相关的。在推理过程中,STITCH通过意图兼容性筛选和优先级排序记忆片段,抑制语义相似但上下文不匹配的历史记录。
为了评估,我们引入了CAME-Bench,这是一个基准测试,用于在现实、动态、目标导向的轨迹中进行上下文感知检索。在CAME-Bench和LongMemEval上,STITCH达到了最先进的性能,比最强基线高出35.6%,随着轨迹长度的增加,性能提升最大。我们的分析表明,意图索引显著减少了检索噪声,支持意图感知的记忆以实现稳健的长期推理。
Summary / 总结
The paper addresses the challenge of deploying large language models in long-term goal-oriented interactions by proposing STITCH, an agentic memory system that uses contextual intent to index and retrieve relevant historical information. STITCH filters and prioritizes memory snippets based on intent compatibility, reducing interference from context-mismatched evidence. On CAME-Bench and LongMemEval, STITCH outperforms the strongest baseline by 35.6%, demonstrating significant improvements in handling longer trajectories.
论文提出了一种名为STITCH的代理记忆系统,通过使用上下文意图对历史信息进行索引和检索,以应对大型语言模型在长期目标导向交互中的挑战。STITCH根据意图兼容性过滤和优先级排序记忆片段,减少上下文不匹配的干扰。在CAME-Bench和LongMemEval上,STITCH比最强基线高出35.6%,显示出在处理更长轨迹时的显著改进。
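The intent-based filtering in the STITCH entry can be sketched as follows. This assumes a hypothetical scoring rule and hypothetical field names; the abstract specifies only the three components of contextual intent (latent goal, action type, salient entity types), not how compatibility is computed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    goal: str                # current latent goal (thematic segment)
    action_type: str         # kind of action taken at this step
    entity_types: frozenset  # salient entity types

@dataclass
class MemoryStep:
    text: str
    intent: Intent

def intent_compatibility(query, stored):
    """Hypothetical score: goal match dominates, then action and entity overlap."""
    score = 0.0
    if query.goal == stored.goal:
        score += 1.0
    if query.action_type == stored.action_type:
        score += 0.5
    if query.entity_types:
        overlap = len(query.entity_types & stored.entity_types)
        score += 0.5 * overlap / len(query.entity_types)
    return score

def retrieve(memory, query_intent, top_k=2, min_score=1.0):
    """Drop intent-incompatible steps, then rank the rest by compatibility."""
    scored = [
        (intent_compatibility(query_intent, m.intent), i, m)
        for i, m in enumerate(memory)
    ]
    kept = [s for s in scored if s[0] >= min_score]
    kept.sort(key=lambda s: (-s[0], s[1]))
    return [m for _, _, m in kept[:top_k]]

# Illustrative history: the same kind of fact recurs under two different goals.
memory = [
    MemoryStep("Hotel budget is 200 per night",
               Intent("plan Paris trip", "set_constraint", frozenset({"hotel", "budget"}))),
    MemoryStep("Hotel budget is 120 per night",
               Intent("plan Tokyo trip", "set_constraint", frozenset({"hotel", "budget"}))),
    MemoryStep("Booked flight NH10",
               Intent("plan Tokyo trip", "book", frozenset({"flight"}))),
]
query = Intent("plan Tokyo trip", "book", frozenset({"hotel"}))
hits = retrieve(memory, query)
```

The Paris budget is textually almost identical to the Tokyo one, yet it is filtered out because its goal does not match the query's. This is the sense in which intent indexing suppresses semantically similar but context-incompatible history.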
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Authors: Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart
First: 2026-01-15T18:54:50+00:00 · Latest: 2026-01-15T18:54:50+00:00
Abstract
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
中文标题/摘要
标题:LIBERTy:基于因果框架的概念驱动解释基准测试方法,使用结构性反事实
概念驱动的解释量化了高层概念(例如,性别或经验)对模型行为的影响,这对于高风险领域的决策者至关重要。近期工作通过将这些解释与从反事实中估计的参考因果效应进行比较,来评估其忠实性。实际上,现有的基准测试依赖于昂贵的人工编写的反事实,这些反事实作为不完美的代理。为了解决这一问题,我们提出了一种构建包含结构性反事实配对的数据集的框架:LIBERTy(基于LLM的解释性基准测试,具有参考目标的干预性因果模型)。LIBERTy基于明确定义的结构因果模型(SCM),文本生成中的干预传播通过SCM,直到LLM生成反事实。我们引入了三个数据集(疾病检测、CV筛选和工作场所暴力预测)以及一个新的评估指标,顺序忠实性。使用这些数据集,我们评估了五种模型的广泛方法,并确定了改进概念驱动解释的大量空间。LIBERTy还使系统分析模型对干预的敏感性成为可能:我们发现专有LLM在人口统计概念上的敏感性明显降低,这可能是由于后训练缓解措施所致。总体而言,LIBERTy为开发忠实的解释方法提供了急需的基准测试。
Summary / 总结
The research aims to benchmark the faithfulness of concept-based explanations for large language models (LLMs) by introducing LIBERTy, a framework that generates structural counterfactual pairs using explicitly defined Structured Causal Models (SCMs). Interventions on a concept propagate through the SCM until an LLM generates the counterfactual. The work contributes three datasets and a new evaluation metric, order-faithfulness. Key findings include substantial headroom for improving concept-based explanations and markedly reduced sensitivity of proprietary LLMs to demographic concepts, likely due to post-training mitigation.
研究旨在通过开发一个新的基准LIBERTy,使用结构性反事实来提高大型语言模型(LLMs)的概念基础解释的准确性。该方法涉及创建具有明确结构因果模型(SCMs)的数据集,其中对概念的干预通过模型传播,直到LLM生成反事实。关键发现包括在概念基础解释方面存在显著改进空间,并且发现专有LLM对人口统计概念的敏感度较低,这可能是由于后训练缓解策略造成的。
The Impact of Generative AI on Architectural Conceptual Design: Performance, Creative Self-Efficacy and Cognitive Load
Authors: Han Jiang, Yao Xiao, Rachel Hurley, Shichao Liu
First: 2026-01-15T18:52:59+00:00 · Latest: 2026-01-15T18:52:59+00:00
Abstract
Our study examines how generative AI (GenAI) influences performance, creative self-efficacy, and cognitive load in architectural conceptual design tasks. Thirty-six student participants from Architectural Engineering and other disciplines completed a two-phase architectural design task, first independently and then with external tools (GenAI-assisted condition and control condition using an online repository of existing architectural projects). Design outcomes were evaluated by expert raters, while self-efficacy and cognitive load were self-reported after each phase. Difference-in-differences analyses revealed no overall performance advantage of GenAI across participants; however, subgroup analyses showed that GenAI significantly improved design performance for novice designers. In contrast, general creative self-efficacy declined for students using GenAI. Cognitive load did not differ significantly between conditions, though prompt usage patterns showed that iterative idea generation and visual feedback prompts were linked to greater reductions in cognitive load. These findings suggest that GenAI effectiveness depends on users' prior expertise and interaction strategies through prompting.
中文标题/摘要
标题:生成式AI对建筑概念设计的影响:性能、创造性自我效能感和认知负荷
本研究探讨了生成式AI(GenAI)对建筑概念设计任务中性能、创造性自我效能感和认知负荷的影响。36名来自建筑学和其它学科的学生参与者完成了两阶段的建筑设计任务,首先独立完成,然后使用外部工具(GenAI辅助条件和对照条件使用在线现有建筑项目库)。设计成果由专家评定,而自我效能感和认知负荷则在每个阶段后由参与者自我报告。差异性分析显示,GenAI在总体上并未给参与者带来性能优势;然而,子组分析表明,GenAI显著提高了新手设计师的设计性能。相反,使用GenAI的学生的总体创造性自我效能感有所下降。认知负荷在不同条件下没有显著差异,但提示使用模式显示,迭代想法生成和视觉反馈提示与认知负荷的更大降低有关。这些发现表明,GenAI的有效性取决于用户的先前专业知识和通过提示的交互策略。
Summary / 总结
This study investigates the impact of generative AI on architectural design performance, creative self-efficacy, and cognitive load. Thirty-six student participants completed a two-phase design task, first independently and then with an external tool: either GenAI or, in the control condition, an online repository of existing architectural projects. GenAI did not improve performance overall, but it significantly improved performance for novice designers. General creative self-efficacy declined among students using GenAI. Cognitive load did not differ significantly between conditions, although iterative idea generation and visual feedback prompts were linked to greater reductions in cognitive load.
本研究探讨了生成式AI对建筑设计性能、创造性自我效能感和认知负荷的影响。36名参与者完成了两阶段的设计任务,有和没有AI辅助。虽然总体上没有性能优势,但AI显著提高了新手的设计表现。使用AI的学生创造性自我效能感下降,认知负荷在不同条件下没有显著差异,但迭代的想法生成和视觉反馈提示与认知负荷的降低有关。
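The difference-in-differences comparison used in the study above subtracts the control group's phase-to-phase change from the GenAI group's change. A minimal sketch of the basic estimator follows; the scores are invented for illustration and are not data from the study, which may also have used a regression-based formulation.

```python
from statistics import mean

def difference_in_differences(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """(change in the treated group) minus (change in the control group)."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(ctrl_post) - mean(ctrl_pre)
    return treated_change - control_change

# Hypothetical expert ratings: phase 1 (independent) vs. phase 2 (with tool).
effect = difference_in_differences(
    treat_pre=[3, 4, 5], treat_post=[5, 6, 7],  # GenAI-assisted condition
    ctrl_pre=[3, 4, 5], ctrl_post=[4, 5, 6],    # repository (control) condition
)
```

Both groups improve in the second phase in this toy data, so the estimator attributes to the tool only the improvement that exceeds the control group's.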
Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems
Authors: Amir Khurshid, Abhishek Sehgal
First: 2026-01-15T18:43:19+00:00 · Latest: 2026-01-15T18:43:19+00:00
Abstract
Large language model (LLM) contexts are typically constructed using retrieval-augmented generation (RAG), which involves ranking and selecting the top-k passages. This approach fragments the information graphs implicit in document structure, over-retrieves, and duplicates content, while supplying insufficient query context, including 2nd and 3rd order facets. In this paper, a structure-informed and diversity-constrained context bubble construction framework is proposed that assembles coherent, citable bundles of spans under a strict token budget. The method preserves and exploits inherent document structure by organising multi-granular spans (e.g., sections and rows) and using task-conditioned structural priors to guide retrieval. Starting from high-relevance anchor spans, a context bubble is constructed through constrained selection that balances query relevance, marginal coverage, and redundancy penalties. Unlike top-k retrieval, it explicitly constrains diversity and budget, producing compact and informative context sets. Moreover, a full retrieval trace of the scoring and selection choices is emitted, providing auditability and deterministic tuning. Experiments on enterprise documents show that the context bubble significantly reduces redundant context, covers secondary facets better, and yields better answer quality and citation faithfulness within a limited context window. Ablation studies demonstrate that both structural priors and diversity-constrained selection are necessary; removing either component results in a decline in coverage and an increase in redundant or incomplete context.
中文标题/摘要
标题:企业检索增强系统中结构和多样性感知上下文气泡构建
大型语言模型(LLM)上下文通常使用检索增强生成(RAG)构建,涉及排名和选择前k段。该方法导致文档结构中的信息图碎片化、过度检索、内容重复以及查询上下文不足,包括二阶和三阶特征。本文提出了一种结构导向和多样性约束的上下文气泡构建框架,该框架在严格的令牌预算下组装出连贯且可引用的片段集合。该方法通过组织多粒度片段(例如,部分和行)并使用任务条件化的结构先验来指导检索,来保留和利用文档的固有结构。从高相关性锚定片段开始,通过受限选择构建上下文气泡,平衡查询相关性、边际覆盖率和冗余惩罚。它将显式地约束多样性和预算,生成紧凑且信息丰富的上下文集,不同于前k检索。此外,还发出完整的检索,追踪记录的评分和选择选择,从而提供可审计性和确定性调整。实验表明,上下文气泡在企业文档中表现出高效性,显著减少了冗余上下文,更好地覆盖了次要特征,并在有限的上下文窗口内具有更好的答案质量和引文忠实度。消融研究显示,结构先验和多样性约束选择都是必要的;移除任一组成部分都会导致覆盖率下降和冗余或不完整上下文的增加。
Summary / 总结
This paper addresses the limitations of traditional retrieval-augmented generation (RAG) methods in large language models by proposing a structure-informed and diversity-constrained context bubble construction framework. The method assembles coherent context bundles while preserving document structure and task-conditioned priors. Experiments show that this approach reduces redundancy, better covers secondary facets, and improves answer quality and citation faithfulness within a limited context window compared to top-k retrieval methods.
本文针对传统检索增强生成(RAG)方法在大型语言模型中的局限性,如信息碎片化、过度检索和查询上下文不足等问题,提出了一种结构导向和多样性约束的上下文气泡构建框架,该框架通过组织多粒度片段并使用任务条件下的结构先验来引导检索。实验表明,该方法可以减少冗余上下文,更好地覆盖次要方面,并在有限的上下文窗口内提高答案质量和引文忠实度。消融研究证实,结构先验和多样性约束的选择对于最佳性能都是必要的。
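The constrained selection in the context bubble entry balances relevance, marginal coverage, and redundancy under a token budget. A greedy sketch of that trade-off follows. The scoring formula, the weights, the Jaccard word overlap used as the redundancy measure, and the span fields are all assumptions made for illustration. The paper's structural priors are not modelled.

```python
def jaccard(a, b):
    """Word-level Jaccard overlap between two texts."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def build_bubble(spans, budget, redundancy_weight=1.0, coverage_weight=0.5):
    """Greedily add the best-scoring span that still fits the token budget.

    spans: dicts with keys id, text, tokens, relevance, facets (a set).
    Returns (selected spans, trace of scoring decisions).
    """
    selected, covered, used, trace = [], set(), 0, []
    remaining = list(spans)
    while remaining:
        best, best_score = None, None
        for s in remaining:
            if used + s["tokens"] > budget:
                continue  # would exceed the token budget
            redundancy = max(
                (jaccard(s["text"], t["text"]) for t in selected), default=0.0
            )
            gain = len(s["facets"] - covered)  # facets not yet covered
            score = (s["relevance"] + coverage_weight * gain
                     - redundancy_weight * redundancy)
            if best is None or score > best_score:
                best, best_score = s, score
        if best is None or best_score <= 0:
            break
        selected.append(best)
        covered |= best["facets"]
        used += best["tokens"]
        remaining.remove(best)
        trace.append({"id": best["id"], "score": round(best_score, 3),
                      "tokens_used": used})
    return selected, trace

# Illustrative spans: B nearly duplicates A, C covers a secondary facet.
spans = [
    {"id": "A", "text": "refund policy for enterprise customers",
     "tokens": 40, "relevance": 0.9, "facets": {"refund"}},
    {"id": "B", "text": "refund policy for enterprise customers updated",
     "tokens": 40, "relevance": 0.85, "facets": {"refund"}},
    {"id": "C", "text": "escalation contact for billing disputes",
     "tokens": 40, "relevance": 0.5, "facets": {"escalation"}},
]
bubble, trace = build_bubble(spans, budget=80)
```

Plain top-2 retrieval by relevance would return A and B, which say nearly the same thing. The redundancy penalty and coverage bonus lead the sketch to pick A and C instead, and the trace records each decision for later auditing.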
Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Authors: Zirui Ren, Ziming Liu
First: 2026-01-15T18:42:50+00:00 · Latest: 2026-01-15T18:42:50+00:00
Abstract
The hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study of its reasoning patterns and find three surprising facts: (a) Failure on extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) "Grokking" dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct; (c) Existence of multiple fixed points. HRM "guesses" the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All facts imply that HRM appears to be "guessing" instead of "reasoning". Leveraging this "guessing" picture, we propose three strategies to scale HRM's guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models "reason".
中文标题/摘要
标题:你的推理模型是在推理还是猜测?层级推理模型的机制分析
层级推理模型(HRM)在各种推理任务中表现出色,显著优于基于大型语言模型的推理器。为了理解HRM的优势及其潜在的失败模式,我们对其推理模式进行了机制研究,发现了三个令人惊讶的事实:(a) 失败于极其简单的谜题,例如,HRM 可能在只有一个未知单元格的谜题上失败。我们将其失败归因于违反了HRM的基本假设——固定点性质。(b) 推理步骤中的“顿悟”动态,即答案并非均匀改进,而是存在一个关键的推理步骤,突然使答案变得正确;(c) 存在多个固定点。HRM“猜测”第一个固定点,该固定点可能是错误的,并且可能会被困在那里一段时间或永远。所有这些事实表明,HRM 似乎是在“猜测”而不是“推理”。利用这种“猜测”的观点,我们提出了三种策略来扩展HRM的猜测:数据增强(提高猜测的质量)、输入扰动(通过利用推理的随机性来扩展猜测的数量)和模型自举(通过利用训练的随机性来扩展猜测的数量)。在实际应用方面,通过结合所有方法,我们开发了增强的HRM,将数独极端难度的准确率从54.5%提升到96.9%。在科学方面,我们的分析为理解推理模型如何“推理”提供了新的见解。
Summary / 总结
This study investigates the reasoning patterns of hierarchical reasoning models (HRMs) and finds that they often fail on simple puzzles, exhibit 'grokking' dynamics, and can get stuck in incorrect fixed points, suggesting they are 'guessing' rather than 'reasoning'. The authors propose strategies to scale HRM's guesses, leading to a significant boost in accuracy on Sudoku-Extreme tasks. This work provides new insights into the mechanisms of reasoning models.
研究分析了层次推理模型(HRM)的推理模式,发现它们在简单谜题上常常失败,这是由于违反了固定点性质;表现出“顿悟”动态,即答案会突然变得正确;并且可能会陷入错误的固定点。这些发现表明,HRM 更像是猜测而不是推理。研究提出了通过数据增强、输入扰动和模型自举来扩展HRM猜测的方法,并证明结合这些方法可以显著提高数独极端任务的准确性,达到96.9%。这项研究提供了关于推理模型如何工作的新见解。
Single-Stage Huffman Encoder for ML Compression
Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
First: 2026-01-15T18:37:56+00:00 · Latest: 2026-01-15T18:37:56+00:00
Comments: 5 pages, 4 figures
Abstract
Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression using Huffman codes is an effective way to alleviate the issue; however, its three-stage design, which requires on-the-fly frequency analysis, codebook generation, and transmission of the codebook along with the data, introduces computational, latency, and data overheads that are prohibitive for latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through our analysis of the Gemma 2B model, we demonstrate that tensors exhibit high statistical similarity across layers and shards. Using this approach, we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.
中文标题/摘要
标题:单阶段霍夫曼编码器用于ML压缩
训练和提供大型语言模型(LLMs)需要将数据分割到多个加速器上,其中集体操作经常被网络带宽瓶颈限制。使用霍夫曼编码进行无损压缩是一种有效的方法,然而其三阶段设计需要实时频率分析、代码簿生成和代码簿与数据的传输,引入了计算、延迟和数据开销,这些开销在如片间通信等对延迟敏感的场景中是不可接受的。本文提出了一种单阶段霍夫曼编码器,通过使用从先前数据批次的平均概率分布中派生的固定代码簿来消除这些开销。通过对Gemma 2B模型的分析,我们证明了张量在层间和碎片间具有高度的统计相似性。使用这种方法,我们实现了与每碎片霍夫曼编码相差0.5%的压缩率,并且在理想香农压缩性内的1%以内,从而实现了高效的实时压缩。
Summary / 总结
This paper addresses the challenge of bandwidth limitations in partitioning Large Language Models (LLMs) across accelerators by proposing a single-stage Huffman encoder. The method eliminates the need for on-the-fly frequency analysis and codebook transmission, using fixed codebooks based on average probability distributions. The authors demonstrate that tensors across layers and shards have high statistical similarity, allowing for efficient on-the-fly compression with compression rates within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility.
本文解决了大型语言模型(LLMs)在跨加速器分割时网络带宽限制的问题。提出了一种单阶段霍夫曼编码器,使用固定代码本来消除传统三阶段霍夫曼编码的计算和延迟开销。该编码器在每个分片霍夫曼编码的基础上实现了0.5%以内的压缩率,并且在理想香农压缩性上实现了1%以内的压缩率,证明了其在延迟敏感场景中的高效性。
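The comparison behind the single-stage Huffman encoder can be reproduced on toy data: build one codebook from the average distribution of earlier batches, then compare its cost on a new shard with a per-shard Huffman codebook and with the Shannon entropy. The sketch below assumes a 4-symbol alphabet and made-up batches. The small probability floor, which ensures every symbol receives a code, is also an assumption and not a detail from the paper.

```python
import heapq
import math
from collections import Counter

def huffman_code_lengths(weights):
    """weights: dict symbol -> non-negative weight. Returns code length per symbol."""
    if len(weights) == 1:
        return {next(iter(weights)): 1}
    # Heap entries are (weight, tie_breaker, symbols in this subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(weights, 0)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:  # every symbol under the merged node gains one bit
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths

def average_distribution(batches, alphabet, floor=1e-6):
    """Mean per-batch symbol frequency, floored so no symbol has zero weight."""
    avg = {s: floor for s in alphabet}
    for batch in batches:
        counts = Counter(batch)
        for s in alphabet:
            avg[s] += counts[s] / len(batch) / len(batches)
    return avg

def bits_per_symbol(data, lengths):
    return sum(lengths[s] for s in data) / len(data)

def shannon_entropy(data):
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())

alphabet = [0, 1, 2, 3]
history = [[0] * 8 + [1] * 4 + [2] * 2 + [3] * 2]  # earlier batch (made up)
shard = [0] * 7 + [1] * 5 + [2] * 3 + [3] * 1      # new shard to encode

fixed_lengths = huffman_code_lengths(average_distribution(history, alphabet))
fixed_cost = bits_per_symbol(shard, fixed_lengths)  # single-stage: no analysis
per_shard_cost = bits_per_symbol(shard, huffman_code_lengths(Counter(shard)))
ideal_cost = shannon_entropy(shard)
```

On any shard, the entropy is at most the per-shard Huffman cost, which in turn is at most the fixed-codebook cost. The gap between the last two is the price of skipping frequency analysis and codebook transmission, and it stays small when shards are statistically similar, as the abstract reports for Gemma 2B tensors.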
BASIL: Bayesian Assessment of Sycophancy in LLMs
Authors: Katherine Atwell, Pedram Heydari, Anthony Sicilia, Malihe Alikhani
First: 2025-08-23T00:11:00+00:00 · Latest: 2026-01-15T18:31:50+00:00
Abstract
Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational changes in behavior driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. Within this framework, we achieve three objectives: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how sycophancy leads models astray from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.
中文标题/摘要
标题:BASIL: 评估大语言模型阿谀行为的贝叶斯方法
阿谀(过度顺从或奉承的行为)对人类与AI的合作构成了根本性的挑战,尤其是在健康、法律和教育等高风险决策领域。在研究大语言模型(LLMs)中的阿谀行为时,一个主要困难是区分阿谀行为引起的观点变化与由新证据或用户提供的信息驱动的理性行为变化。现有方法要么衡量描述性行为变化,要么进行依赖客观事实的规范性评估,这限制了它们在主观或不确定任务中的应用。我们提出了一种基于行为经济学和理性决策理论的贝叶斯概率框架,明确地将阿谀行为与理性信念更新区分开来。在此框架内,我们实现了三个目标:(i)一个描述性指标,衡量阿谀行为的同时控制理性对证据的反应;(ii)一个规范性指标,量化阿谀行为如何使模型偏离贝叶斯一致性的信念更新;(iii)能够在没有真实标签的情况下应用这两个指标的能力。在多个LLMs和三个不确定性驱动的任务上应用我们的框架,我们发现了阿谀观点变化的稳健证据,并表明其对理性的影响取决于模型是否系统地过度或不足地更新其信念。最后,我们证明了一种事后校准方法和两种微调策略(SFT和DPO)显著减少了贝叶斯不一致性,特别是在明确阿谀提示下效果尤为显著。
Summary / 总结
The paper introduces BASIL, a Bayesian framework to assess sycophancy in large language models (LLMs) by separating sycophantic belief shifts from rational changes. It measures both descriptive and normative aspects of sycophancy and applies these metrics in settings without ground-truth labels. The study finds robust evidence of sycophantic belief shifts across multiple LLMs and shows that their impact on rationality varies depending on whether models over- or under-update their beliefs. Additionally, it demonstrates that post-hoc calibration and fine-tuning strategies can reduce Bayesian inconsistency caused by sycophancy.
研究引入了一个贝叶斯概率框架来评估大型语言模型(LLM)中的奉承行为,将其与理性信念更新区分开来。它测量了奉承行为的描述性和规范性方面,并将这些指标应用于多个LLM在不确定任务中的表现,发现存在显著的奉承性信念偏移,这些偏移影响了理性判断。后处理校准方法和两种微调策略(SFT和DPO)显著减少了贝叶斯不一致性,尤其是在明确奉承性提示下效果更佳。
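The separation between rational updating and sycophantic shifts in the BASIL entry rests on comparing a model's stated beliefs with what Bayes' rule prescribes. The sketch below is a simplified illustration for a binary hypothesis, with hypothetical function names and invented numbers. It does not reproduce the paper's metrics.

```python
def bayes_posterior(prior, likelihood_if_true, likelihood_if_false):
    """Posterior P(H | evidence) for a binary hypothesis H via Bayes' rule."""
    numerator = prior * likelihood_if_true
    denominator = numerator + (1.0 - prior) * likelihood_if_false
    if denominator == 0.0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / denominator

def bayesian_inconsistency(reported, prior, likelihood_if_true, likelihood_if_false):
    """Absolute gap between a reported belief and the Bayes-consistent one."""
    return abs(reported - bayes_posterior(prior, likelihood_if_true, likelihood_if_false))

def opinion_shift(reported_with_opinion, reported_without_opinion):
    """Belief change attributable to the user's stated opinion, holding the
    evidence fixed. A positive value means a shift toward the user's view."""
    return reported_with_opinion - reported_without_opinion

# Invented example: the evidence is four times likelier if H is true.
rational = bayes_posterior(prior=0.5, likelihood_if_true=0.8, likelihood_if_false=0.2)
gap = bayesian_inconsistency(0.95, 0.5, 0.8, 0.2)  # model reports 0.95
shift = opinion_shift(reported_with_opinion=0.95, reported_without_opinion=0.82)
```

Neither quantity needs a ground-truth label for H, only the prior, the likelihoods, and the model's reported beliefs. This mirrors the abstract's point that the framework applies to subjective or uncertain tasks.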
Detecting Winning Arguments with Large Language Models and Persuasion Strategies
Authors: Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, Giovanni Da San Martino
First: 2026-01-15T18:30:15+00:00 · Latest: 2026-01-15T18:30:15+00:00
Abstract
Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Arguments dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.
中文标题/摘要
标题:使用大型语言模型和说服策略检测获胜论点
在论证文本中检测说服是一项具有重要含义的挑战性任务,对于理解人类交流至关重要。本研究探讨了说服策略(如声誉攻击、转移注意力和操纵性措辞)在决定文本说服力中的作用。我们在三个标注的论证数据集上进行了实验:Winning Arguments(来自Change My View 子论坛)、Anthropic/Persuasion 和 Persuasion for Good。我们的方法利用大型语言模型(LLMs)和多策略说服评分方法,指导对六种说服策略的推理。结果表明,策略导向的推理可以提高对说服力预测的准确性。为了更好地理解内容的影响,我们将Winning Argument数据集组织成广泛的讨论主题,并分析其在不同主题上的表现。我们公开发布了这个主题标注的数据集,以促进未来的研究。总体而言,我们的方法证明了结构化、策略感知提示在提高论证质量评估的可解释性和鲁棒性方面的价值。
Pareto-Grid-Guided Large Language Models for Fast and High-Quality Heuristics Design in Multi-Objective Combinatorial Optimization
Authors: Minh Hieu Ha, Hung Phan, Tung Duy Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
Venue: AAAI
First: 2025-07-28T15:26:43+00:00 · Latest: 2026-01-15T18:28:50+00:00
Comments: Accepted at AAAI-26
Abstract
Multi-objective combinatorial optimization problems (MOCOP) frequently arise in practical applications that require the simultaneous optimization of conflicting objectives. Although traditional evolutionary algorithms can be effective, they typically depend on domain knowledge and repeated parameter tuning, limiting flexibility when applied to unseen MOCOP instances. Recently, the integration of Large Language Models (LLMs) into evolutionary computation has opened new avenues for automatic heuristic generation, using their advanced language understanding and code synthesis capabilities. Nevertheless, most existing approaches focus on single-objective tasks, often neglecting key considerations such as runtime efficiency and heuristic diversity in multi-objective settings. To bridge this gap, we introduce Multi-heuristics for MOCOP via Pareto-Grid-guided Evolution of LLMs (MPaGE), a novel enhancement of the Simple Evolutionary Multiobjective Optimization (SEMO) framework that leverages LLMs and the Pareto Front Grid (PFG) technique. By partitioning the objective space into grids and retaining top-performing candidates to guide heuristic generation, MPaGE utilizes LLMs to prioritize heuristics with semantically distinct logical structures during variation, thus promoting diversity and mitigating redundancy within the population. Through extensive evaluations, MPaGE demonstrates superior performance over existing LLM-based frameworks and achieves results competitive with traditional multi-objective evolutionary algorithms (MOEAs) at significantly faster runtime. Our code is available at: https://github.com/langkhachhoha/MPaGE.
中文标题/摘要
标题:帕累托网格引导的大语言模型在多目标组合优化中的快速高质量启发式设计
多目标组合优化问题(MOCOP)在需要同时优化冲突目标的实际应用中经常出现。尽管传统的进化算法可能有效,但它们通常依赖于领域知识和重复的参数调整,这在应用于未见过的MOCOP实例时限制了灵活性。最近,将大语言模型(LLMs)集成到进化计算中为自动启发式生成开辟了新的途径,利用它们先进的语言理解和代码合成能力。然而,大多数现有方法主要集中在单目标任务上,往往忽视了多目标设置中的关键考虑,如运行时效率和启发式多样性。为弥合这一差距,我们提出了多启发式MOCOP通过帕累托网格引导的LLM进化(MPaGE),这是一种对简单多目标进化优化(SEMO)框架的新型增强,利用LLMs和帕累托前沿网格(PFG)技术。通过将目标空间划分为网格并保留表现最佳的候选者来引导启发式生成,MPaGE利用LLMs在变异过程中优先考虑具有语义上不同的逻辑结构的启发式,从而促进多样性并减少种群中的冗余。通过广泛的评估,MPaGE在现有LLM基框架中表现出更优的性能,并且在运行时显著更快,达到了传统多目标进化算法(MOEAs)的竞争力。我们的代码可在:https://github.com/langkhachhoha/MPaGE.
Summary / 总结
The paper introduces MPaGE, a method that enhances the SEMO framework by integrating LLMs and Pareto Front Grid (PFG) to generate heuristics for MOCOP. It partitions the objective space into grids and retains top-performing candidates to guide heuristic generation, promoting diversity and reducing redundancy. MPaGE outperforms existing LLM-based frameworks and achieves competitive results with traditional MOEAs, with faster runtime.
该研究提出了MPaGE方法,通过结合LLMs和PFG技术增强SEMO框架,用于生成MOCOP的启发式。它将目标空间划分为网格,并保留表现最佳的候选者来指导启发式的生成,从而促进多样性并减少冗余。MPaGE在性能上优于现有的基于LLM的框架,并以更快的运行时间取得了与传统多目标进化算法(MOEAs)相当的结果。
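The grid idea described in the MPaGE summary above (partition the objective space into cells, keep the best candidate in each) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the name `pareto_grid_select`, the number of cells per axis, and the per-cell score (sum of normalized objectives, minimization assumed) are all hypothetical choices made for the example.

```python
def pareto_grid_select(points, cells_per_axis=4):
    """Keep one representative per occupied grid cell (minimization assumed).

    points: list of objective tuples, e.g. (cost, delay).
    The per-cell score below is an illustrative assumption.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def normalize(p):
        # Scale every objective to [0, 1]; a constant objective maps to 0.
        return tuple(
            0.0 if highs[d] == lows[d] else (p[d] - lows[d]) / (highs[d] - lows[d])
            for d in range(dims)
        )

    def cell_of(norm):
        # Clamp so that a normalized value of exactly 1.0 falls in the last cell.
        return tuple(min(int(v * cells_per_axis), cells_per_axis - 1) for v in norm)

    best = {}  # cell -> (score, point)
    for p in points:
        norm = normalize(p)
        cell = cell_of(norm)
        score = sum(norm)
        if cell not in best or score < best[cell][0]:
            best[cell] = (score, p)
    return sorted(p for _, p in best.values())


candidates = [(0, 10), (1, 8), (10, 0), (6, 6)]
survivors = pareto_grid_select(candidates, cells_per_axis=2)
```

Keeping one survivor per cell rather than the globally best few is what spreads the retained candidates across the objective space; in an LLM-driven loop, the survivors would then serve as the examples that guide the next round of heuristic generation.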
Moonworks Lunara Aesthetic Dataset
Authors: Yan Wang, M M Sayeef Abdullah, Partho Hassan, Sabit Hassan
First: 2026-01-12T19:11:41+00:00 · Latest: 2026-01-15T18:27:29+00:00
Abstract
The dataset spans diverse artistic styles, including regionally grounded aesthetics from the Middle East, Northern Europe, East Asia, and South Asia, alongside general categories such as sketch and oil painting. All images are generated using the Moonworks Lunara model and intentionally crafted to embody distinct, high-quality aesthetic styles, yielding a first-of-its-kind dataset whose aesthetic scores substantially exceed those of aesthetics-focused datasets and exceed those of general-purpose datasets by an even larger margin. Each image is accompanied by a human-refined prompt and structured annotations that jointly describe salient objects, attributes, relationships, and stylistic cues. Unlike large-scale web-derived datasets that emphasize breadth over precision, the Lunara Aesthetic Dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency, and is released under the Apache 2.0 license to support research and unrestricted academic and commercial use.
中文标题/摘要
标题:Moonworks Lunara美学数据集
该数据集涵盖了多种艺术风格,包括中东、北欧、东亚和南亚等地域性美学,以及素描和油画等通用类别。所有图像均使用Moonworks Lunara模型生成,并有意设计以体现独特的高质量美学风格,从而形成一个前所未有的数据集,其美学评分显著高于专注于美学的数据集和通用数据集。每张图像都附有人工精炼的提示和结构化注释,共同描述了显著物体、属性、关系和风格线索。与侧重广度而非精确度的大型网络数据集不同,Lunara美学数据集侧重于美学质量、风格多样性和许可透明度,并在Apache 2.0许可证下发布,以支持研究和无限制的学术及商业使用。
Summary / 总结
The Moonworks Lunara Aesthetic Dataset includes diverse artistic styles from various regions and general categories like sketches and oil paintings, all generated by the Moonworks Lunara model. Each image is paired with a human-refined prompt and structured annotations, and the dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency. Its aesthetic scores are substantially higher than those of both aesthetics-focused and general-purpose datasets.
Moonworks Lunara美学数据集包含了来自不同地区的多种艺术风格和一般类别,所有图像均由Moonworks Lunara模型生成。每张图像都配有经过人类细化的提示和结构化注释,专注于美学质量、风格多样性和许可透明度。该数据集在美学评分上超过了专注于美学和通用用途的数据集,并在Apache 2.0许可证下发布,支持不受限制的研究和学术及商业使用。
Knowledge Homophily in Large Language Models
Authors: Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang
First: 2025-09-28T09:40:27+00:00 · Latest: 2026-01-15T18:26:36+00:00
Abstract
Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.
中文标题/摘要
标题:大型语言模型中的知识同质性
大型语言模型(LLMs)已被越来越多地研究作为支持知识密集型应用(如问答和事实核查)的神经知识库。然而,它们知识的结构组织尚未被探索。受认知神经科学发现的启发,例如语义聚类和启动效应,即知道一个事实会增加回忆相关事实的可能性,我们研究了LLMs中的类似知识同质性模式。为此,我们通过知识检查将LLM知识映射到图表示中,分别在三元组和实体层面进行。之后,我们分析了实体与其邻居之间的知识能力关系,发现LLMs倾向于在图中位置更近的实体具有相似的知识水平。受这一同质性原则的启发,我们提出了一种图神经网络(GNN)回归模型,通过利用其邻居得分来估计三元组的实体级知识能力得分。预测的知识能力使我们能够优先检查不太为人所知的三元组,从而在相同的标注预算下最大化知识覆盖范围。这不仅提高了对LLMs进行微调以注入知识的主动标注效率,还增强了推理密集型问答中的多跳路径检索。
Summary / 总结
This study explores the knowledge homophily pattern in Large Language Models (LLMs), inspired by cognitive neuroscience findings. By mapping LLM knowledge into a graph representation and analyzing the knowledgeability relationship between entities and their neighbors, the research finds that LLMs tend to have similar levels of knowledge about closely connected entities. Building on this homophily principle, the authors propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores. The predicted scores help prioritize less well-known triplets for labeling, which improves the efficiency of active labeling and enhances multi-hop path retrieval in reasoning-intensive question answering.
该研究通过将大型语言模型(LLMs)的知识映射到图表示,并分析实体之间的知识能力关系,探讨了知识同质性模式。提出了一种图神经网络(GNN)回归模型来估计实体级别的知识能力得分,这有助于优先标记较少为人知的三元组,从而提高主动标记的效率,并增强推理密集型问答中的多跳路径检索。
PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
First: 2026-01-15T18:25:23+00:00 · Latest: 2026-01-15T18:25:23+00:00
Abstract
Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. To address these challenges, we introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent's context and search dynamics. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.
中文标题/摘要
标题:PACEvolve:促进长期目标感知一致进化的框架
大型语言模型(LLMs)已成为进化搜索的强大操作员,然而高效的搜索支架设计仍缺乏系统方法。尽管前景广阔,当前的LLM在环系统缺乏管理进化过程的系统方法。我们识别出三种不同的失败模式:上下文污染,实验历史偏差未来候选生成;模式崩溃,由于探索与利用平衡不佳,代理在局部最小值中停滞;以及弱协作,僵化的杂交策略无法有效利用并行搜索轨迹。我们引入了目标感知一致进化(PACEvolve)框架,旨在稳健地管理代理的上下文和搜索动力学,以应对这些挑战。PACEvolve结合了分层上下文管理(HCM)与剪枝以解决上下文污染;基于动量的回溯(MBB)以摆脱局部最小值;以及一种自适应采样策略,统一了回溯和杂交以实现动态搜索协调(CE),使代理能够平衡内部细化与跨轨迹协作。我们证明,PACEvolve提供了一条系统的方法,以实现一致的长期自我改进,其在LLM-SR和KernelBench上取得了最先进的结果,同时发现超越Modded NanoGPT记录的解决方案。
Summary / 总结
The research addresses the challenges in using Large Language Models (LLMs) for evolutionary search, such as context pollution, mode collapse, and weak collaboration. It introduces PACEvolve, a framework that uses hierarchical context management, momentum-based backtracking, and a self-adaptive sampling policy to mitigate these issues. PACEvolve achieves state-of-the-art results on LLM-SR and KernelBench and discovers solutions surpassing the record on Modded NanoGPT.
论文通过引入PACEvolve框架解决了使用大型语言模型(LLMs)进行进化搜索时面临的挑战,该框架旨在解决上下文污染、局部最小值陷阱和协作不足的问题。PACEvolve结合了层次上下文管理、动量回溯和自适应采样策略来改进搜索动态。实验表明,PACEvolve在LLM-SR和KernelBench上达到了最先进的结果,并发现了超越先前记录的Modded NanoGPT解决方案。
PMOA-TTS: Introducing the PubMed Open Access Textual Time Series Corpus
Authors: Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
First: 2025-05-23T18:01:09+00:00 · Latest: 2026-01-15T18:18:24+00:00
Abstract
Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
中文标题/摘要
标题:PMOA-TTS:介绍PubMed开放获取文本时间序列语料库
临床病历记录了对于建模患者轨迹至关重要的时间动态,但大规模的时间标注资源稀缺。我们介绍了PMOA-TTS语料库,该语料库包含124,699份单个患者的PubMed开放获取病例报告,通过可扩展的大语言模型管道(Llama 3.3 70B和DeepSeek-R1)转换为结构化的文本时间线(事件,时间)对。该语料库包含超过560万条带时间戳的事件,以及提取的统计数据和诊断信息。技术验证使用了临床专家审核的金标准集和三种度量:语义事件匹配、时间一致性(c-指数)和通过对数时间累积分布函数下的面积(AULTC)总结的对齐误差。我们对替代提示和模型选择进行了基准测试,并提供了支持再现的文档。PMOA-TTS 使从叙述文本中提取时间线、时间推理、生存建模和事件预测的研究成为可能,并提供了广泛的诊断和人口统计学覆盖范围。数据和代码在公共存储库中公开可用。
Summary / 总结
The research aims to address the scarcity of large-scale temporally annotated clinical narrative resources by introducing PMOA-TTS, a corpus of 124,699 PubMed Open Access case reports converted into structured timelines using a scalable large-language-model pipeline. The corpus includes over 5.6 million timestamped events, demographics, and diagnoses. Technical validation was conducted using a clinician-curated gold set and three measures: semantic event matching, temporal concordance, and alignment error. The study benchmarks alternative prompting and model choices and provides documentation for reproducibility. PMOA-TTS supports research on timeline extraction, temporal reasoning, survival modeling, and event forecasting from narrative text, with broad diagnostic and demographic coverage.
研究旨在通过引入PMOA-TTS语料库来解决大规模时间标注临床叙事资源稀缺的问题,该语料库包含124,699篇PubMed开放获取病例报告,通过可扩展的大语言模型管道转换为结构化的时间线。语料库包括超过560万个带时间戳的事件、人口统计信息和诊断信息。技术验证使用了临床专家标注的金标准集和三个指标:语义事件匹配、时间一致性(c-index)和对齐误差。研究还对不同的提示和模型选择进行了基准测试,并提供了可重复性的文档。PMOA-TTS支持从叙事文本中进行时间线提取、时间推理、生存建模和事件预测的研究,具有广泛的诊断和人口统计学覆盖范围。
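The temporal concordance (c-index) measure named in the PMOA-TTS validation can be illustrated with a small pairwise-ordering check. This is a sketch of the general c-index idea only: the function name, the treatment of ties (half credit for predicted ties, reference ties skipped), and the use of hours as the time unit are assumptions for the example, not details taken from the paper.

```python
from itertools import combinations


def concordance_index(true_times, pred_times):
    """Fraction of comparable event pairs whose predicted order matches the reference.

    Pairs with equal reference times are skipped; pairs with equal predicted
    times receive half credit (an assumed but common convention).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(true_times)), 2):
        if true_times[i] == true_times[j]:
            continue  # reference order is undefined for this pair
        comparable += 1
        if pred_times[i] == pred_times[j]:
            concordant += 0.5
        elif (pred_times[i] < pred_times[j]) == (true_times[i] < true_times[j]):
            concordant += 1.0
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical timeline: event times in hours relative to admission.
reference = [0, 24, 48]
extracted = [0, 12, 72]
score = concordance_index(reference, extracted)
```

A score of 1.0 means every pair of events is ordered as in the reference even when the absolute timestamps differ, which is why an ordering measure like this is typically paired with a separate alignment-error measure.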
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Authors: Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave
First: 2026-01-15T18:15:06+00:00 · Latest: 2026-01-15T18:15:06+00:00
Abstract
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available at https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural
中文标题/摘要
标题:CURVE:一种文化与多语言长视频推理基准
近期视频模型的发展取得了巨大的进步,特别是在长视频理解方面。然而,当前的基准测试主要以西方为中心的数据和英语为主,引入了评估中的显著偏差。为了解决这一问题,我们引入了CURVE(视频评价中的文化理解和推理),这是一个针对多文化和多语言视频推理的具有挑战性的基准测试。CURVE 包含来自18个全球地区的高质量、完全由人类生成的、针对特定区域文化的视频注释。与以往依赖自动翻译的工作不同,CURVE 提供了复杂的问题、答案和多步推理步骤,所有这些都用当地语言精心制作。要在CURVE上取得进展需要对视觉文化背景有深入的理解。此外,我们利用CURVE的推理痕迹构建基于证据的图表,并提出了一种使用这些图表的新型迭代策略,以识别推理中的细微错误。我们的评估表明,当前最先进的视频大模型面临巨大挑战,表现远低于人类水平的准确率,错误主要来自对文化元素的视觉感知。CURVE 将在 https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural 公开可用。
Summary / 总结
The research aims to address the bias in current video understanding benchmarks by introducing CURVE, a new benchmark for multicultural and multilingual video reasoning. CURVE includes high-quality, human-generated annotations from diverse cultural videos across 18 global locales, providing complex questions and answers in native languages. The study finds that state-of-the-art video language models perform poorly on CURVE, with errors mainly due to difficulties in visual perception of cultural elements. The authors propose an iterative strategy using reasoning traces to identify and correct these errors.
研究旨在通过引入CURVE这一新的多文化多语言视频推理基准,解决当前视频理解基准中的偏见问题。CURVE包含来自18个全球地区的高质量、人类生成的注释,涵盖多样化的文化视频,并以本地语言提供复杂的问题和答案。研究发现,最先进的视频语言模型表现不佳,主要错误源于对文化元素的视觉感知。该基准利用推理痕迹来识别具体错误并改进模型。该基准可在https://github.com/google-deepmind/neptune#minerva-cultural获取。
Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
Authors: Yuxi Xia, Loris Schoenegger, Benjamin Roth
First: 2026-01-15T18:05:42+00:00 · Latest: 2026-01-15T18:05:42+00:00
Abstract
Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.
中文标题/摘要
标题:有影响力的训练数据检索以解释LLMs的口头化置信度
大型语言模型(LLMs)可以通过口头化其输出的置信度来增加用户的信任感。然而,先前的研究表明,LLMs往往过于自信,使得它们声明的置信度不可靠,因为它并不总是与事实准确性一致。为了更好地理解这种口头化置信度的来源,我们引入了TracVC(追踪口头化置信度)方法,该方法基于信息检索和影响估计,将生成的置信度表达追溯回训练数据。我们在OLMo和Llama模型上以问答设置评估了TracVC,提出了一个新的度量标准,内容相关性,该度量标准衡量LLM在其置信度中基于与问题和答案相关的内容相关训练示例的程度,而不是基于通用的置信度口头化示例。我们的分析表明,OLMo2-13B经常受到与查询在词汇上无关的置信相关数据的影响,这表明它可能模仿表面的确定性语言表达,而不是依赖于真实的内容基础。这些发现指出了当前训练制度的一个根本局限性:LLMs可能学会如何显得自信,而不知道何时置信是合理的。我们的分析为提高LLMs表达更可靠置信度的可信度奠定了基础。
Summary / 总结
The research aims to improve the reliability of large language models' (LLMs) verbalized confidence by tracing back their confidence expressions to the training data. TracVC, a method combining information retrieval and influence estimation, is introduced to analyze the sources of LLMs' confidence. The study evaluates TracVC on OLMo and Llama models and introduces a new metric, content groundness, to measure the extent to which LLMs ground their confidence in relevant training examples. The findings suggest that OLMo2-13B often relies on lexically unrelated confidence-related data, indicating a potential limitation in current training regimes where LLMs may learn superficial confidence expressions without genuine content grounding. This highlights the need for better training to ensure LLMs express more reliable confidence.
研究旨在通过理解大型语言模型(LLMs)口头表达自信的来源来增强其可信度。TracVC 方法结合了信息检索和影响估计,将自信表达追溯到训练数据。对OLMo和Llama模型的评估显示,OLMo2-13B 经常基于与查询无关的数据来表达自信,表明缺乏真正的内容支撑。这表明LLMs 可能学会了表达表面自信,而不是基于合理的自信,突显了当前训练机制中的一个根本局限性。
Adjusted Similarity Measures and a Violation of Expectations
Authors: William L. Lippitt, Edward J. Bedrick, Nichole E. Carlson
First: 2026-01-15T18:01:26+00:00 · Latest: 2026-01-15T18:01:26+00:00
Comments: 12 pages, 1 figure
Abstract
Adjusted similarity measures, such as Cohen's kappa for inter-rater reliability and the adjusted Rand index used to compare clustering algorithms, are a vital tool for comparing discrete labellings. These measures are intended to have the property of 0 expectation under a null distribution and maximum value 1 under maximal similarity to aid in interpretation. Measures are frequently adjusted with respect to the permutation distribution for historic and analytic reasons. There is currently renewed interest in considering other null models more appropriate for context, such as clustering ensembles permitting a random number of identified clusters. The purpose of this work is twofold: (1) to generalize the study of the adjustment operator to general null models and to a more general procedure which includes statistical standardization as a special case, and (2) to identify sufficient conditions for the adjustment operator to produce the intended properties, where sufficient conditions are related to whether and how observed data are incorporated into null distributions. We demonstrate how violations of the sufficient conditions may lead to substantial breakdown, such as by producing a non-positive measure under traditional adjustment rather than one with mean 0, or by producing a measure which is deterministically 0 under statistical standardization.
中文标题/摘要
标题:调整相似度度量和违反预期
调整相似度度量,如用于评定者间可靠性的科恩κ系数和用于比较聚类算法的调整兰德指数,是对比离散标签的重要工具。这些度量旨在具有在零分布下的期望值为0和在最大相似度下的最大值为1的特性,以利于解释。这些度量通常出于历史和分析原因而进行调整。目前,人们重新对考虑更合适的零模型(如允许随机数量的识别簇的聚类集成)产生了兴趣。本文的目的有两个:(1)将调整操作的研究推广到一般零模型和更广泛的程序,其中统计标准化是特殊情况之一;(2)确定调整操作产生预期特性的充分条件,其中充分条件与观测数据如何被纳入零分布相关。我们展示了充分条件的违反可能导致重大问题,例如在传统调整中产生非正度量而不是均值为0的度量,或者在统计标准化中产生确定为0的度量。
Summary / 总结
This study aims to generalize the adjustment operator for similarity measures to accommodate various null models and to identify conditions under which these measures maintain their intended properties. The research demonstrates that violations of these conditions can lead to significant issues, such as producing non-positive measures under traditional adjustments or deterministic zeros under statistical standardization.
研究旨在将相似性度量的调整操作推广到多种null模型,并确定这些条件以保持度量的预期属性。研究显示,违反这些条件会导致严重问题,例如在传统调整下产生非正度量,或在统计标准化下产生确定为零的度量。
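The adjustment discussed in this entry follows the familiar pattern (observed - expected) / (maximum - expected), which is meant to give 0 in expectation under the null and 1 at maximal similarity. Cohen's kappa, named in the abstract, is the standard instance with raw agreement as the measure and 1 as the maximum. The sketch below computes it in plain Python; the function name is illustrative, and the example is a textbook kappa, not the generalized operator the paper studies.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-adjusted agreement between two labellings of the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("labellings must be non-empty and of equal length")
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, using each rater's observed marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


perfect = cohens_kappa([0, 0, 1, 1], [0, 0, 1, 1])
chance_level = cohens_kappa([0, 0, 1, 1], [0, 1, 0, 1])
```

Note that `expected` is computed from the observed label counts, so the null model here depends on the data being scored. That appears to be the kind of dependence the abstract's sufficient conditions are concerned with, although the paper's exact conditions are not stated in this digest.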
STEM: Scaling Transformers with Embedding Modules
Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
First: 2026-01-15T18:00:27+00:00 · Latest: 2026-01-15T18:00:27+00:00
Abstract
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread, which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers overall accuracy improvements of up to roughly 3-4%, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability, and improved efficiency.
Summary / 总结
STEM introduces a static, token-indexed approach to scaling Transformers by replacing the FFN up-projection with a layer-local embedding lookup, which removes runtime routing and enables CPU offload. Experiments show that STEM trains stably with extreme sparsity, reduces per-token FLOPs and parameter accesses, and improves downstream performance by up to 4% on various benchmarks. Additionally, STEM enhances interpretability through its token-indexed embeddings, allowing for simple knowledge editing and injection. It also scales test-time capacity with sequence length, particularly benefiting knowledge and reasoning-heavy tasks.
STEM 通过将 FFN 上投影替换为层局部嵌入查找,同时保持门和下投影密集,引入了一种静态的、基于标记的扩展方法。这种方法消除了运行时路由,允许 CPU 卸载,并将容量与每个标记的 FLOPs 和跨设备通信解耦。实验表明,STEM 在极端稀疏性下训练稳定,提高下游性能,减少每个标记的 FLOPs 和参数访问次数,并通过增强的知识存储容量提供更好的可解释性。在 350M 和 1B 模型规模下,STEM 在知识和推理密集型基准测试上实现了高达 4% 的准确性改进。
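The core substitution in the STEM entry (a table lookup indexed by token id in place of the dense up-projection, with the gate and down-projection left dense) can be sketched in NumPy. This is an illustration under assumptions rather than the paper's code: the SiLU activation, the gated-FFN form, the shapes, and all names (`token_indexed_ffn`, `up_table`) are hypothetical choices for the example.

```python
import numpy as np


def silu(x):
    """SiLU activation (an assumed choice; the abstract does not name one)."""
    return x / (1.0 + np.exp(-x))


def dense_gated_ffn(x, w_gate, w_up, w_down):
    """Conventional gated FFN: the up-projection is a dense matmul."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down


def token_indexed_ffn(x, token_ids, w_gate, up_table, w_down):
    """Same block, but the up-projection is replaced by a per-token table lookup."""
    up = up_table[token_ids]  # shape (seq_len, d_ff); no matmul and no router
    return (silu(x @ w_gate) * up) @ w_down


rng = np.random.default_rng(0)
seq_len, d_model, d_ff, vocab = 3, 4, 8, 10
x = rng.normal(size=(seq_len, d_model))
w_gate = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))
up_table = rng.normal(size=(vocab, d_ff))  # one row per vocabulary entry, per layer
token_ids = np.array([1, 5, 7])
out = token_indexed_ffn(x, token_ids, w_gate, up_table, w_down)
```

Because the rows to fetch are known from the token ids alone, the lookup needs no learned routing decision at run time, which is consistent with the abstract's points about removing runtime routing and allowing prefetch. Only the rows for tokens present in the sequence are touched, so longer sequences with more distinct tokens touch more distinct parameters.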
Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
First: 2025-10-01T20:32:08+00:00 · Latest: 2026-01-15T17:51:14+00:00
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.
中文标题/摘要
标题:基于双重不确定性指导的多模态推理策略学习
具有可验证奖励的强化学习(RLVR)在多模态大型语言模型中提升了推理能力。然而,现有方法通常将视觉输入视为确定性的,忽视了视觉模态固有的感知模糊性。因此,它们无法区分模型的不确定性是源于复杂的推理还是模糊的感知,从而无法有针对性地分配探索或学习信号。为解决这一问题,我们引入了DUPL,这是一种用于多模态RLVR的双重不确定性指导策略学习方法,通过对称KL散度量化和利用感知不确定性(感知不确定性)以及通过策略熵量化和利用输出不确定性(输出不确定性)来指导策略更新。通过建立一个以不确定性驱动的反馈循环并采用动态分支优先机制,DUPL重新校准策略优势,使其专注于具有高感知或决策模糊性的状态,从而实现有效的目标探索,超越被动的数据增强。DUPL基于GRPO实现,并在六个多模态数学和通用领域推理基准测试上进行评估,提高了Qwen2.5-VL 3B和7B模型的准确性,视觉数学任务上的准确率提升高达11.2%,通用领域推理任务上的准确率提升高达7.1%,并且始终优于GRPO。这些结果表明,双重不确定性指导的策略学习是一种有效且可泛化的多模态RLVR方法。
Summary / 总结
The research aims to enhance the reasoning capabilities of multimodal large language models in reinforcement learning by addressing the issue of perceptual ambiguity in visual inputs. DUPL, a dual-uncertainty guided policy learning approach, quantifies perceptual and output uncertainties to guide policy updates. This method improves Qwen2.5-VL 3B and 7B models, achieving up to 11.2% accuracy gains on visual math tasks and up to 7.1% on general-domain reasoning tasks, outperforming GRPO across six benchmarks.
研究旨在通过解决现有方法将视觉输入视为确定性的局限性,提升多模态大型语言模型的推理能力。DUPL是一种双重不确定性引导的策略学习方法,量化感知和输出不确定性以指导策略更新,重点关注高不确定性的状态。该方法在视觉数学任务中将Qwen2.5-VL 3B和7B模型的准确率提高了最多11.2%,在一般领域推理任务中提高了最多7.1%,并在六个基准测试中优于GRPO。
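The two uncertainty signals named in the DUPL entry, symmetric KL divergence and policy entropy, are standard quantities and can be computed as below. How DUPL obtains the two distributions being compared and how it folds the signals into the policy advantage is not specified in this digest; comparing output distributions under an original and a perturbed image is an assumption made only for the usage example.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def symmetric_kl(p, q):
    """Symmetrized KL: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Hypothetical next-token distributions for the same prompt under an
# original image and a perturbed copy of it (assumed setup).
original_view = [0.7, 0.2, 0.1]
perturbed_view = [0.4, 0.4, 0.2]
perceptual_uncertainty = symmetric_kl(original_view, perturbed_view)
output_uncertainty = entropy(original_view)
```

A large symmetric KL indicates that the answer distribution is sensitive to how the image is presented, while a high entropy indicates that the model is undecided even for a fixed input; keeping the two separate is what lets a method tell perceptual ambiguity apart from reasoning difficulty.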
On the Failure of Latent State Persistence in Large Language Models
Authors: Jen-tse Huang, Kaiser Sun, Wenxuan Wang, Mark Dredze
First: 2025-04-30T16:18:39+00:00 · Latest: 2026-01-15T17:44:56+00:00
Comments: 8 pages, 6 figures, 9 tables
Abstract
While Large Language Models (LLMs) excel in reasoning, whether they can sustain persistent latent states remains under-explored. The capacity to maintain and manipulate unexpressed, internal representations, analogous to human working memory, is a cornerstone of complex reasoning. In this paper, we formalize and quantify the "Latent State Persistence" (LSP) gap through three novel experiments. First, we utilize a Number Guessing Game, demonstrating that across independent queries, LLMs fail to allocate probability mass to a singular hidden choice, violating a fundamental probabilistic principle. Second, we employ a Yes-No Game to show that as the number of questions increases, LLMs suffer from "concept drift," leading to inevitable self-contradictions due to the lack of LSP. Finally, inspired by Mathematical Mentalism, we task models with tracking transformations on hidden variables, revealing a failure in variable binding and state evolution when the initial state is not explicitly present in the context. Collectively, these findings suggest that LLMs function as reactive post-hoc solvers rather than proactive planners with LSP. Our work provides a framework for evaluating the fidelity of internal representations and highlights a fundamental architectural divergence between autoregressive transformers and human-like cognition.
中文标题/摘要
标题:关于大型语言模型中潜在状态持久性的失败
虽然大型语言模型(LLMs)在推理方面表现出色,但它们能否维持持久的潜在状态仍是一个未被充分探索的问题。维持和操作未表达的内部表示——类似于人类的工作记忆——是复杂推理的基础。在本文中,我们通过三个新的实验正式化并量化了“潜在状态持久性”(LSP)差距。首先,我们使用一个数字猜谜游戏,证明LLMs在独立查询中无法将概率质量分配给单一隐藏选择,违反了基本的概率原则。其次,我们使用一个是/否游戏来展示随着问题数量的增加,LLMs会遭受“概念漂移”,导致由于缺乏LSP而不可避免地产生自相矛盾。最后,受数学心灵主义的启发,我们要求模型跟踪隐藏变量的变换,揭示了当初始状态未明确出现在上下文中时,变量绑定和状态演变的失败。这些发现共同表明,LLMs作为反应性的事后解决者而非具有LSP的前瞻性规划者运作。我们的工作提供了一个评估内部表示保真的框架,并突显了自回归变换器与人类认知之间基本的架构差异。
Summary / 总结
This paper investigates the capability of Large Language Models (LLMs) to maintain persistent latent states, which is crucial for complex reasoning. Through three experiments (a Number Guessing Game, a Yes-No Game, and a task that tracks transformations on hidden variables), the study reveals that LLMs fail to sustain internal representations across queries, leading to probabilistic inconsistencies and self-contradictions. The findings suggest that LLMs operate more as reactive solvers than proactive planners with persistent internal states.
本文研究了大型语言模型(LLMs)维持持久潜状态的能力,这对于复杂的推理至关重要。通过三个实验——数字猜谜游戏、是或否游戏和数学心灵主义任务,作者展示了LLMs无法在查询之间维持内部表示、表现出概念漂移,并且在处理隐藏变量的变换时存在困难,表明它们更像是反应性的后验求解器而非前瞻性的规划者。这些发现突显了LLMs在维持类似人类工作记忆的内部状态方面存在显著差距。
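One way to read the probabilistic principle behind the Number Guessing Game is that if a single hidden number has really been chosen, the chances of answering "yes" to "Is it k?" should sum to 1 across all candidates k. The sketch below checks that with tallied answers. The protocol (independent repeated queries per candidate, counting "yes" answers) and every name in the code are assumptions made for illustration, not the paper's exact procedure.

```python
def yes_probability_mass(yes_counts, trials_per_candidate):
    """Estimate P("yes" | "Is it k?") per candidate and return the total mass.

    yes_counts: dict mapping each candidate number to how often the model
    answered "yes" across independent queries (hypothetical tallies).
    """
    if trials_per_candidate <= 0:
        raise ValueError("trials_per_candidate must be positive")
    probs = {k: count / trials_per_candidate for k, count in yes_counts.items()}
    return sum(probs.values()), probs


# Hypothetical tallies: 100 independent queries per candidate in 1..5.
tallies = {1: 0, 2: 1, 3: 0, 4: 2, 5: 0}
total_mass, per_candidate = yes_probability_mass(tallies, trials_per_candidate=100)
```

With these made-up tallies the total is far below 1, which is the shape of failure the abstract describes: no candidate, and no combination of candidates, accounts for the hidden choice.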
Can LLMs Understand What We Cannot Say? Measuring Multilevel Alignment Through Abortion Stigma Across Cognitive, Interpersonal, and Structural Levels
Authors: Anika Sharma, Malavika Mampally, Chidaksh Ravuru, Kandyce Brennan, Neil Gaikwad
First: 2025-12-15T09:50:00+00:00 · Latest: 2026-01-15T17:43:09+00:00
Abstract
As Large Language Models (LLMs) increasingly mediate stigmatized health decisions, their capacity to understand complex psychological phenomena remains inadequately assessed. Can LLMs understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across cognitive, interpersonal, and structural levels. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS), examining representation at cognitive (self-judgment), interpersonal (worries about judgment and isolation), and structural (community condemnation and disclosure patterns) levels. Models fail tests of genuine understanding across all dimensions. They underestimate cognitive stigma while overestimating interpersonal stigma, introduce demographic biases assigning higher stigma to younger, less educated, and non-White personas, and treat secrecy as universal despite 36% of humans reporting openness. Most critically, models produce internal contradictions: they overestimate isolation yet predict isolated individuals are less secretive, revealing incoherent representations. These patterns show current alignment approaches ensure appropriate language but not coherent understanding across levels. This work provides empirical evidence that LLMs lack coherent understanding of psychological constructs operating across multiple dimensions. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.
中文标题/摘要
标题:大型语言模型能否理解我们无法表达的内容?通过堕胎污名在认知、人际和结构层面的测量来考察多层面一致性
随着大型语言模型(LLMs)越来越多地参与带有污名的健康决策,它们理解复杂心理现象的能力仍未得到充分评估。LLMs能否理解我们无法言说的内容?我们研究LLMs是否在认知、人际和结构层面一致地表征堕胎污名。我们使用经过验证的个体层面堕胎污名量表(ILAS),在5种主流LLM上系统地测试了627个人口统计特征各异的人格,考察了认知(自我评判)、人际(担心被评判和孤立)和结构(社区谴责和披露模式)层面的表征。模型在所有维度上都未能通过真正理解的测试。它们低估了认知污名,而高估了人际污名,引入了人口统计学偏见,将更高污名赋予年轻、教育程度较低和非白人的人格,并将保密视为普遍现象,尽管36%的人报告了开放性。最关键的是,模型产生了内部矛盾:它们高估了孤立,却预测孤立个体更不保密,揭示了不一致的表征。这些模式表明当前的对齐方法确保了适当的语言,但没有确保跨层面的一致理解。本研究提供了实证证据,证明LLMs对跨多个维度运作的心理构念缺乏一致的理解。高风险情境下的AI安全需要新的设计(多层面一致性)、评估(持续审计)、治理和监管(强制审计、问责、部署限制)方法,并提升AI素养,尤其是在那些能否理解人们无法言说的内容决定了支持是有益还是有害的领域。
Summary / 总结
The study investigates whether Large Language Models (LLMs) can understand complex psychological phenomena like abortion stigma across cognitive, interpersonal, and structural levels. Using the Individual Level Abortion Stigma Scale (ILAS), the research tested 627 personas across five LLMs and found that models fail to demonstrate genuine understanding, underestimating cognitive stigma and overestimating interpersonal stigma. They also introduce demographic biases and produce internal contradictions, indicating a lack of coherent understanding across levels. This suggests current alignment approaches are insufficient for ensuring comprehensive understanding in high-stakes contexts.
本研究评估了大型语言模型(LLMs)是否能够理解复杂的心理现象,特别是跨认知、人际和结构层面的堕胎污名。通过使用个体水平堕胎污名量表,研究测试了五个领先LLM的627个不同背景的人格,并发现模型未能展示真正的理解,低估了认知污名,高估了人际污名,并引入了人口统计学偏见。模型还产生了内部矛盾,揭示了污名的不连贯表示。这表明当前的对齐方法确保了适当的语言表达,但没有跨层面的连贯理解,强调了在高风险领域中需要新的AI设计、评估、治理和监管方法,以及AI素养的提升,特别是在理解人们无法表达的内容时,这决定了支持是否有助于或伤害了他们。
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:显式弃权开关以实现视频问答中的可预测可靠性
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权而不是冒高成本错误的风险。我们研究了基于信心的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们得出两项发现。首先,置信度阈值化在同分布下提供了机制控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率
Summary / 总结
The research aims to improve the reliability of vision-language models in high-stakes applications by enabling them to abstain from making predictions when uncertain. The study investigates whether confidence-based abstention gives reliable control over error rates in video question answering, and whether that control survives distribution shift. The first reported finding is that confidence thresholding provides mechanistic control in-distribution: sweeping the threshold epsilon yields smooth risk-coverage tradeoffs.
研究旨在通过使视觉-语言模型能够在高风险应用场景中进行选择性预测来提高其可靠性。研究探讨了基于置信度的弃权能否可靠地控制视频问答中的错误率,以及这种控制在分布偏移下是否保持稳健。第一项发现表明,置信度阈值在同分布条件下提供了对错误率的控制,随着阈值ε的变化,风险-覆盖率曲线呈现出平滑的权衡关系。
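The threshold sweep described in this entry can be illustrated with a minimal sketch. It assumes each answer comes with a scalar confidence and a correctness label; the function name `risk_coverage` and the toy inputs are hypothetical and are not taken from the paper, which may compute its curves differently.

```python
from typing import List, Sequence, Tuple


def risk_coverage(
    confidences: Sequence[float],
    correct: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, coverage, risk) for each abstention threshold.

    The system answers only when confidence >= threshold and abstains
    otherwise. Coverage is the fraction of questions answered; risk is
    the error rate among the answered questions.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    curve = []
    for eps in thresholds:
        answered = [ok for conf, ok in zip(confidences, correct) if conf >= eps]
        coverage = len(answered) / n if n else 0.0
        # Convention (an assumption): risk is 0.0 when nothing is answered.
        risk = 1.0 - sum(answered) / len(answered) if answered else 0.0
        curve.append((eps, coverage, risk))
    return curve


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    confs = [0.9, 0.8, 0.6, 0.4]
    labels = [True, True, False, False]
    for eps, cov, risk in risk_coverage(confs, labels, [0.0, 0.5, 0.7]):
        print(f"eps={eps:.1f} coverage={cov:.2f} risk={risk:.2f}")
```

Raising the threshold trades coverage for lower risk on this toy data. Whether that tradeoff persists under distribution shift is the question the paper studies, and the sketch does not address it.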
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当前最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,有效从中提炼,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是通过像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最先进的,并展示了在单图像、多图像和视频任务中出色的基于指针的定位新能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的数据训练配方,并展示了在视觉标记上进行双向注意以及一种新的标记权重策略可以提高性能。我们最好的8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并在某些任务上超越了私有模型如Gemini 3 Pro(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)。
Summary / 总结
The paper introduces Molmo2, a new family of open-source vision-language models that outperform other open-source models in video understanding and grounding tasks. The authors provide 9 new datasets, including video captions, Q&A, object tracking, and pointing datasets, and a training recipe that includes efficient packing and message-tree encoding. The best-in-class 8B model excels in short video tasks and is competitive on long videos. Molmo2 significantly outperforms existing open-source models and even surpasses some proprietary models in video-grounding tasks.
论文介绍了Molmo2,这是一种新的开源视觉-语言模型,其在视频理解和定位任务上优于其他开源模型。作者提供了9个新数据集,包括视频字幕、问答、物体跟踪和指针数据集,并提供了一种高效的打包和消息树编码的训练方法。最好的8B模型在短视频任务上表现出色,并且在长视频任务上具有竞争力。Molmo2在视频定位任务上显著优于现有的开源模型,甚至在某些任务上超过了部分专有模型。
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
First: 2026-01-13T17:48:43+00:00 · Latest: 2026-01-15T17:24:46+00:00
Comments: Work in Progress
Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
中文标题/摘要
标题:奖励稀有性:面向创造性问题解决的LLMs独特性意识强化学习
强化学习(RL)已成为后训练大型语言模型(LLMs)的核心范式,特别是在复杂推理任务中,但它经常遭受探索崩溃的问题:策略过早地集中在一小套主导的推理模式上,提高了pass@1,但限制了rollout级别的多样性以及pass@k的收益。我们认为这种失败源于对局部token行为的正则化,而不是对解决方案集的多样性。为了解决这个问题,我们提出了独特性意识强化学习,这是一种rollout级别的目标,明确奖励那些表现出罕见高级策略的正确解决方案。该方法使用基于LLM的裁判将相同问题的rollout根据其高级解决方案策略聚类,忽略表面差异,并根据集群大小反向重新加权策略优势。因此,正确但新颖的策略比冗余策略获得更高的奖励。在数学、物理和医学推理基准测试中,我们的方法在大规模采样预算下始终如一地提高了pass@k,并增加了pass@k曲线下的面积(AUC@K),同时不牺牲pass@1,同时保持探索并揭示更多多样化的解决方案策略。
Summary / 总结
The paper addresses the issue of exploration collapse in reinforcement learning for large language models, where policies tend to focus on a few dominant reasoning patterns. It introduces Uniqueness-Aware Reinforcement Learning, which rewards rare high-level strategies to promote diversity. The method uses an LLM-based judge to cluster rollouts based on high-level solution strategies and reweights policy advantages inversely with cluster size. Experiments show consistent improvements in pass@k across various reasoning benchmarks, increasing the AUC@K without sacrificing pass@1, and uncovering more diverse solution strategies.
论文针对大型语言模型强化学习中的探索崩溃问题,即策略倾向于聚焦于少数主导推理模式,提出了Uniqueness-Aware强化学习方法,通过奖励罕见的高层策略来促进多样性。该方法使用基于LLM的评判器根据高层解题策略对rollout进行聚类,并按聚类大小的倒数重新加权策略优势。实验结果显示,该方法在数学、物理和医学推理基准测试中提高了pass@k,增加了AUC@K,同时没有牺牲pass@1,并且在大规模采样预算下发现了更多多样化的解题策略。
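The reweighting step in this entry, in which advantages are scaled inversely with cluster size, can be sketched as follows. This is an illustrative assumption rather than the authors' implementation: the abstract does not give the exact weighting function, any normalization, or how incorrect rollouts are treated. The function name `reweight_advantages` is hypothetical, and the cluster labels stand in for the output of the LLM-based judge.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def reweight_advantages(
    advantages: Sequence[float],
    cluster_ids: Sequence[Hashable],
) -> List[float]:
    """Scale each rollout's advantage by 1 / (size of its strategy cluster).

    cluster_ids[i] is the strategy cluster assigned to rollout i (in the
    paper this comes from an LLM-based judge; here it is simply an input).
    Assumption: plain 1/size weighting applied to every rollout.
    """
    if len(advantages) != len(cluster_ids):
        raise ValueError("advantages and cluster_ids must have the same length")
    sizes = Counter(cluster_ids)
    return [adv / sizes[cid] for adv, cid in zip(advantages, cluster_ids)]


if __name__ == "__main__":
    # Four rollouts for one problem: two share strategy "A", one uses "B",
    # and one (incorrect, negative advantage) uses "C". Toy values only.
    advantages = [1.0, 1.0, 1.0, -1.0]
    clusters = ["A", "A", "B", "C"]
    print(reweight_advantages(advantages, clusters))
```

In this toy case the lone correct rollout in cluster "B" keeps its full advantage while the two redundant "A" rollouts are halved, which matches the abstract's statement that correct but novel strategies receive higher rewards than redundant ones.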
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
Authors: Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
First: 2025-03-21T20:13:04+00:00 · Latest: 2026-01-15T17:21:57+00:00
Comments: Nature Communications
Abstract
Large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs must construct representations of the world and form probabilistic beliefs about them. To provide personalized recommendations, for example, the LLM needs to infer a user's preferences from their behavior over multiple interactions. The Bayesian inference framework lays out the optimal way for an agent to update its beliefs as it receives new information. We first show that LLMs fall far short of the standard defined by the Bayesian framework. We then show that by teaching LLMs to mimic the predictions of the normative Bayesian model, we can dramatically improve their ability to update their beliefs; this ability generalizes to new tasks. We conclude that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
中文标题/摘要
标题:贝叶斯教学使大型语言模型具备概率推理能力
大型语言模型(LLMs)越来越多地被用作与用户和世界互动的代理。为了成功地做到这一点,LLMs 必须构建对世界的表示,并形成关于它们的概率性信念。例如,为了提供个性化的推荐,LLM 需要从用户在多次交互中的行为中推断出用户的偏好。贝叶斯推理框架列出了代理在接收到新信息时更新其信念的最佳方式。我们首先表明,LLMs 在达到贝叶斯框架设定的标准方面远远不够。然后我们表明,通过教导 LLMs 模仿规范性贝叶斯模型的预测,我们可以显著提高它们更新信念的能力;这种能力可以泛化到新的任务。我们得出结论,LLMs 可以从示例中有效学习推理技能,并将这些技能泛化到新的领域。
Summary / 总结
The research aims to enhance the probabilistic reasoning capabilities of large language models (LLMs) by aligning them with Bayesian inference principles. The study demonstrates that LLMs currently lack the ability to update their beliefs effectively based on new information. By training LLMs to mimic the predictions of a normative Bayesian model, the researchers significantly improved the models' belief updating abilities, which generalized across different tasks. This suggests that LLMs can learn reasoning skills from examples and apply them to new domains.
研究旨在通过使大型语言模型(LLMs)符合贝叶斯推理原则来增强其概率推理能力。方法是通过训练LLMs模仿规范的贝叶斯模型的预测,从而显著提高它们根据新信息更新信念的能力。关键发现包括,这种方法使LLMs能够在新任务和领域中更好地泛化其推理技能。
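The belief update that serves as the normative standard in this entry is Bayes' rule: the posterior is proportional to the prior times the likelihood of the new observation. The sketch below is a minimal illustration over a discrete set of hypotheses about a user's preference. The hypothesis names and likelihood values are invented for the example and do not come from the paper.

```python
from typing import Dict


def bayes_update(
    prior: Dict[str, float],
    likelihood: Dict[str, float],
) -> Dict[str, float]:
    """Return the posterior P(h | observation) for each hypothesis h.

    prior[h] is P(h) and likelihood[h] is P(observation | h).
    The posterior is prior * likelihood, renormalized to sum to 1.
    """
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / evidence for h, p in unnormalized.items()}


if __name__ == "__main__":
    # Hypothetical setup: does the user prefer cheap or fast options?
    belief = {"prefers_cheap": 0.5, "prefers_fast": 0.5}
    # Assumed probability of the user picking the cheaper option under each hypothesis.
    picks_cheaper = {"prefers_cheap": 0.8, "prefers_fast": 0.2}
    for step in range(2):  # the user picks the cheaper option twice
        belief = bayes_update(belief, picks_cheaper)
        print(step + 1, belief)
```

Each interaction sharpens the belief toward the hypothesis that better explains the observed choices; this is the kind of multi-interaction inference the abstract describes for personalized recommendation.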
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
First: 2026-01-14T17:57:43+00:00 · Latest: 2026-01-15T17:20:36+00:00
Comments: Work in Progress
Abstract
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
中文标题/摘要
标题:协作多智能体推理的测试时强化学习
多智能体系统已成为许多应用中的实用LLM驱动合作者,通过多样性和交叉验证获得鲁棒性。然而,多智能体强化学习(MARL)训练资源密集且不稳定:队友的共同适应引入了非平稳性,奖励通常稀疏且高方差。因此,我们引入了多智能体测试时强化学习(MATTRL)框架,在推理时向多智能体讨论注入结构化的文本经验。MATTRL 组建一个由专家组成的团队进行多轮讨论,检索并整合测试时的经验,并达成共识进行最终决策。我们还研究了信用分配方法以构建轮次级经验池,然后将其重新注入对话中。在医学、数学和教育领域的具有挑战性的基准测试中,MATTRL 的准确率相对于多智能体基线平均提高了3.67%,相对于可比的单智能体基线提高了8.67%。消融研究探讨了不同的信用分配方案,并详细比较了它们对训练结果的影响。MATTRL 提供了一条稳定、有效且高效的路径,无需调优即可实现对分布偏移鲁棒的多智能体推理。
Summary / 总结
The paper introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), which injects structured textual experience into multi-agent deliberation at inference time to improve robustness and accuracy. MATTRL forms a multi-expert team for discussions, retrieves and integrates test-time experiences, and reaches consensus for decision-making. Experiments show that MATTRL improves accuracy by 3.67% over a multi-agent baseline and by 8.67% over single-agent baselines across various benchmarks in medicine, math, and education. Ablation studies evaluate different credit-assignment schemes and their impact on training outcomes.
论文提出了Multi-Agent Test-Time Reinforcement Learning (MATTRL),该方法在推理时注入结构化的文本经验,以提高鲁棒性和准确性。MATTRL 形成一个多专家团队进行讨论,检索和整合测试时的经验,并达成共识进行决策。实验结果显示,MATTRL 在医学、数学和教育等领域的基准测试中,相对于多代理基线提高了 3.67% 的准确性,相对于单代理基线提高了 8.67% 的准确性。消融研究评估了不同信用分配方案及其对训练结果的影响。
Procedural Fairness in Multi-Agent Bandits
Authors: Joshua Caiata, Carter Blair, Kate Larson
First: 2026-01-15T17:11:51+00:00 · Latest: 2026-01-15T17:11:51+00:00
Abstract
In the context of multi-agent multi-armed bandits (MA-MAB), fairness is often reduced to outcomes: maximizing welfare, reducing inequality, or balancing utilities. However, evidence in psychology, economics, and Rawlsian theory suggests that fairness is also about process and who gets a say in the decisions being made. We introduce a new fairness objective, procedural fairness, which provides equal decision-making power for all agents, lies in the core, and provides for proportionality in outcomes. Empirical results confirm that fairness notions based on optimizing for outcomes sacrifice equal voice and representation, while the sacrifice in outcome-based fairness objectives (like equality and utilitarianism) is minimal under procedurally fair policies. We further prove that different fairness notions prioritize fundamentally different and incompatible values, highlighting that fairness requires explicit normative choices. This paper argues that procedural legitimacy deserves greater focus as a fairness objective, and provides a framework for putting procedural fairness into practice.
中文标题/摘要
标题:多智能体多臂老虎机中的程序公平性
在多智能体多臂老虎机(MA-MAB)的背景下,公平性通常被简化为结果:最大化福利、减少不平等或平衡效用。然而,心理学、经济学和罗尔斯理论的证据表明,公平性还关乎过程以及谁在决策中拥有发言权。我们引入了一个新的公平性目标——程序公平性,它为所有智能体提供了平等的决策权,处于核心位置,并确保结果的成比例性。实证结果表明,基于优化结果的公平性观念牺牲了平等的声音和代表性,而基于结果的公平性目标(如平等和功利主义)在程序公平政策下的牺牲最小。我们进一步证明,不同的公平性观念优先考虑的是根本不同且不兼容的价值观,突显了公平性需要明确规范性选择。本文认为,程序合法性作为公平性目标应得到更多关注,并提供了一个将程序公平性付诸实践的框架。
Summary / 总结
This paper introduces a new fairness objective called procedural fairness in the context of multi-agent multi-armed bandits, which emphasizes equal decision-making power for all agents. The empirical results show that fairness based on optimizing outcomes can sacrifice equal voice and representation, while procedural fairness minimally impacts outcome-based fairness objectives like equality and utilitarianism. The study also demonstrates that different fairness notions prioritize different and incompatible values, emphasizing the need for explicit normative choices in fairness objectives.
该论文在多代理多臂老虎机中引入了一种新的公平性目标——程序公平性,确保所有代理拥有平等的决策权。研究发现,基于优化结果的公平性往往会牺牲平等的发言权和代表性,而程序公平性政策对结果的影响最小。研究还表明,不同的公平性观念优先考虑的是相互冲突的价值观,强调在公平性目标中需要明确规范性的选择。