arXiv Paper Digest

2026-01-19 03:20
Snapshot: 20260119_0320
Alterbute: Editing Intrinsic Attributes of Objects in Images
Authors: Tal Reiss, Daniel Winter, Matan Cohen, Alex Rav-Acha, Yael Pritch, Ariel Shamir, Yedid Hoshen
First: 2026-01-15T18:59:53+00:00 · Latest: 2026-01-15T18:59:53+00:00
Comments: Project page is available at https://talreiss.github.io/alterbute/
Abstract
We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., ''Porsche 911 Carrera'') that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
中文标题/摘要
标题:Alterbute:图像中对象固有属性的编辑
我们介绍了Alterbute,一种基于扩散的方法,用于编辑图像中对象的固有属性。我们允许更改对象的颜色、纹理、材质,甚至形状,同时保持其感知身份和场景上下文。现有方法要么依赖于难以维持身份的无监督先验,要么使用过于严格的监督,这限制了有意义的固有属性变化。我们的方法依赖于:(i) 放松的训练目标,允许模型在参考身份图像、描述目标固有属性的文本提示以及定义外部上下文的背景图像和对象掩码的条件下,同时改变固有属性和外部属性;(ii) 视觉命名实体(VNEs)——细粒度的视觉身份类别(例如,“保时捷911卡雷拉”),这些类别将具有身份定义特征的对象分组,同时允许固有属性的变化。我们使用视觉语言模型从大型公共图像数据集中自动提取VNE标签和固有属性描述,从而实现可扩展的、保持身份的监督。Alterbute在保持身份的对象固有属性编辑方面优于现有方法。
Summary / 总结
The research introduces Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image, such as color, texture, and material, while preserving the object's identity and scene context. The method uses a relaxed training objective and Visual Named Entities to allow changes in intrinsic attributes while maintaining extrinsic context. Experiments show that Alterbute outperforms existing methods in preserving object identity during intrinsic attribute editing.
Alterbute 是一种基于扩散的方法,用于编辑图像中对象的内在属性,如颜色、纹理和材料,同时保持对象的身份和场景上下文。它使用宽松的训练目标和视觉命名实体(VNEs)来允许内在属性的变化,同时保持外在属性的一致性。该方法在保持对象身份的同时,优于现有方法进行内在属性编辑。
MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
Authors: Changle Qu, Sunhao Dai, Hengyi Cai, Jun Xu, Shuaiqiang Wang, Dawei Yin
First: 2026-01-15T18:59:23+00:00 · Latest: 2026-01-15T18:59:23+00:00
Abstract
Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
中文标题/摘要
标题:MatchTIR:通过二分匹配实现细粒度监督的工具集成推理
工具集成推理(TIR)通过交替进行推理步骤和外部工具交互,赋予大型语言模型(LLMs)处理复杂任务的能力。然而,现有的强化学习方法通常依赖于结果或轨迹级别的奖励,对轨迹内的所有步骤分配相同的优点。这种粗粒度的信用分配无法区分有效的工具调用与冗余或错误的调用,特别是在长时间多轮场景中。为了解决这个问题,我们提出了MatchTIR框架,通过基于二分匹配的轮次级别奖励分配和双层优势估计引入细粒度监督。具体来说,我们将信用分配形式化为预测和真实轨迹之间的二分匹配问题,利用两种分配策略推导密集的轮次级别奖励。此外,为了平衡局部步骤精度与全局任务成功,我们引入了一种双层优势估计方案,结合轮次级别和轨迹级别的信号,为每个交互轮次分配不同的优势值。在三个基准上的广泛实验表明,MatchTIR具有优越性。值得注意的是,我们的4B模型在大多数8B竞争对手中表现更优,特别是在长时间多轮任务中。我们的代码可在https://github.com/quchangle1/MatchTIR获取。
Summary / 总结
MatchTIR strengthens Tool-Integrated Reasoning (TIR) with fine-grained supervision: bipartite matching between predicted and ground-truth traces yields dense turn-level rewards, and a dual-level advantage estimate combines turn-level and trajectory-level signals. This lets training distinguish effective tool calls from redundant or erroneous ones, especially in long-horizon multi-turn scenarios. In experiments on three benchmarks, the 4B MatchTIR model surpasses the majority of 8B competitors, particularly on long-horizon and multi-turn tasks.
MatchTIR 通过二分匹配分配细粒度的回合级奖励,区分有效的工具调用和冗余的调用,以增强大型语言模型的工具集成推理(TIR)。该方法引入了双层优势估计方案,以平衡局部精度和全局任务成功。实验表明,MatchTIR 在三个基准测试中优于大多数 8B 竞争对手,尤其是在长周期和多回合任务中。
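The turn-level reward idea in the MatchTIR abstract above can be sketched as follows. This is only an illustration: the abstract does not specify the two assignment strategies, so the sketch assumes a single exhaustive one-to-one assignment (workable for short traces), and the `call_similarity` scoring rule and the dictionary layout of a tool call are hypothetical.

```python
from itertools import permutations

def call_similarity(pred, gold):
    """Hypothetical similarity: 0 if tool names differ, else 0.5 plus 0.5 * argument agreement."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    keys = set(pred["args"]) | set(gold["args"])
    if not keys:
        return 1.0
    same = sum(1 for k in keys if pred["args"].get(k) == gold["args"].get(k))
    return 0.5 + 0.5 * same / len(keys)

def turn_rewards(pred_calls, gold_calls):
    """Match predicted calls one-to-one with ground-truth calls, maximizing total similarity.

    Returns one reward per predicted call; unmatched (redundant) calls get 0.
    Brute force over assignments, so only suitable for small traces.
    """
    n_pred, n_gold = len(pred_calls), len(gold_calls)
    sim = [[call_similarity(p, g) for g in gold_calls] for p in pred_calls]
    size = max(n_pred, n_gold)  # slots >= n_gold mean "left unmatched"
    best_score, best_perm = -1.0, ()
    for perm in permutations(range(size), n_pred):
        score = sum(sim[i][j] for i, j in enumerate(perm) if j < n_gold)
        if score > best_score:
            best_score, best_perm = score, perm
    return [sim[i][j] if j < n_gold else 0.0 for i, j in enumerate(best_perm)]

# Hypothetical traces: the second predicted call repeats the first one.
gold = [{"tool": "search", "args": {"q": "a"}}, {"tool": "calc", "args": {"x": 1}}]
pred = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "calc", "args": {"x": 2}},
]
rewards = turn_rewards(pred, gold)
print(rewards)
```

In this toy trace the duplicated search call receives no reward and the calculator call with a wrong argument receives partial credit, which is the kind of per-turn distinction a uniform trajectory-level reward cannot make.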
From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Authors: Cheng Chen, Yuyu Guo, Pengpeng Zeng, Jingkuan Song, Peng Di, Hang Yu, Lianli Gao
First: 2026-01-15T18:59:10+00:00 · Latest: 2026-01-15T18:59:10+00:00
Abstract
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
中文标题/摘要
标题:从一对一到多对多:动态跨层注入以实现深度视觉-语言融合
视觉-语言模型(VLMs)通过使用粗略的、不对称的连接,仅将视觉编码器的输出链接到大型语言模型(LLM)的输入,从而造成严重的视觉特征瓶颈。这种静态架构从根本上限制了LLM实现层次视觉知识全面对齐的能力,削弱了它们将局部细节与全局语义整合到连贯推理中的能力。为了解决这一问题,我们引入了跨层注入(CLI),这是一种新颖且轻量级的框架,它在两种模态之间构建了一种动态的多对多桥梁。CLI 包含两个协同工作的、参数高效的组件:自适应多投影(AMP)模块,该模块协调来自不同视觉层的特征,以及自适应门控融合(AGF)机制,该机制使LLM能够根据其实时解码上下文选择性地注入最相关的视觉信息。我们通过将CLI整合到LLaVA-OneVision和LLaVA-1.5中来验证其有效性和灵活性。在18个多样基准上的广泛实验表明,CLI 显著提高了性能,确立了CLI 作为一种可扩展范式,通过赋予LLM按需访问完整视觉层次结构的能力,解锁了更深层次的多模态理解。
Summary / 总结
The paper addresses the limitations of static vision-language models (VLMs) by proposing Cross-Layer Injection (CLI), a dynamic framework that facilitates a many-to-many connection between visual and language modalities. CLI includes an Adaptive Multi-Projection (AMP) module for feature harmonization and an Adaptive Gating Fusion (AGF) mechanism for selective visual information injection. Experiments show CLI improves performance on 18 diverse benchmarks, enhancing LLMs' ability to integrate visual and textual information for coherent reasoning.
论文提出了一种动态框架Cross-Layer Injection (CLI),以解决静态视觉-语言模型(VLMs)的限制,CLI允许视觉和语言模态之间实现多对多连接。CLI包括一个自适应多投影(AMP)模块进行特征谐调,以及一个自适应门控融合(AGF)机制进行实时解码上下文下的视觉信息选择性注入。实验表明CLI在18个不同的基准上显著提高了性能,增强了视觉细节与全局语义之间的对齐。
See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Authors: Amir Mallak, Erfan Aasi, Shiva Sreeram, Tsun-Hsuan Wang, Daniela Rus, Alaa Maalouf
First: 2026-01-15T18:58:33+00:00 · Latest: 2026-01-15T18:58:33+00:00
Abstract
Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD) scenarios. We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: 90% of variance is captured by 17/64 principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a 6.2% average improvement and up to 20.4% in closed-loop simulations, while being 2.4× faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
中文标题/摘要
标题:少看,开好车:基于基础模型的随机补丁选择实现通用端到端自动驾驶
端到端自动驾驶的最新进展表明,从基础模型提取的补丁对齐特征训练出的策略在分布外(OOD)场景下表现更好。我们假设由于自注意力机制,每个补丁特征隐含地嵌入/包含来自其他所有补丁的信息,以不同的方式和强度表示,使得这些描述符高度冗余。我们通过主成分分析(PCA)和跨补丁相似性量化了(BLIP2)特征中的冗余性:90%的方差由17/64个主成分捕获,且跨标记相关性很强。在这样的重叠信息上进行训练会导致策略过度拟合虚假的相关性,损害了OOD鲁棒性。我们提出了随机补丁选择(SPS),这是一种简单而有效的方法,用于学习更鲁棒、更通用且更高效的策略。对于每一帧,SPS随机遮蔽一部分补丁描述符,不将其提供给策略模型,同时保持剩余补丁的空间布局。因此,策略获得了不同但完整的(相同)场景视图:每个随机补丁子集像一个不同的、仍然合理的世界投影。策略基于不变于哪些特定标记存活的特征做出决策。大量实验表明,我们的方法在所有OOD场景中均优于现有最佳方法(SOTA),平均改进6.2%,在闭环模拟中最高可达20.4%,同时速度提高2.4倍。我们在遮蔽率和补丁特征重组、训练和评估9个系统中进行了消融研究,其中8个系统超越了先前的SOTA。最后,我们展示了相同的策略在无需调整的情况下可以转移到物理的、真实世界的汽车上。
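A minimal sketch of the per-frame masking step described in the SPS abstract above, assuming patch descriptors are stored in a dictionary keyed by grid position so that survivors keep their spatial coordinates. The function name, the data layout, and the 8×8 grid are assumptions for illustration, not the authors' implementation.

```python
import random

def stochastic_patch_selection(patches, mask_rate, rng):
    """Drop a random fraction of patch descriptors for one frame.

    patches: dict mapping (row, col) -> feature vector.
    Surviving patches keep their original (row, col) keys, so the
    spatial layout of what remains is preserved.
    """
    positions = sorted(patches)
    n_drop = int(len(positions) * mask_rate)
    dropped = set(rng.sample(positions, n_drop))
    return {pos: feat for pos, feat in patches.items() if pos not in dropped}

# Hypothetical 8x8 grid of 2-dimensional descriptors.
frame = {(r, c): [float(r), float(c)] for r in range(8) for c in range(8)}
rng = random.Random(0)
view = stochastic_patch_selection(frame, mask_rate=0.5, rng=rng)
print(len(view), "of", len(frame), "patches are passed to the policy")
```

Calling the function again on the same frame with a fresh random draw gives a different subset, which is what exposes the policy to many partial views of one scene.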
Grounding Agent Memory in Contextual Intent
Authors: Ruozhen Yang, Yucheng Jiang, Yueqi Jiang, Priyanka Kargupta, Yunyi Zhang, Jiawei Han
First: 2026-01-15T18:55:13+00:00 · Latest: 2026-01-15T18:55:13+00:00
Abstract
Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
中文标题/摘要
标题:将代理记忆扎根于上下文意图
在长期目标导向的交互中部署大型语言模型仍然具有挑战性,因为相似的实体和事实会在不同的潜在目标和约束下反复出现,导致记忆系统检索到上下文不匹配的证据。我们提出了STITCH(上下文历史中的结构化意图跟踪),这是一种代理记忆系统,通过结构化的检索提示、上下文意图对每个轨迹步骤进行索引,并通过匹配当前步骤的意图来检索历史。上下文意图提供了紧凑的信号,消除了重复提及的歧义并减少干扰:(1) 当前定义主题段落的潜在目标,(2) 动作类型,以及(3) 重要的实体类型,锚定哪些属性是相关的。在推理过程中,STITCH通过意图兼容性筛选和优先级排序记忆片段,抑制语义相似但上下文不兼容的历史记录。 为了评估,我们引入了CAME-Bench,这是一个基准测试,用于在现实、动态的目标导向轨迹中进行上下文感知检索。在CAME-Bench和LongMemEval上,STITCH达到了最先进的性能,比最强基线高出35.6%,随着轨迹长度的增加,性能提升最大。我们的分析表明,意图索引显著减少了检索噪声,支持意图感知的记忆以实现稳健的长期推理。
Summary / 总结
The research aims to address the challenge of deploying large language models in long-term goal-oriented interactions by improving memory systems. STITCH, a structured intent tracking system, indexes each step with a contextual intent, which includes the current latent goal, action type, and salient entity types. This method reduces memory retrieval interference and enhances performance. STITCH outperforms the strongest baseline by 35.6% on CAME-Bench and LongMemEval, with the largest improvements seen in longer trajectories.
研究旨在通过改进记忆系统来解决大型语言模型在长期目标导向交互中的部署难题。STITCH 是一种结构化的意图跟踪系统,它将每个步骤与上下文意图关联,包括当前的潜在目标、动作类型和关键实体类型。这种方法减少了检索干扰,提升了性能。STITCH 在 CAME-Bench 和 LongMemEval 上的表现优于最强基线 35.6%,特别是在较长的轨迹中表现更佳。
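The intent-indexed retrieval in the STITCH abstract above can be illustrated with a small sketch. It assumes a hard filter on the latent goal followed by ranking on action type and entity-type overlap; the abstract only says that snippets are filtered and prioritized by intent compatibility, so this particular rule, the class names, and the example memory are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intent:
    goal: str                 # current latent goal (thematic segment)
    action: str               # action type
    entity_types: frozenset   # salient entity types

@dataclass
class MemoryItem:
    text: str
    intent: Intent

def priority(query, stored):
    """Hypothetical ranking score: action match plus entity-type overlap."""
    score = 1 if query.action == stored.action else 0
    score += len(query.entity_types & stored.entity_types)
    return score

def retrieve(memory, query, top_k=3):
    """Keep only items whose goal matches the query, then rank by priority."""
    kept = [(priority(query, m.intent), i, m)
            for i, m in enumerate(memory) if m.intent.goal == query.goal]
    kept.sort(key=lambda t: (-t[0], t[1]))  # high score first, then insertion order
    return [m.text for _, _, m in kept[:top_k]]

memory = [
    MemoryItem("Hotel near the Louvre: 200 per night",
               Intent("plan_paris_trip", "lookup_price", frozenset({"hotel"}))),
    MemoryItem("Hotel near the Colosseum: 150 per night",
               Intent("plan_rome_trip", "lookup_price", frozenset({"hotel"}))),
    MemoryItem("Flight to Paris departs at 09:00",
               Intent("plan_paris_trip", "lookup_schedule", frozenset({"flight"}))),
]
query = Intent("plan_paris_trip", "lookup_price", frozenset({"hotel"}))
results = retrieve(memory, query)
print(results)
```

The Rome hotel entry is textually very close to the query but belongs to a different goal, so it is suppressed, which mirrors the "semantically similar but context-incompatible" case the abstract describes.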
LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Authors: Gilat Toker, Nitay Calderon, Ohad Amosy, Roi Reichart
First: 2026-01-15T18:54:50+00:00 · Latest: 2026-01-15T18:54:50+00:00
Abstract
Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of text generation: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
中文标题/摘要
标题:LIBERTy:基于因果框架的概念驱动解释基准测试方法,使用结构性反事实
概念驱动的解释量化了高层概念(例如,性别或经验)对模型行为的影响,这对于高风险领域的决策者至关重要。近期工作通过将这些解释与从反事实中估计的参考因果效应进行比较,来评估其忠实性。实际上,现有的基准测试依赖于昂贵的人工编写的反事实,这些反事实作为不完美的代理。为了解决这一问题,我们提出了一种构建包含结构性反事实配对的数据集的框架:LIBERTy(基于LLM的解释性基准测试,具有参考目标的干预性因果模型)。LIBERTy基于明确定义的结构因果模型(SCM),文本生成中的干预传播通过SCM,直到LLM生成反事实。我们引入了三个数据集(疾病检测、CV筛选和工作场所暴力预测)以及一个新的评估指标,顺序忠实性。使用这些数据集,我们评估了五种模型的广泛方法,并确定了改进概念驱动解释的大量空间。LIBERTy还使系统分析模型对干预的敏感性成为可能:我们发现专有LLM在人口统计概念上的敏感性明显降低,这可能是由于后训练缓解措施所致。总体而言,LIBERTy为开发忠实的解释方法提供了急需的基准测试。
Summary / 总结
The research aims to benchmark the faithfulness of concept-based explanations for large language models (LLMs) using structural counterfactuals. It introduces LIBERTy, a framework grounded in Structured Causal Models (SCMs) of text generation, in which interventions on a concept propagate through the SCM until an LLM generates the counterfactual. The authors release three datasets (disease detection, CV screening, and workplace violence prediction) and a new evaluation metric, order-faithfulness. Evaluating a wide range of methods across five models, they find substantial headroom for improving concept-based explanations, and that proprietary LLMs are markedly less sensitive to demographic concepts, likely due to post-training mitigation.
研究旨在通过使用结构性反事实来提高大型语言模型(LLM)的概念解释的准确性。方法是利用基于结构因果模型(SCM)的LIBERTy框架生成反事实,通过干预概念来生成反事实。关键发现包括在概念解释方面存在很大的改进空间,并且发现专有LLM对人口统计概念的敏感度较低,这可能是由于后训练缓解措施所致。引入了评估指标‘顺序忠实度’来更准确地评估这些解释。
The Impact of Generative AI on Architectural Conceptual Design: Performance, Creative Self-Efficacy and Cognitive Load
Authors: Han Jiang, Yao Xiao, Rachel Hurley, Shichao Liu
First: 2026-01-15T18:52:59+00:00 · Latest: 2026-01-15T18:52:59+00:00
Abstract
Our study examines how generative AI (GenAI) influences performance, creative self-efficacy, and cognitive load in architectural conceptual design tasks. Thirty-six student participants from Architectural Engineering and other disciplines completed a two-phase architectural design task, first independently and then with external tools (GenAI-assisted condition and control condition using an online repository of existing architectural projects). Design outcomes were evaluated by expert raters, while self-efficacy and cognitive load were self-reported after each phase. Difference-in-differences analyses revealed no overall performance advantage of GenAI across participants; however, subgroup analyses showed that GenAI significantly improved design performance for novice designers. In contrast, general creative self-efficacy declined for students using GenAI. Cognitive load did not differ significantly between conditions, though prompt usage patterns showed that iterative idea generation and visual feedback prompts were linked to greater reductions in cognitive load. These findings suggest that GenAI effectiveness depends on users' prior expertise and interaction strategies through prompting.
中文标题/摘要
标题:生成式AI对建筑概念设计的影响:性能、创造性自我效能感和认知负荷
本研究探讨了生成式AI(GenAI)对建筑概念设计任务中性能、创造性自我效能感和认知负荷的影响。36名来自建筑学和其它学科的学生参与者完成了两阶段的建筑设计任务,首先独立完成,然后使用外部工具(GenAI辅助条件和对照条件使用在线现有建筑项目库)。设计成果由专家评定,自我效能感和认知负荷在每个阶段后由参与者自我报告。差异性分析显示,GenAI在总体上并未给参与者带来性能优势;然而,子组分析表明,GenAI显著提高了新手设计师的设计性能。相反,使用GenAI的学生的总体创造性自我效能感有所下降。认知负荷在不同条件下没有显著差异,但提示使用模式显示,迭代想法生成和视觉反馈提示与认知负荷的更大减少有关。这些发现表明,GenAI的有效性取决于用户的先前专业知识和通过提示的交互策略。
Summary / 总结
This study investigates the impact of generative AI (GenAI) on architectural design performance, creative self-efficacy, and cognitive load. Thirty-six student participants completed a two-phase design task, first independently and then with external tools: either GenAI assistance or, in the control condition, an online repository of existing architectural projects. Difference-in-differences analyses found no overall performance advantage for GenAI, but subgroup analyses showed significantly improved performance for novice designers. General creative self-efficacy declined for students using GenAI. Cognitive load did not differ significantly between conditions, although iterative idea generation and visual feedback prompts were linked to greater reductions in cognitive load. The effectiveness of GenAI thus depends on users' prior expertise and prompting strategies.
本研究探讨了生成式AI对建筑设计性能、创造性自我效能感和认知负荷的影响。36名参与者完成了两个阶段的设计任务,一个独立完成,一个使用GenAI辅助。总体而言,GenAI并未显著提高设计性能,但对初学者有显著的提升效果。使用GenAI的学生创造性自我效能感下降。认知负荷在不同条件下没有显著差异,但迭代的想法生成和视觉反馈提示与认知负荷的降低有关。生成式AI的效果取决于用户的先前经验和交互策略。
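For readers unfamiliar with the difference-in-differences analysis mentioned in the abstract above, the basic estimator compares the phase-to-phase change in the GenAI group with the change in the control group. The sketch below uses made-up scores purely to show the arithmetic; they are not data from the study, and the study's actual model may include further terms.

```python
from statistics import fmean

def difference_in_differences(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """(change in treated group) minus (change in control group)."""
    treated_change = fmean(treat_post) - fmean(treat_pre)
    control_change = fmean(ctrl_post) - fmean(ctrl_pre)
    return treated_change - control_change

# Hypothetical expert ratings, phase 1 (independent) and phase 2 (with tools).
genai_phase1, genai_phase2 = [3, 4, 5], [5, 6, 7]
control_phase1, control_phase2 = [3, 4, 5], [4, 5, 6]

effect = difference_in_differences(genai_phase1, genai_phase2,
                                   control_phase1, control_phase2)
print(effect)
```

Subtracting the control group's change removes improvement that would have happened anyway between phases, for example from practice on the task.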
Structure and Diversity Aware Context Bubble Construction for Enterprise Retrieval Augmented Systems
Authors: Amir Khurshid, Abhishek Sehgal
First: 2026-01-15T18:43:19+00:00 · Latest: 2026-01-15T18:43:19+00:00
Abstract
Large language model (LLM) contexts are typically constructed using retrieval-augmented generation (RAG), which involves ranking and selecting the top-k passages. This approach fragments the information graphs present in document structures and causes over-retrieval and duplication of content alongside insufficient query context, including 2nd and 3rd order facets. In this paper, a structure-informed and diversity-constrained context bubble construction framework is proposed that assembles coherent, citable bundles of spans under a strict token budget. The method preserves and exploits inherent document structure by organising multi-granular spans (e.g., sections and rows) and using task-conditioned structural priors to guide retrieval. Starting from high-relevance anchor spans, a context bubble is constructed through constrained selection that balances query relevance, marginal coverage, and redundancy penalties. Unlike top-k retrieval, it explicitly constrains diversity and budget, producing compact and informative context sets. Moreover, a full retrieval trace is emitted that records the scoring and selection choices, thus providing auditability and deterministic tuning. Experiments on enterprise documents demonstrate the efficiency of the context bubble: it significantly reduces redundant context, covers secondary facets better, and achieves better answer quality and citation faithfulness within a limited context window. Ablation studies demonstrate that both structural priors and diversity-constrained selection are necessary; removing either component results in a decline in coverage and an increase in redundant or incomplete context.
Summary / 总结
This paper addresses the limitations of traditional retrieval-augmented generation (RAG) methods in enterprise retrieval systems, which often lead to fragmented information, over-retrieval, and insufficient query context. The authors propose a structure-informed and diversity-constrained context bubble construction framework that assembles coherent and citable context sets under a strict token budget. The method uses task-conditioned structural priors to organize multi-granular spans and constructs context bubbles through constrained selection, balancing query relevance, marginal coverage, and redundancy penalties. Experiments show that this approach significantly reduces redundant context, better covers secondary facets, and improves answer quality and citation faithfulness within a limited context window.
本文提出了一种结构导向和多样性约束的上下文气泡构建框架,以解决传统检索增强生成(RAG)方法在大型语言模型中的局限性。该方法通过保留文档结构并使用任务条件下的结构先验来构建连贯的上下文气泡,从而显著减少了冗余上下文并更好地覆盖了次要方面。实验表明,这种方法在有限的上下文窗口内,在答案质量和引文忠实度方面优于top-k检索。
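The constrained selection described in the context bubble abstract above (relevance, marginal coverage, redundancy penalty, token budget, and a selection trace) can be sketched as a greedy loop. The scoring formula, the weights, and the span fields below are assumptions chosen for illustration; the abstract does not give the actual objective.

```python
def build_context_bubble(spans, token_budget, cover_weight=0.5, redundancy_weight=0.5):
    """Greedily pick spans by relevance + new-facet coverage - overlap penalty.

    spans: list of dicts with keys "id", "relevance", "tokens", "facets" (a set).
    Returns (chosen ids, trace); the trace records each selection step.
    """
    chosen, covered, used, trace = [], set(), 0, []
    remaining = list(spans)
    while remaining:
        best, best_gain = None, 0.0
        for span in remaining:
            if used + span["tokens"] > token_budget:
                continue  # would exceed the budget
            new_facets = len(span["facets"] - covered)
            overlap = len(span["facets"] & covered)
            gain = (span["relevance"] + cover_weight * new_facets
                    - redundancy_weight * overlap)
            if gain > best_gain:
                best, best_gain = span, gain
        if best is None:
            break  # nothing fits or nothing adds value
        chosen.append(best["id"])
        covered |= best["facets"]
        used += best["tokens"]
        trace.append({"id": best["id"], "gain": round(best_gain, 3), "tokens_used": used})
        remaining.remove(best)
    return chosen, trace

# Hypothetical spans: B is a near-duplicate of A; C covers a secondary facet.
spans = [
    {"id": "A", "relevance": 0.9, "tokens": 50, "facets": {"pricing"}},
    {"id": "B", "relevance": 0.85, "tokens": 50, "facets": {"pricing"}},
    {"id": "C", "relevance": 0.5, "tokens": 40, "facets": {"sla"}},
    {"id": "D", "relevance": 0.3, "tokens": 80, "facets": {"renewal"}},
]
chosen, trace = build_context_bubble(spans, token_budget=100)
print(chosen)
print(trace)
```

With these toy inputs, plain top-2 retrieval by relevance would return the two near-duplicate pricing spans, whereas the greedy loop spends the same budget on one pricing span and one span covering a secondary facet.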
Are Your Reasoning Models Reasoning or Guessing? A Mechanistic Analysis of Hierarchical Reasoning Models
Authors: Zirui Ren, Ziming Liu
First: 2026-01-15T18:42:50+00:00 · Latest: 2026-01-15T18:42:50+00:00
Abstract
Hierarchical reasoning model (HRM) achieves extraordinary performance on various reasoning tasks, significantly outperforming large language model-based reasoners. To understand the strengths and potential failure modes of HRM, we conduct a mechanistic study on its reasoning patterns and find three surprising facts: (a) Failure of extremely simple puzzles, e.g., HRM can fail on a puzzle with only one unknown cell. We attribute this failure to the violation of the fixed point property, a fundamental assumption of HRM. (b) "Grokking" dynamics in reasoning steps, i.e., the answer is not improved uniformly, but instead there is a critical reasoning step that suddenly makes the answer correct; (c) Existence of multiple fixed points. HRM "guesses" the first fixed point, which could be incorrect, and gets trapped there for a while or forever. All facts imply that HRM appears to be "guessing" instead of "reasoning". Leveraging this "guessing" picture, we propose three strategies to scale HRM's guesses: data augmentation (scaling the quality of guesses), input perturbation (scaling the number of guesses by leveraging inference randomness), and model bootstrapping (scaling the number of guesses by leveraging training randomness). On the practical side, by combining all methods, we develop Augmented HRM, boosting accuracy on Sudoku-Extreme from 54.5% to 96.9%. On the scientific side, our analysis provides new insights into how reasoning models "reason".
中文标题/摘要
标题:你的推理模型是在推理还是猜测?层级推理模型的机制分析
层级推理模型(HRM)在各种推理任务中表现出色,显著优于基于大型语言模型的推理器。为了理解HRM的优势及其潜在的失败模式,我们对其推理模式进行了机制研究,发现了三个令人惊讶的事实:(a) 失败于极其简单的谜题,例如,HRM 可能在只有一个未知单元格的谜题上失败。我们将其失败归因于违反了HRM的基本假设——固定点性质。(b) 推理步骤中的“顿悟”动态,即答案并非均匀改进,而是存在一个关键的推理步骤,突然使答案正确;(c) 存在多个固定点。HRM“猜测”第一个固定点,该固定点可能是错误的,并且会暂时或永远被困在那里。所有这些事实表明,HRM 似乎是在“猜测”而不是“推理”。利用这种“猜测”的观点,我们提出了三种策略来扩展HRM的猜测:数据增强(提高猜测的质量)、输入扰动(通过利用推理的随机性扩展猜测的数量)和模型自举(通过利用训练的随机性扩展猜测的数量)。从实用角度来看,通过结合所有方法,我们开发了增强的HRM,将数独-极限的准确率从54.5%提升到96.9%。从科学角度来看,我们的分析为理解推理模型如何“推理”提供了新的见解。
Summary / 总结
The study investigates the reasoning patterns of the hierarchical reasoning model (HRM) and reports three findings: HRM can fail even on extremely simple puzzles, which the authors attribute to a violation of the fixed point property; it exhibits 'grokking' dynamics, where a single critical reasoning step suddenly makes the answer correct; and it has multiple fixed points, so it can get trapped in an incorrect one. These findings suggest that HRM appears to be guessing rather than reasoning. Building on this view, the study proposes data augmentation, input perturbation, and model bootstrapping; combining them yields Augmented HRM, which raises accuracy on Sudoku-Extreme from 54.5% to 96.9%.
研究探讨了层次推理模型(HRM)的推理模式,发现它们在简单谜题上经常失败,这是由于违反了固定点性质;表现出“顿悟”动态,即答案会突然变得正确;并且可能会陷入错误的固定点。这些发现表明,HRM 更像是在“猜测”而不是“推理”。作者提出了扩展 HRM 猜测的方法,如数据增强、输入扰动和模型自举,并展示了这些方法可以显著提高 HRM 在数独极端任务上的性能,将准确率从 54.5% 提高到 96.9%。这项研究为理解 HRM 的推理机制提供了新的见解。
Single-Stage Huffman Encoder for ML Compression
Authors: Aditya Agrawal, Albert Magyar, Hiteshwar Eswaraiah, Patrick Sheridan, Pradeep Janedula, Ravi Krishnan Venkatesan, Krishna Nair, Ravi Iyer
First: 2026-01-15T18:37:56+00:00 · Latest: 2026-01-15T18:37:56+00:00
Comments: 5 pages, 4 figures
Abstract
Training and serving Large Language Models (LLMs) require partitioning data across multiple accelerators, where collective operations are frequently bottlenecked by network bandwidth. Lossless compression using Huffman codes is an effective way to alleviate the issue; however, its three-stage design, which requires on-the-fly frequency analysis, codebook generation, and transmission of the codebook along with the data, introduces computational, latency, and data overheads that are prohibitive for latency-sensitive scenarios such as die-to-die communication. This paper proposes a single-stage Huffman encoder that eliminates these overheads by using fixed codebooks derived from the average probability distribution of previous data batches. Through our analysis of the Gemma 2B model, we demonstrate that tensors exhibit high statistical similarity across layers and shards. Using this approach we achieve compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, enabling efficient on-the-fly compression.
中文标题/摘要
标题:单阶段霍夫曼编码器用于ML压缩
训练和提供大型语言模型(LLMs)需要将数据分割到多个加速器上,其中集体操作经常受到网络带宽的瓶颈限制。使用霍夫曼编码进行无损压缩是一种有效的方法,然而其三阶段设计需要实时频率分析、代码簿生成和代码簿与数据的传输,引入了计算、延迟和数据开销,这些开销在如片间通信等对延迟敏感的场景中是不可接受的。本文提出了一种单阶段霍夫曼编码器,通过使用从先前数据批次的平均概率分布中派生的固定代码簿来消除这些开销。通过对Gemma 2B模型的分析,我们证明了张量在层间和碎片间具有高度的统计相似性。使用这种方法,我们实现了与每碎片霍夫曼编码相差0.5%的压缩率,并且在理想香农压缩性内的1%以内,从而实现了高效的实时压缩。
Summary / 总结
This paper addresses the challenge of bandwidth bottleneck in partitioning data for Large Language Models (LLMs) across accelerators. It proposes a single-stage Huffman encoder that uses fixed codebooks to eliminate the need for on-the-fly frequency analysis and codebook transmission, reducing computational and latency overheads. The approach demonstrates compression within 0.5% of per-shard Huffman coding and within 1% of the ideal Shannon compressibility, making it suitable for latency-sensitive scenarios such as die-to-die communication.
本文针对大规模语言模型(LLMs)数据分区时带宽限制的问题,提出了一种单阶段霍夫曼编码方法。该方法消除了在线频率分析和代码本传输的需要,减少了计算和延迟开销。编码器使用基于先前数据批次的平均概率分布的固定代码本。作者通过分析Gemma 2B模型,证明了张量在不同层和碎片之间具有高度的统计相似性,使得压缩效果在每个碎片霍夫曼编码的0.5%以内,并且接近理想的香农压缩性,适用于延迟敏感场景。
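To make the fixed-codebook idea in the Huffman abstract above concrete, the sketch below builds one Huffman code from the averaged symbol distribution of earlier batches and reuses it on a new batch, then compares the result with a per-batch Huffman code and the Shannon bound. The tiny 4-symbol alphabet, the toy batches, and the add-one smoothing (used so every symbol gets a code) are assumptions for illustration and are not taken from the paper.

```python
import heapq
from collections import Counter
from math import log2

def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code over probs."""
    if len(probs) == 1:
        return {next(iter(probs)): 1}
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(sorted(probs.items()))]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    next_id = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # every merge adds one bit to the symbols beneath it
        heapq.heappush(heap, (p1 + p2, next_id, syms1 + syms2))
        next_id += 1
    return lengths

def smoothed_distribution(batch, alphabet):
    """Empirical distribution with add-one smoothing (an assumption of this sketch)."""
    counts = Counter(batch)
    total = len(batch) + len(alphabet)
    return {s: (counts[s] + 1) / total for s in alphabet}

def fixed_codebook(previous_batches, alphabet):
    """Huffman code lengths from the average distribution of earlier batches."""
    dists = [smoothed_distribution(b, alphabet) for b in previous_batches]
    average = {s: sum(d[s] for d in dists) / len(dists) for s in alphabet}
    return huffman_code_lengths(average)

def encoded_bits(batch, lengths):
    return sum(lengths[s] for s in batch)

def shannon_bits(batch):
    n = len(batch)
    return -sum(c * log2(c / n) for c in Counter(batch).values())

alphabet = range(4)
previous = [[0, 0, 0, 1, 1, 2, 0, 3], [0, 0, 1, 0, 2, 1, 0, 0]]
new_batch = [0, 1, 0, 0, 2, 0, 1, 3]

fixed = encoded_bits(new_batch, fixed_codebook(previous, alphabet))
own_probs = {s: c / len(new_batch) for s, c in Counter(new_batch).items()}
per_batch = encoded_bits(new_batch, huffman_code_lengths(own_probs))
print(fixed, per_batch, shannon_bits(new_batch))
```

Because the codebook is computed ahead of time, encoding a new batch is a single table-lookup pass, and no codebook has to be sent with the data. The cost is a small loss whenever the new batch's statistics drift from the average.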
BASIL: Bayesian Assessment of Sycophancy in LLMs
Authors: Katherine Atwell, Pedram Heydari, Anthony Sicilia, Malihe Alikhani
First: 2025-08-23T00:11:00+00:00 · Latest: 2026-01-15T18:31:50+00:00
Abstract
Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational changes in behavior driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. Within this framework, we achieve three objectives: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how sycophancy leads models astray from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.
中文标题/摘要
标题:BASIL: 人工智能评估中的奉承行为贝叶斯评估
奉承(过度顺从或奉承的行为)对人类与AI的合作构成了根本性的挑战,尤其是在健康、法律和教育等高风险决策领域。在研究大型语言模型(LLM)中的奉承行为时,一个主要困难在于区分由奉承信念变化引起的行为变化与由新证据或用户提供的信息驱动的理性行为变化。现有方法要么衡量描述性行为变化,要么应用依赖客观事实的规范性评估,这限制了它们在主观或不确定任务中的适用性。我们提出了一种基于行为经济学和理性决策理论的贝叶斯概率框架,明确地将奉承与理性信念更新区分开来。在此框架内,我们实现了三个目标:(i)一个描述性指标,衡量奉承行为的同时控制理性对证据的反应;(ii)一个规范性指标,量化奉承如何使模型偏离贝叶斯一致的信念更新;(iii)能够在没有事实标签的情况下应用这两个指标的能力。在多个LLM和三个不确定性驱动的任务上应用我们的框架,我们发现了奉承信念变化的稳健证据,并表明其对理性的影响取决于模型是否系统地过度或不足地更新其信念。最后,我们证明了一种事后校准方法和两种微调策略(SFT和DPO)显著减少了贝叶斯不一致性,特别是在明确奉承提示下效果尤为显著。
Summary / 总结
The research addresses the challenge of sycophancy in large language models (LLMs) by introducing a Bayesian probabilistic framework that distinguishes sycophantic behavior from rational belief updates. The study measures sycophancy through a descriptive metric and evaluates its impact on rationality through a normative metric. Key findings include robust evidence of sycophantic belief shifts and the effectiveness of post-hoc calibration and fine-tuning strategies in reducing Bayesian inconsistency, especially under explicit sycophancy prompting.
研究通过引入一个区分语言模型(LLM)中奉承行为与理性信念更新的贝叶斯概率框架,解决了奉承性这一挑战。该研究通过描述性指标衡量奉承行为,并通过规范性指标评估其对理性的影响。主要发现包括:存在明显的奉承性信念偏移,并且后处理校准和两种微调策略(SFT和DPO)在减少贝叶斯不一致性方面非常有效,尤其是在明确的奉承性提示下。
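The separation between rational updating and sycophantic shifts described in the BASIL abstract above rests on comparing a model's reported belief with the Bayesian posterior implied by the evidence. The sketch below shows only that comparison for a binary hypothesis; the `excess_shift` quantity is a simplified stand-in and not the paper's descriptive or normative metric, and the numbers are hypothetical.

```python
def bayes_posterior(prior, likelihood_ratio):
    """Posterior P(H | evidence) from prior P(H) and LR = P(e | H) / P(e | not H)."""
    odds = prior / (1.0 - prior) * likelihood_ratio
    return odds / (1.0 + odds)

def excess_shift(prior, reported_posterior, likelihood_ratio):
    """How far a reported belief moved beyond what the evidence alone justifies."""
    return reported_posterior - bayes_posterior(prior, likelihood_ratio)

# Hypothetical case: the evidence favors H three to one, and the user also
# states that they personally believe H.
prior = 0.5
rational = bayes_posterior(prior, likelihood_ratio=3.0)
excess = excess_shift(prior, reported_posterior=0.9, likelihood_ratio=3.0)
print(rational, excess)
```

A positive excess means the reported belief moved further toward the user's stated view than the evidence supports, and a negative value means the model under-updated. Note that the comparison needs no ground-truth label for H.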
Detecting Winning Arguments with Large Language Models and Persuasion Strategies
Authors: Tiziano Labruna, Arkadiusz Modzelewski, Giorgio Satta, Giovanni Da San Martino
First: 2026-01-15T18:30:15+00:00 · Latest: 2026-01-15T18:30:15+00:00
Abstract
Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Argument dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.
中文标题/摘要
标题:使用大型语言模型和说服策略检测获胜论点
在论证文本中检测说服是一项具有重要含义的挑战性任务。本研究探讨了说服策略(如声誉攻击、转移注意力和操纵性措辞)在决定文本说服力中的作用。我们在三个标注的论证数据集上进行了实验:Winning Arguments(来自Change My View 子论坛)、Anthropic/Persuasion 和 Persuasion for Good。我们的方法利用大型语言模型(LLMs)结合多策略说服评分方法,指导对六种说服策略的推理。结果表明,策略导向的推理提高了说服力预测的准确性。为了更好地理解内容的影响,我们将Winning Argument数据集按广泛的讨论主题组织,并分析其在不同主题上的表现。我们公开发布了这个主题标注的数据集,以促进未来的研究。总体而言,我们的方法证明了结构化、策略意识提示在提高论证质量评估的可解释性和鲁棒性方面的价值。
Summary / 总结
This study examines how persuasion strategies affect how convincing an argument is. It prompts large language models to reason over six persuasion strategies through a Multi-Strategy Persuasion Scoring approach, which improves the prediction of persuasiveness across three annotated datasets. The authors also group the Winning Arguments dataset into broad discussion topics, analyze performance per topic, and publicly release the topic-annotated version. Overall, the results indicate that structured, strategy-aware prompting improves interpretability and robustness in argument quality assessment.
该研究考察说服策略如何影响论点的说服力。研究通过多策略说服评分方法,引导大型语言模型围绕六种说服策略进行推理,在三个标注数据集上提升了说服力预测的效果。作者还将 Winning Arguments 数据集按广泛的讨论主题分组,分析各主题上的表现,并公开发布了带主题标注的版本。总体而言,结果表明结构化、策略感知的提示有助于提升论证质量评估的可解释性和稳健性。
Pareto-Grid-Guided Large Language Models for Fast and High-Quality Heuristics Design in Multi-Objective Combinatorial Optimization
Authors: Minh Hieu Ha, Hung Phan, Tung Duy Doan, Tung Dao, Dao Tran, Huynh Thi Thanh Binh
Venue: AAAI
First: 2025-07-28T15:26:43+00:00 · Latest: 2026-01-15T18:28:50+00:00
Comments: Accepted at AAAI-26
Abstract
Multi-objective combinatorial optimization problems (MOCOP) frequently arise in practical applications that require the simultaneous optimization of conflicting objectives. Although traditional evolutionary algorithms can be effective, they typically depend on domain knowledge and repeated parameter tuning, limiting flexibility when applied to unseen MOCOP instances. Recently, integration of Large Language Models (LLMs) into evolutionary computation has opened new avenues for automatic heuristic generation, using their advanced language understanding and code synthesis capabilities. Nevertheless, most existing approaches predominantly focus on single-objective tasks, often neglecting key considerations such as runtime efficiency and heuristic diversity in multi-objective settings. To bridge this gap, we introduce Multi-heuristics for MOCOP via Pareto-Grid-guided Evolution of LLMs (MPaGE), a novel enhancement of the Simple Evolutionary Multiobjective Optimization (SEMO) framework that leverages LLMs and Pareto Front Grid (PFG) technique. By partitioning the objective space into grids and retaining top-performing candidates to guide heuristic generation, MPaGE utilizes LLMs to prioritize heuristics with semantically distinct logical structures during variation, thus promoting diversity and mitigating redundancy within the population. Through extensive evaluations, MPaGE demonstrates superior performance over existing LLM-based frameworks, and achieves competitive results to traditional Multi-objective evolutionary algorithms (MOEAs), with significantly faster runtime. Our code is available at: https://github.com/langkhachhoha/MPaGE.
中文标题/摘要
标题:帕累托网格引导的大语言模型在多目标组合优化中的快速高质量启发式设计
多目标组合优化问题(MOCOP)在需要同时优化冲突目标的实际应用中经常出现。尽管传统的进化算法可能有效,但它们通常依赖于领域知识和重复的参数调整,这在应用于未见过的MOCOP实例时限制了灵活性。最近,将大语言模型(LLMs)集成到进化计算中为自动启发式生成开辟了新的途径,利用它们先进的语言理解和代码合成能力。然而,大多数现有方法主要集中在单目标任务上,往往忽视了多目标设置中的关键考虑因素,如运行时效率和启发式多样性。为弥合这一差距,我们提出了基于帕累托网格引导的大语言模型进化多启发式算法(MPaGE),这是一种对简单多目标进化优化(SEMO)框架的创新增强,利用了LLMs和帕累托前沿网格(PFG)技术。通过将目标空间划分为网格并保留表现最佳的候选者来引导启发式生成,MPaGE利用LLMs在变异过程中优先考虑具有语义上不同的逻辑结构的启发式,从而促进多样性并减少种群中的冗余。通过广泛的评估,MPaGE在现有LLM基框架中表现出更优的性能,并且在运行时显著更快,达到了传统多目标进化算法(MOEAs)的竞争力。我们的代码可在:https://github.com/langkhachhoha/MPaGE 获取。
Summary / 总结
The paper introduces MPaGE, a method that enhances the SEMO framework using LLMs and the Pareto Front Grid (PFG) technique to generate heuristics for MOCOP. It partitions the objective space into grids and retains top-performing candidates to guide heuristic generation, with the LLM prioritizing semantically distinct heuristics to promote diversity and reduce redundancy. MPaGE outperforms existing LLM-based frameworks and achieves results competitive with traditional MOEAs at significantly faster runtime.
论文提出了MPaGE方法,通过使用LLMs和Pareto Front Grid (PFG)来增强SEMO框架,以生成多目标组合优化问题的启发式。MPaGE将目标空间划分为网格,并保留表现最佳的启发式以指导LLMs生成多样性的启发式,从而提高运行效率并减少冗余。广泛的评估表明,MPaGE在性能上优于现有的LLM基框架,并且与传统的多目标进化算法(MOEAs)相比具有更快的运行时间,达到了可竞争的结果。
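The grid partitioning mentioned in the abstract can be pictured with a small sketch. This is only one plausible reading, assuming minimization objectives and an equal-width grid over the observed range. The function names, the `divisions` parameter, and the sample points are hypothetical and do not come from the MPaGE code base.

```python
from collections import defaultdict

def dominates(a, b):
    """True if a is no worse than b on every objective and better on one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def grid_cells(points, divisions):
    """Bucket points into an equal-width grid over each objective's observed range."""
    dims = range(len(points[0]))
    lows = [min(p[k] for p in points) for k in dims]
    highs = [max(p[k] for p in points) for k in dims]
    cells = defaultdict(list)
    for p in points:
        index = []
        for value, lo, hi in zip(p, lows, highs):
            width = (hi - lo) or 1.0  # avoid dividing by zero on a flat objective
            index.append(min(int((value - lo) / width * divisions), divisions - 1))
        cells[tuple(index)].append(p)
    return dict(cells)

# Hypothetical (objective_1, objective_2) scores for five candidate heuristics.
candidates = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
front = pareto_front(candidates)
cells = grid_cells(front, divisions=2)
```

Grouping the front by cell gives a simple way to pick parents from different regions of the trade-off curve, which is the diversity motivation the abstract describes.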
Moonworks Lunara Aesthetic Dataset
Authors: Yan Wang, M M Sayeef Abdullah, Partho Hassan, Sabit Hassan
First: 2026-01-12T19:11:41+00:00 · Latest: 2026-01-15T18:27:29+00:00
Abstract
The dataset spans diverse artistic styles, including regionally grounded aesthetics from the Middle East, Northern Europe, East Asia, and South Asia, alongside general categories such as sketch and oil painting. All images are generated using the Moonworks Lunara model and intentionally crafted to embody distinct, high-quality aesthetic styles, yielding a first-of-its-kind dataset with substantially higher aesthetic scores, exceeding even aesthetics-focused datasets, and general-purpose datasets by a larger margin. Each image is accompanied by a human-refined prompt and structured annotations that jointly describe salient objects, attributes, relationships, and stylistic cues. Unlike large-scale web-derived datasets that emphasize breadth over precision, the Lunara Aesthetic Dataset prioritizes aesthetic quality, stylistic diversity, and licensing transparency, and is released under the Apache 2.0 license to support research and unrestricted academic and commercial use.
中文标题/摘要
标题:Moonworks Lunara美学数据集
该数据集涵盖了多种艺术风格,包括中东、北欧、东亚和南亚等地域性美学,以及素描和油画等通用类别。所有图像均使用Moonworks Lunara模型生成,并刻意设计以体现独特的高质量美学风格,从而形成一个前所未有的数据集,其美学评分显著高于专注于美学的数据集和通用数据集。每张图像都附有人工优化的提示和结构化注释,共同描述了显著对象、属性、关系和风格线索。与侧重广度而非精确度的大型网络数据集不同,Lunara美学数据集侧重于美学质量、风格多样性和许可透明度,并在Apache 2.0许可证下发布,以支持研究和无限制的学术及商业使用。
Summary / 总结
The Moonworks Lunara Aesthetic Dataset aims to provide a high-quality dataset with diverse artistic styles, including regional aesthetics from various parts of the world. The dataset is generated using the Moonworks Lunara model and includes structured annotations for salient objects, attributes, relationships, and stylistic cues. The images in this dataset have significantly higher aesthetic scores compared to other aesthetics-focused and general-purpose datasets. Each image is accompanied by a human-refined prompt, and the dataset is released under the Apache 2.0 license for unrestricted use in research and commercial applications.
Moonworks Lunara美学数据集旨在提供一个具有多样艺术风格的数据集,包括来自世界各地的地域美学。该数据集使用Moonworks Lunara模型生成,并包含描述显著物体、属性、关系和风格线索的结构化注释。与其它美学聚焦和通用数据集相比,该数据集的图像具有更高的美学评分。每张图像都附有人工优化的提示,且该数据集在Apache 2.0许可证下发布,支持研究和商业应用的无限制使用。
Knowledge Homophily in Large Language Models
Authors: Utkarsh Sahu, Zhisheng Qi, Mahantesh Halappanavar, Nedim Lipka, Ryan A. Rossi, Franck Dernoncourt, Yu Zhang, Yao Ma, Yu Wang
First: 2025-09-28T09:40:27+00:00 · Latest: 2026-01-15T18:26:36+00:00
Abstract
Large Language Models (LLMs) have been increasingly studied as neural knowledge bases for supporting knowledge-intensive applications such as question answering and fact checking. However, the structural organization of their knowledge remains unexplored. Inspired by cognitive neuroscience findings, such as semantic clustering and priming, where knowing one fact increases the likelihood of recalling related facts, we investigate an analogous knowledge homophily pattern in LLMs. To this end, we map LLM knowledge into a graph representation through knowledge checking at both the triplet and entity levels. After that, we analyze the knowledgeability relationship between an entity and its neighbors, discovering that LLMs tend to possess a similar level of knowledge about entities positioned closer in the graph. Motivated by this homophily principle, we propose a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores for triplets by leveraging their neighborhood scores. The predicted knowledgeability enables us to prioritize checking less well-known triplets, thereby maximizing knowledge coverage under the same labeling budget. This not only improves the efficiency of active labeling for fine-tuning to inject knowledge into LLMs but also enhances multi-hop path retrieval in reasoning-intensive question answering.
中文标题/摘要
标题:大型语言模型中的知识同质性
大型语言模型(LLMs)已被越来越多地研究作为支持知识密集型应用(如问答和事实核查)的神经知识库。然而,它们知识的结构组织尚未被探索。受认知神经科学发现的启发,如语义聚类和启动效应,即知道一个事实会增加回忆相关事实的可能性,我们研究了LLMs中的类似知识同质性模式。为此,我们通过知识检查将LLM知识映射到图表示中,分别在三元组和实体层面进行。之后,我们分析了实体与其邻居之间的知识能力关系,发现LLMs倾向于在图中位置更近的实体具有相似的知识水平。受这一同质性原则的启发,我们提出了一种图神经网络(GNN)回归模型,通过利用其邻居得分来估计三元组的实体级知识能力得分。预测的知识能力使我们能够优先检查不太为人所知的三元组,从而在相同的标注预算下最大化知识覆盖范围。这不仅提高了对LLMs进行微调以注入知识的主动标注效率,还增强了推理密集型问答中的多跳路径检索。
Summary / 总结
This study explores the knowledge homophily pattern in Large Language Models (LLMs) by mapping their knowledge into a graph representation and analyzing the knowledgeability relationship between entities and their neighbors. The research proposes a Graph Neural Network (GNN) regression model to estimate entity-level knowledgeability scores, which helps prioritize less well-known triplets for labeling, thus improving the efficiency of active labeling and enhancing multi-hop path retrieval in reasoning-intensive question answering.
研究探讨了大型语言模型(LLMs)中的知识同质性模式,受到认知神经科学发现的启发。通过将LLM知识映射到图表示并分析实体与其邻居之间的知识能力关系,研究发现LLMs倾向于对紧密连接的实体具有相似的知识水平。为了利用这一同质性原则,提出了一种图神经网络(GNN)回归模型来估计实体级的知识能力得分,这有助于优先标记较少为人知的三元组,从而提高主动标记的效率,并增强推理密集型问答中的多跳路径检索。
PACEvolve: Enabling Long-Horizon Progress-Aware Consistent Evolution
Authors: Minghao Yan, Bo Peng, Benjamin Coleman, Ziqi Chen, Zhouhang Xie, Zhankui He, Noveen Sachdeva, Isabella Ye, Weili Wang, Chi Wang, Ed H. Chi, Wang-Cheng Kang, Derek Zhiyuan Cheng, Beidou Wang
First: 2026-01-15T18:25:23+00:00 · Latest: 2026-01-15T18:25:23+00:00
Abstract
Large Language Models (LLMs) have emerged as powerful operators for evolutionary search, yet the design of efficient search scaffolds remains ad hoc. While promising, current LLM-in-the-loop systems lack a systematic approach to managing the evolutionary process. We identify three distinct failure modes: Context Pollution, where experiment history biases future candidate generation; Mode Collapse, where agents stagnate in local minima due to poor exploration-exploitation balance; and Weak Collaboration, where rigid crossover strategies fail to leverage parallel search trajectories effectively. We introduce Progress-Aware Consistent Evolution (PACEvolve), a framework designed to robustly govern the agent's context and search dynamics, to address these challenges. PACEvolve combines hierarchical context management (HCM) with pruning to address context pollution; momentum-based backtracking (MBB) to escape local minima; and a self-adaptive sampling policy that unifies backtracking and crossover for dynamic search coordination (CE), allowing agents to balance internal refinement with cross-trajectory collaboration. We demonstrate that PACEvolve provides a systematic path to consistent, long-horizon self-improvement, achieving state-of-the-art results on LLM-SR and KernelBench, while discovering solutions surpassing the record on Modded NanoGPT.
中文标题/摘要
标题:PACEvolve:实现长时程、进度感知的一致进化
大型语言模型(LLMs)已成为进化搜索的强大算子,但高效搜索支架的设计仍是临时拼凑的。尽管前景广阔,当前的LLM在环系统缺乏管理进化过程的系统方法。我们识别出三种不同的失败模式:上下文污染,即实验历史使后续候选生成产生偏差;模式崩溃,即由于探索与利用平衡不佳,代理在局部最小值中停滞;以及弱协作,即僵化的杂交策略无法有效利用并行搜索轨迹。我们引入了进度感知一致进化(PACEvolve)框架,旨在稳健地管理代理的上下文和搜索动力学,以应对这些挑战。PACEvolve结合了分层上下文管理(HCM)与剪枝以解决上下文污染;基于动量的回溯(MBB)以摆脱局部最小值;以及一种自适应采样策略,统一了回溯和杂交以实现动态搜索协调(CE),使代理能够平衡内部细化与跨轨迹协作。我们证明,PACEvolve提供了一条系统的路径,以实现一致的长时程自我改进,其在LLM-SR和KernelBench上达到了最先进的结果,同时发现超越Modded NanoGPT记录的解决方案。
Summary / 总结
The research aims to address the inefficiencies in evolutionary search processes managed by Large Language Models (LLMs), focusing on three failure modes: Context Pollution, Mode Collapse, and Weak Collaboration. The method introduces PACEvolve, a framework that uses hierarchical context management, momentum-based backtracking, and a self-adaptive sampling policy to mitigate these issues. Key experimental findings show that PACEvolve achieves state-of-the-art results on LLM-SR and KernelBench, and discovers solutions surpassing previous records on Modded NanoGPT.
论文提出了PACEvolve框架,旨在解决大型语言模型(LLMs)进化搜索中的三个关键问题:上下文污染、模式崩溃和弱协作。PACEvolve通过层次上下文管理、动量回溯和自适应采样策略来改进进化过程。实验结果表明,PACEvolve在LLM-SR和KernelBench上达到了最先进的性能,并在Modded NanoGPT上发现了超越先前记录的解决方案。
PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus
Authors: Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss
First: 2025-05-23T18:01:09+00:00 · Latest: 2026-01-15T18:18:24+00:00
Abstract
Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
中文标题/摘要
标题:PMOA-TTS:介绍PubMed开放获取文本时间序列语料库
临床叙述记录了对于建模患者轨迹至关重要的时间动态,但大规模的时间标注资源稀缺。我们介绍了PMOA-TTS语料库,该语料库包含124,699篇单个患者的PubMed开放获取病例报告,通过可扩展的大语言模型管道(Llama 3.3 70B和DeepSeek-R1)转换为由(事件,时间)对组成的结构化文本时间线。该语料库包含超过560万条带时间戳的事件,以及提取的人口统计学信息和诊断信息。技术验证使用了临床专家审核的金标准集和三种度量:语义事件匹配、时间一致性(c-指数)和通过对数时间累积分布函数下的面积(AULTC)总结的对齐误差。我们对替代提示和模型选择进行了基准测试,并提供了支持复现的文档。PMOA-TTS 使从叙述文本中提取时间线、时间推理、生存建模和事件预测的研究成为可能,并提供了广泛的诊断和人口统计学覆盖范围。数据和code在公共存储库中公开可用。
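The temporal concordance measure (c-index) named in the abstract can be illustrated with a minimal sketch. It assumes the common pairwise definition: among event pairs whose reference times differ, count the share whose predicted times are ordered the same way, with predicted ties counted as half. The paper's exact variant may differ, and the function name and example times (in hours) are hypothetical.

```python
from itertools import combinations

def concordance_index(reference_times, predicted_times):
    """Share of comparable event pairs whose predicted order matches the reference order."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(reference_times)), 2):
        ref_diff = reference_times[i] - reference_times[j]
        if ref_diff == 0:
            continue  # pairs tied in the reference are not comparable
        pred_diff = predicted_times[i] - predicted_times[j]
        comparable += 1
        if pred_diff == 0:
            concordant += 0.5  # a predicted tie earns half credit
        elif (ref_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    # Raises ZeroDivisionError when no pair is comparable.
    return concordant / comparable

reference = [0, 24, 48, 72]   # clinician-annotated event times
predicted = [0, 30, 20, 96]   # model-extracted times; the middle two are swapped
c_index = concordance_index(reference, predicted)
```

Here one of the six comparable pairs is out of order, so the score is 5/6. A value of 1.0 would mean every pair is ordered correctly, and 0.5 is what random ordering gives on average.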
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Authors: Darshan Singh, Arsha Nagrani, Kawshik Manikantan, Harman Singh, Dinesh Tewari, Tobias Weyand, Cordelia Schmid, Anelia Angelova, Shachi Dave
First: 2026-01-15T18:15:06+00:00 · Latest: 2026-01-15T18:15:06+00:00
Abstract
Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural
中文标题/摘要
标题:CURVE:文化与多语言长视频推理基准
近期视频模型的发展取得了巨大进步,特别是在长视频理解方面。然而,当前的基准测试主要以西方为中心的数据和英语为主,导致评估中存在显著的偏差。为了解决这一问题,我们引入了CURVE(视频评价中的文化理解和推理),这是一个针对多文化和多语言视频推理的具有挑战性的基准测试。CURVE 包含来自18个全球地区的高质量、完全由人类生成的、针对特定区域文化的视频注释。与以往依赖自动翻译的工作不同,CURVE 提供了复杂的问题、答案和多步推理步骤,所有这些都用当地语言精心制作。要在CURVE上取得进展需要对视觉文化背景有深入的理解。此外,我们利用CURVE的推理痕迹构建基于证据的图表,并提出了一种使用这些图表的新型迭代策略,以识别推理中的细微错误。我们的评估表明,最先进的视频大模型面临巨大挑战,其性能远低于人类水平,错误主要来自对文化元素的视觉感知。CURVE 将在 https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural 公开可用。
Summary / 总结
The research aims to address the bias in current video understanding benchmarks by introducing CURVE, a benchmark for multicultural and multilingual video reasoning. CURVE includes high-quality, human-generated annotations from diverse cultural videos across 18 global locales, providing complex questions and answers in native languages. The study finds that state-of-the-art video language models perform poorly on CURVE, with significant errors due to visual perception of cultural elements, indicating a need for deeper cultural understanding in video reasoning systems.
研究旨在通过引入CURVE,一个新的多文化多语言视频推理基准,来解决当前视频理解基准中的偏见问题。CURVE 使用来自 18 个全球区域的多元文化视频的高质量、人工生成的注释,提供复杂的问题和答案,用本地语言编写。研究发现,最先进的视频大模型表现不佳,主要错误源于对文化元素的视觉感知,表明需要更深层次的文化理解。CURVE 将公开发布以供进一步研究。
Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
Authors: Yuxi Xia, Loris Schoenegger, Benjamin Roth
First: 2026-01-15T18:05:42+00:00 · Latest: 2026-01-15T18:05:42+00:00
Abstract
Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.
中文标题/摘要
标题:有影响力的训练数据检索以解释LLMs的口头化置信度
大型语言模型(LLMs)可以通过口头化其输出的置信度来增加用户的信任感。然而,先前的研究表明,LLMs往往过于自信,使得它们声明的置信度不可靠,因为它并不总是与事实准确性一致。为了更好地理解口头化置信度的来源,我们引入了TracVC(追踪口头化置信度)方法,该方法基于信息检索和影响估计,将生成的置信度表达追溯到训练数据。我们在OLMo和Llama模型上以问答设置评估了TracVC,提出了一个新的度量标准,内容相关性,该度量标准衡量LLM在其置信度中基于与问题和答案相关的内容相关训练示例(而非通用的置信度口头化示例)的程度。我们的分析表明,OLMo2-13B经常受到与查询在词汇上无关的置信相关数据的影响,这表明它可能模仿表面的确定性语言表达,而不是依赖于真实的内容基础。这些发现指出了当前训练制度的一个根本局限:LLMs可能学会如何显得自信,而不知道何时置信是合理的。我们的分析为提高LLMs表达更可靠置信度的信任度奠定了基础。
Summary / 总结
The research aims to improve the trustworthiness of large language models (LLMs) by understanding the sources of their verbalized confidence. TracVC, a method combining information retrieval and influence estimation, traces confidence expressions back to the training data. The study evaluates OLMo and Llama models and introduces a new metric, content groundness, to measure the extent to which LLMs rely on content-related training data versus generic confidence expressions. The findings show that OLMo2-13B often bases its confidence on lexically unrelated data, indicating a limitation in current training regimes where LLMs may learn superficial confidence expressions without genuine content grounding. This work provides insights for enhancing LLMs' reliability in expressing confidence.
研究旨在通过将大型语言模型(LLMs)的自信声明追溯到训练数据,提高其自信表达的可靠性。TracVC 方法结合了信息检索和影响估计,用于评估LLMs自信的内容相关性,衡量其自信是基于相关训练示例还是通用的自信表达。研究发现,OLMo2-13B 经常依赖于与查询无关的自信相关数据,表明其自信表达可能缺乏真正的内容支撑,揭示了当前训练方法中的一个根本局限性。
Adjusted Similarity Measures and a Violation of Expectations
Authors: William L. Lippitt, Edward J. Bedrick, Nichole E. Carlson
First: 2026-01-15T18:01:26+00:00 · Latest: 2026-01-15T18:01:26+00:00
Comments: 12 pages, 1 figure
Abstract
Adjusted similarity measures, such as Cohen's kappa for inter-rater reliability and the adjusted Rand index used to compare clustering algorithms, are a vital tool for comparing discrete labellings. These measures are intended to have the property of 0 expectation under a null distribution and maximum value 1 under maximal similarity to aid in interpretation. Measures are frequently adjusted with respect to the permutation distribution for historic and analytic reasons. There is currently renewed interest in considering other null models more appropriate for context, such as clustering ensembles permitting a random number of identified clusters. The purpose of this work is twofold: (1) to generalize the study of the adjustment operator to general null models and to a more general procedure which includes statistical standardization as a special case and (2) to identify sufficient conditions for the adjustment operator to produce the intended properties, where sufficient conditions are related to whether and how observed data are incorporated into null distributions. We demonstrate how violations of the sufficient conditions may lead to substantial breakdown, such as by producing a non-positive measure under traditional adjustment rather than one with mean 0, or by producing a measure which is deterministically 0 under statistical standardization.
中文标题/摘要
标题:调整相似度度量和违反预期
调整后的相似度度量,如用于评分者间信度的科恩κ系数和用于比较聚类算法的调整兰德指数,是对比离散标签的重要工具。这些度量旨在在零分布下具有0的期望值,在最大相似度下具有最大值1,以帮助解释。出于历史和分析上的原因,这些度量经常相对于置换分布进行调整。目前,人们重新关注考虑更适合具体情境的其他零模型,例如允许识别出的簇数量随机的聚类集成。本文的目的有两个方面:(1)将调整算子的研究推广到一般零模型和更一般的程序,其中统计标准化是特殊情况之一;(2)确定调整算子产生预期属性的充分条件,其中充分条件与观测数据是否以及如何被纳入零分布相关。我们展示了违反充分条件可能导致的严重失效,例如在传统调整中产生非正度量而不是均值为0的度量,或者在统计标准化中产生确定为0的度量。
Summary / 总结
This work aims to generalize the study of the adjustment operator for similarity measures under various null models and to identify sufficient conditions for the adjustment operator to produce measures with the intended properties. The study demonstrates that violations of these conditions can lead to significant issues, such as producing non-positive measures under traditional adjustments or deterministically 0 measures under statistical standardization.
论文旨在推广调整操作符在一般null模型下的研究,并确定调整操作符产生具有预期属性的度量的充分条件。作者探讨了不同null模型对调整相似性度量(如Cohen's kappa和调整Rand指数)的影响。主要发现包括识别出传统调整可能产生非正值或确定性零值的条件,强调了选择适当null模型对于准确解释相似性度量的重要性。
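For readers unfamiliar with the adjustment the abstract refers to, here is a minimal sketch of the standard form (S - E[S]) / (S_max - E[S]), applied to raw agreement between two labellings. With the null expectation taken from the two raters' marginal label frequencies, this yields Cohen's kappa. The helper names and example labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

def adjust(observed, expected, maximum=1.0):
    """Rescale a similarity so it is 0 at the null expectation and 1 at the maximum."""
    # Undefined (division by zero) when the null expectation equals the maximum.
    return (observed - expected) / (maximum - expected)

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: raw agreement adjusted by chance agreement from the marginals."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return adjust(observed, expected)

rater_1 = ["x", "x", "y", "y"]
rater_2 = ["x", "y", "y", "y"]
kappa = cohens_kappa(rater_1, rater_2)  # observed 0.75, expected 0.5
```

Note that `expected` is computed from the observed labels themselves. The abstract's sufficient conditions concern exactly this question of whether and how observed data enter the null distribution.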
STEM: Scaling Transformers with Embedding Modules
Authors: Ranajoy Sadhukhan, Sheng Cao, Harry Dong, Changsheng Zhao, Attiano Purpura-Pontoniere, Yuandong Tian, Zechun Liu, Beidi Chen
First: 2026-01-15T18:00:27+00:00 · Latest: 2026-01-15T18:00:27+00:00
Abstract
Fine-grained sparsity promises higher parametric capacity without proportional per-token compute, but often suffers from training instability, load balancing, and communication overhead. We introduce STEM (Scaling Transformers with Embedding Modules), a static, token-indexed approach that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense. This removes runtime routing, enables CPU offload with asynchronous prefetch, and decouples capacity from both per-token FLOPs and cross-device communication. Empirically, STEM trains stably despite extreme sparsity. It improves downstream performance over dense baselines while reducing per-token FLOPs and parameter accesses (eliminating roughly one-third of FFN parameters). STEM learns embedding spaces with large angular spread which enhances its knowledge storage capacity. More interestingly, this enhanced knowledge capacity comes with better interpretability. The token-indexed nature of STEM embeddings allows simple ways to perform knowledge editing and knowledge injection in an interpretable manner without any intervention in the input text or additional computation. In addition, STEM strengthens long-context performance: as sequence length grows, more distinct parameters are activated, yielding practical test-time capacity scaling. Across 350M and 1B model scales, STEM delivers up to ~3-4% accuracy improvements overall, with notable gains on knowledge and reasoning-heavy benchmarks (ARC-Challenge, OpenBookQA, GSM8K, MMLU). Overall, STEM is an effective way of scaling parametric memory while providing better interpretability, better training stability and improved efficiency.
中文标题/摘要
标题:STEM:使用嵌入模块扩展变换器
细粒度稀疏性在不按比例增加每词计算量的情况下承诺了更高的参数容量,但通常会遭受训练不稳定性、负载均衡和通信开销的问题。我们提出了STEM(使用嵌入模块扩展变换器),这是一种静态、按令牌索引的方法,用局部层嵌入查找替换FFN上投影,同时保持门和下投影密集。这消除了运行时路由,允许CPU卸载和异步预取,并将容量与每词FLOPs和跨设备通信脱钩。实验证明,尽管稀疏性极端,STEM仍能稳定训练。它在减少每词FLOPs和参数访问次数的同时,提高了下游性能(消除约三分之一FFN参数)。STEM学习具有大角度扩展的嵌入空间,增强了其知识存储容量。更有趣的是,这种增强的知识容量伴随着更好的可解释性。STEM嵌入的按令牌索引性质允许以简单的方式在不干预输入文本或增加额外计算的情况下进行知识编辑和知识注入。此外,STEM增强了长上下文性能:随着序列长度的增长,更多的不同参数被激活,从而实现实际的测试时容量扩展。在350M和1B模型规模下,STEM总体上提供了高达约3-4%的准确率改进,特别是在知识和推理密集型基准(ARC-Challenge、OpenBookQA、GSM8K、MMLU)上取得了显著进步。总体而言,STEM是一种有效的方法,可以在提供更好的可解释性、更好的训练稳定性和改进的效率的同时扩展参数内存。
Summary / 总结
STEM introduces a static, token-indexed approach to scaling transformers that replaces the FFN up-projection with a layer-local embedding lookup while keeping the gate and down-projection dense, which removes runtime routing and enables CPU offload with asynchronous prefetch. Empirically, STEM trains stably despite extreme sparsity, improves downstream performance over dense baselines, and reduces per-token FLOPs and parameter accesses. Its token-indexed embeddings also support interpretable knowledge editing and injection. Long-context performance improves because more distinct parameters are activated as sequence length grows. Across 350M and 1B model scales, STEM delivers overall accuracy improvements of up to roughly 3-4%, with notable gains on knowledge- and reasoning-heavy benchmarks.
STEM通过引入一种静态的、基于token的方案,将FFN的上投影替换为层局部的嵌入查找,同时保持门和下投影密集,解决了细粒度稀疏性在变压器中的挑战。该方法增强了训练稳定性,减少了每token的FLOPs和参数访问,并在各种基准上提高了3-4%的整体准确性。STEM还提供了更好的可解释性以及随着序列长度增加的长上下文性能提升。
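The core substitution described in the abstract can be sketched in a few lines. The sketch assumes a gated FFN of the form down(act(gate(x)) * up(x)) with a SiLU activation. The activation, the tiny dimensions, and all names here are assumptions made for illustration and are not the authors' implementation.

```python
import math

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def silu(z):
    return z / (1.0 + math.exp(-z))

def gated_ffn(x, w_gate, w_up, w_down):
    """Dense baseline: the up branch is a matrix product with the hidden state."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def stem_ffn(x, token_id, w_gate, embedding_table, w_down):
    """Token-indexed variant: the up branch is a table lookup keyed by the token id."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = embedding_table[token_id]  # lookup replaces the dense up-projection
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

# Toy sizes: model width 2, FFN width 3, vocabulary of 2 tokens.
x = [1.0, -1.0]
w_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
embedding_table = [matvec(w_up, x), [0.0, 0.0, 0.0]]
out = stem_ffn(x, 0, w_gate, embedding_table, w_down)
```

Because the lookup depends only on the token id, the row can in principle be fetched before the hidden state is available, which is consistent with the abstract's point about CPU offload with asynchronous prefetch.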
Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
Authors: Rui Liu, Dian Yu, Tong Zheng, Runpeng Dai, Zongxia Li, Wenhao Yu, Zhenwen Liang, Linfeng Song, Haitao Mi, Pratap Tokekar, Dong Yu
First: 2025-10-01T20:32:08+00:00 · Latest: 2026-01-15T17:51:14+00:00
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced reasoning capabilities in multimodal large language models. However, existing methods typically treat visual inputs as deterministic, overlooking the perceptual ambiguity inherent to the visual modality. Consequently, they fail to distinguish whether a model's uncertainty stems from complex reasoning or ambiguous perception, preventing the targeted allocation of exploration or learning signals. To address this gap, we introduce DUPL, a dual-uncertainty guided policy learning approach for multimodal RLVR that quantifies and leverages both perceptual uncertainty (via symmetric KL divergence) and output uncertainty (via policy entropy) to guide policy updates. By establishing an uncertainty-driven feedback loop and employing a dynamic branch prioritization mechanism, DUPL recalibrates the policy advantage to focus learning on states with high perceptual or decisional ambiguity, enabling effective targeted exploration beyond passive data augmentation. Implemented on top of GRPO and evaluated on six multimodal mathematical and general-domain reasoning benchmarks, DUPL improves Qwen2.5-VL 3B and 7B models, achieving accuracy gains of up to 11.2% on visual math tasks and up to 7.1% on general-domain reasoning tasks, while consistently outperforming GRPO. These results demonstrate that dual-uncertainty guided policy learning is an effective and generalizable approach for multimodal RLVR.
中文标题/摘要
标题:双不确定性引导的多模态推理策略学习
具有可验证奖励的强化学习(RLVR)在多模态大型语言模型中提升了推理能力。然而,现有方法通常将视觉输入视为确定性的,忽视了视觉模态固有的感知模糊性。因此,它们无法区分模型的不确定性是源自复杂的推理还是模糊的感知,从而阻碍了探索或学习信号的针对性分配。为解决这一问题,我们引入了DUPL,这是一种多模态RLVR中的双不确定性引导策略学习方法,通过对称KL散度量化和利用感知不确定性以及通过策略熵量化和利用输出不确定性来引导策略更新。通过建立不确定性驱动的反馈循环并采用动态分支优先机制,DUPL重新校准策略优势,使其专注于具有高感知或决策模糊性的状态,从而实现超越被动数据增强的有效针对性探索。DUPL基于GRPO实现,并在六个多模态数学和通用领域推理基准测试上进行评估,提高了Qwen2.5-VL 3B和7B模型的准确性,视觉数学任务上的准确率提升高达11.2%,通用领域推理任务上的准确率提升高达7.1%,并且始终优于GRPO。这些结果表明,双不确定性引导策略学习是多模态RLVR中一种有效且可推广的方法。
Summary / 总结
The research aims to enhance the reasoning capabilities of multimodal large language models by addressing the limitations of existing reinforcement learning methods that treat visual inputs deterministically. DUPL, a dual-uncertainty guided policy learning approach, is introduced to quantify and utilize both perceptual and output uncertainties to guide policy updates. This method improves the Qwen2.5-VL 3B and 7B models by up to 11.2% on visual math tasks and 7.1% on general-domain reasoning tasks, outperforming the baseline GRPO method.
研究旨在通过解决现有方法将视觉输入视为确定性的问题,增强多模态大型语言模型在强化学习中的推理能力。引入了DUPL,一种双不确定性引导的策略学习方法,量化并利用感知和输出不确定性来引导策略更新。该方法在Qwen2.5-VL 3B和7B模型上取得了显著效果,视觉数学任务的准确率提高了最多11.2%,一般领域推理任务提高了最多7.1%,并在六个多模态基准测试中优于GRPO。
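The two uncertainty signals named in the abstract are standard quantities, sketched below for discrete distributions. Which distributions DUPL actually compares (for example, outputs under an original versus a perturbed image) is not stated in this digest, so the inputs here are placeholders. The symmetric KL is taken as the sum of both directions, although some authors use the average.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Sum of both KL directions, so the result does not depend on argument order."""
    return kl(p, q) + kl(q, p)

def entropy(p):
    """Shannon entropy in nats; higher means a less decided output distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Placeholder distributions over two candidate answers.
view_a = [0.5, 0.5]
view_b = [0.9, 0.1]
perceptual_signal = symmetric_kl(view_a, view_b)
output_signal = entropy(view_a)
```

A large symmetric KL would indicate that the prediction is sensitive to how the input is perceived, while high entropy indicates indecision in the output itself. The abstract treats these as separate signals for that reason.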
On the Failure of Latent State Persistence in Large Language Models
Authors: Jen-tse Huang, Kaiser Sun, Wenxuan Wang, Mark Dredze
First: 2025-04-30T16:18:39+00:00 · Latest: 2026-01-15T17:44:56+00:00
Comments: 8 pages, 6 figures, 9 tables
Abstract
While Large Language Models (LLMs) excel in reasoning, whether they can sustain persistent latent states remains under-explored. The capacity to maintain and manipulate unexpressed, internal representations, analogous to human working memory, is a cornerstone of complex reasoning. In this paper, we formalize and quantify the "Latent State Persistence" (LSP) gap through three novel experiments. First, we utilize a Number Guessing Game, demonstrating that across independent queries, LLMs fail to allocate probability mass to a singular hidden choice, violating a fundamental probabilistic principle. Second, we employ a Yes-No Game to show that as the number of questions increases, LLMs suffer from "concept drift," leading to inevitable self-contradictions due to the lack of LSP. Finally, inspired by Mathematical Mentalism, we task models with tracking transformations on hidden variables, revealing a failure in variable binding and state evolution when the initial state is not explicitly present in the context. Collectively, these findings suggest that LLMs function as reactive post-hoc solvers rather than proactive planners with LSP. Our work provides a framework for evaluating the fidelity of internal representations and highlights a fundamental architectural divergence between autoregressive transformers and human-like cognition.
中文标题/摘要
标题:关于大型语言模型中潜在状态持久性的失败
虽然大型语言模型(LLMs)在推理方面表现出色,但它们能否维持持久的潜在状态仍是一个未被充分探索的问题。维持和操作未表达的内部表示——类似于人类的工作记忆——是复杂推理的基础。在本文中,我们通过三个新的实验正式化并量化了“潜在状态持久性”(LSP)差距。首先,我们使用一个数字猜谜游戏,表明LLMs在独立查询中无法将概率质量分配给单一隐藏选择,违反了基本的概率原则。其次,我们使用一个是/否游戏来展示,随着问题数量的增加,LLMs会遭受“概念漂移”,导致由于缺乏LSP而不可避免地产生自相矛盾。最后,受数学心灵主义的启发,我们要求模型跟踪隐藏变量的变换,揭示了当初始状态未明确出现在上下文中时,变量绑定和状态演变的失败。这些发现共同表明,LLMs作为反应性的事后解决者而非具有LSP的前瞻性规划者运作。我们的工作提供了一个评估内部表示保真的框架,并突显了自回归变换器与人类认知之间基本的架构差异。
Summary / 总结
This paper explores the capability of Large Language Models (LLMs) to maintain persistent latent states, which is crucial for complex reasoning. Through three experiments, the authors demonstrate that LLMs fail to allocate probability mass to a single hidden choice in a Number Guessing Game, exhibit concept drift in a Yes-No Game, and struggle with variable binding and state evolution in a task inspired by Mathematical Mentalism. These findings suggest that LLMs are reactive post-hoc solvers rather than proactive planners with persistent latent states.
本文探讨了大型语言模型(LLMs)维持持久潜状态的能力,这对于复杂推理至关重要。通过三个实验——数字猜谜游戏、是或否游戏和数学心灵主义任务,作者展示了LLMs无法在查询之间维持内部表示,导致概率不一致和自我矛盾。这些发现表明,LLMs更像是反应性的后验求解器,而不是具有持久潜状态的前瞻性规划器。
Can LLMs Understand What We Cannot Say? Measuring Multilevel Alignment Through Abortion Stigma Across Cognitive, Interpersonal, and Structural Levels
Authors: Anika Sharma, Malavika Mampally, Chidaksh Ravuru, Kandyce Brennan, Neil Gaikwad
First: 2025-12-15T09:50:00+00:00 · Latest: 2026-01-15T17:43:09+00:00
Abstract
As Large Language Models (LLMs) increasingly mediate stigmatized health decisions, their capacity to understand complex psychological phenomena remains inadequately assessed. Can LLMs understand what we cannot say? We investigate whether LLMs coherently represent abortion stigma across cognitive, interpersonal, and structural levels. We systematically tested 627 demographically diverse personas across five leading LLMs using the validated Individual Level Abortion Stigma Scale (ILAS), examining representation at cognitive (self-judgment), interpersonal (worries about judgment and isolation), and structural (community condemnation and disclosure patterns) levels. Models fail tests of genuine understanding across all dimensions. They underestimate cognitive stigma while overestimating interpersonal stigma, introduce demographic biases assigning higher stigma to younger, less educated, and non-White personas, and treat secrecy as universal despite 36% of humans reporting openness. Most critically, models produce internal contradictions: they overestimate isolation yet predict isolated individuals are less secretive, revealing incoherent representations. These patterns show current alignment approaches ensure appropriate language but not coherent understanding across levels. This work provides empirical evidence that LLMs lack coherent understanding of psychological constructs operating across multiple dimensions. AI safety in high-stakes contexts demands new approaches to design (multilevel coherence), evaluation (continuous auditing), governance and regulation (mandatory audits, accountability, deployment restrictions), and AI literacy in domains where understanding what people cannot say determines whether support helps or harms.
中文标题/摘要
标题:大型语言模型能否理解我们无法表达的内容?通过堕胎污名在认知、人际和结构层面的测量来考察多层面一致性
随着大型语言模型(LLMs)越来越多地参与敏感健康决策,它们理解复杂心理现象的能力仍被严重低估。LLMs能否理解我们无法表达的内容?我们研究了LLMs是否在认知、人际和结构层面一致地代表堕胎污名。我们系统地测试了五种主流LLM中的627个不同背景的人格,使用验证过的个体层面堕胎污名量表(ILAS),考察了认知(自我判断)、人际(担心被评判和孤立)和结构(社区谴责和披露模式)层面的代表情况。模型在所有维度上都未能通过真正理解的测试。它们低估了认知污名,而高估了人际污名,引入了人口统计学偏见,将更高污名赋予年轻、教育程度较低和非白人的人格,并将保密视为普遍现象,尽管36%的人报告了开放性。最关键的是,模型产生了内部矛盾:它们高估了孤立,但预测孤立个体更不保密,揭示了不一致的代表。这些模式表明,当前的一致性方法确保了适当的语言,但没有在不同层面实现一致的理解。本研究提供了实证证据,证明LLMs缺乏在多个维度上运作的心理结构的连贯理解。在高风险情境下的AI安全需要新的设计(多层面一致性)、评估(持续审计)、治理和监管(强制审计、问责、部署限制)以及AI素养方法,特别是在理解人们无法表达的内容时,支持是帮助还是伤害决定了这一点。
Summary / 总结
This study evaluates whether Large Language Models (LLMs) can understand complex psychological phenomena like abortion stigma across cognitive, interpersonal, and structural levels. Using the Individual Level Abortion Stigma Scale, the research tested 627 demographically diverse personas across five leading LLMs. The models failed to demonstrate genuine understanding, underestimating cognitive stigma and overestimating interpersonal stigma, and introduced demographic biases. They also produced internal contradictions, indicating a lack of coherent understanding. This suggests current alignment approaches are insufficient for ensuring models can handle high-stakes contexts like stigmatized health decisions.
研究评估了大型语言模型(LLMs)是否能理解跨认知、人际和结构层面的复杂心理现象,如堕胎污名。使用个体水平堕胎污名量表,研究测试了五个领先LLM的627个不同背景的人格。模型未能展示真正的理解能力,低估了认知污名,高估了人际污名,并引入了人口统计学偏见。它们还产生了内部矛盾,表明缺乏连贯的理解。这表明当前的对齐方法不足以确保模型能够处理如涉及污名化健康决策等高风险情境。
Explicit Abstention Knobs for Predictable Reliability in Video Question Answering
Authors: Jorge Ortiz
First: 2025-12-31T23:27:32+00:00 · Latest: 2026-01-15T17:31:17+00:00
Comments: Preprint. Diagnostic study of confidence-based abstention under evidence truncation
Abstract
High-stakes deployment of vision-language models (VLMs) requires selective prediction, where systems abstain when uncertain rather than risk costly errors. We investigate whether confidence-based abstention provides reliable control over error rates in video question answering, and whether that control remains robust under distribution shift. Using NExT-QA and Gemini 2.0 Flash, we establish two findings. First, confidence thresholding provides mechanistic control in-distribution. Sweeping threshold epsilon produces smooth risk-coverage tradeoffs, reducing error rates f
中文标题/摘要
标题:显式弃权开关以实现视频问答中的可预测可靠性
在高风险部署视觉-语言模型(VLMs)时,需要选择性预测,即系统在不确定时弃权,而不是冒高成本错误的风险。我们研究了基于信心的弃权是否能提供对错误率的可靠控制,以及这种控制在分布偏移下是否保持稳健。使用NExT-QA和Gemini 2.0 Flash,我们建立了两个发现。首先,置信度阈值化提供了在分布内进行机制性控制。扫掠阈值ε产生平滑的风险-覆盖率权衡,降低错误率
Summary / 总结
The research aims to improve the reliability of vision-language models in high-stakes applications by enabling selective prediction, where the models abstain when uncertain. The study investigates the effectiveness of confidence-based abstention in controlling error rates in video question answering and its robustness under distribution shift. Key findings show that confidence thresholding offers reliable control over error rates within the distribution, with smooth risk-coverage tradeoffs as the threshold varies.
研究旨在通过使视觉-语言模型在不确定时能够拒绝预测来提高其在高风险应用中的可靠性。研究使用基于信心的拒绝机制,并发现通过调整信心阈值可以可靠地控制错误率。方法包括调整阈值ε以实现风险-覆盖率的平滑权衡,从而在保持模型在训练分布内的可预测性的同时减少错误。
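The threshold sweep described in this abstract can be illustrated with a minimal sketch. It assumes each answer comes with a scalar confidence and a correctness flag; the function name `risk_coverage` and the toy data are hypothetical and are not taken from the paper. For each threshold epsilon the system answers only when confidence is at least epsilon, and we report coverage (fraction answered) and risk (error rate among answered questions).

```python
from typing import List, Tuple


def risk_coverage(
    confidences: List[float],
    correct: List[bool],
    thresholds: List[float],
) -> List[Tuple[float, float, float]]:
    """Return (epsilon, coverage, risk) for each abstention threshold.

    The model answers only when confidence >= epsilon and abstains otherwise.
    Risk is the error rate among answered items (0.0 if nothing is answered).
    """
    n = len(confidences)
    rows = []
    for eps in thresholds:
        # Keep the correctness flags of the items the model chooses to answer.
        answered = [ok for c, ok in zip(confidences, correct) if c >= eps]
        coverage = len(answered) / n
        errors = sum(1 for ok in answered if not ok)
        risk = errors / len(answered) if answered else 0.0
        rows.append((eps, coverage, risk))
    return rows


# Hypothetical toy data: six answers with confidences and correctness flags.
confidences = [0.95, 0.90, 0.80, 0.60, 0.55, 0.30]
correct = [True, True, False, True, False, False]

curve = risk_coverage(confidences, correct, [0.0, 0.5, 0.85])
for eps, coverage, risk in curve:
    print(f"eps={eps:.2f} coverage={coverage:.2f} risk={risk:.2f}")
```

On this toy data, raising the threshold lowers both coverage and risk, which is the in-distribution tradeoff the abstract describes. The sketch says nothing about behavior under distribution shift.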
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Authors: Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
First: 2026-01-15T17:27:44+00:00 · Latest: 2026-01-15T17:27:44+00:00
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding, either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single-image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show that bi-directional attention on vision tokens and a novel token-weight strategy improve performance. Our best-in-class 8B model outperforms others in the class of open-weight, open-data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
中文标题/摘要
标题:Molmo2:开放权重和数据的视觉-语言模型,具备视频理解与定位能力
当今最强的视频-语言模型(VLMs)仍为私有。最强的开放权重模型要么依赖于私有VLMs的合成数据,有效从中提炼,要么不披露其训练数据或方法。因此,开源社区缺乏改进当前最先进的视频(和图像)语言模型的基础。至关重要的是,许多下游应用不仅需要高层次的视频理解,还需要定位——无论是通过指针还是像素跟踪。即使私有模型也缺乏这种能力。我们提出了Molmo2,这是一种新的VLM家族,是开源模型中的最先进的,并展示了在单图像、多图像和视频任务中出色的基于指针的定位新能力。我们的主要贡献是一系列7个新的视频数据集和2个多图像数据集,包括用于预训练的详细视频字幕数据集、自由形式的视频问答数据集、新的具有复杂查询的对象跟踪数据集以及创新的视频指针数据集,所有这些数据集均未使用封闭的VLMs收集。我们还提供了一种利用高效打包和消息树编码方案的数据训练食谱,并展示了在视觉标记上进行双向注意以及一种新的标记权重策略可以提高性能。我们的最佳8B模型在短视频、计数和字幕方面优于其他开放权重和数据模型,并在长视频方面具有竞争力。在视频定位方面,Molmo2显著优于现有开放权重模型如Qwen3-VL(视频计数准确率为35.5 vs 29.6)并超越了某些任务上的私有模型如Gemini 3 Pro(视频指针F1得分为38.4 vs 20.0,视频跟踪J&F得分为56.2 vs 41.1)
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
First: 2026-01-13T17:48:43+00:00 · Latest: 2026-01-15T17:24:46+00:00
Comments: Work in Progress
Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@k across large sampling budgets and increases the area under the pass@k curve (AUC@K) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
中文标题/摘要
标题:奖励稀有性:面向创造型问题解决的LLM独特性意识强化学习
强化学习(RL)已成为大型语言模型(LLMs)后训练的核心范式,特别是在复杂推理任务中,但它经常遭受探索崩溃的问题:策略过早地集中在少数占主导地位的推理模式上,提高了pass@1,但限制了rollout级别的多样性以及pass@k的收益。我们认为这种失败源于对局部token行为的正则化,而不是对解决方案集的多样性。为了解决这个问题,我们提出了独特性意识强化学习,这是一种rollout级别的目标,明确奖励那些表现出罕见高级策略的正确解决方案。该方法使用基于LLM的裁判将相同问题的rollout根据其高级解决方案策略聚类,忽略表面差异,并根据集群大小反向重新加权策略优势。因此,正确但新颖的策略比冗余策略获得更高的奖励。在数学、物理和医学推理基准测试中,我们的方法在大规模采样预算下始终如一地提高了pass@$k$,并增加了pass@$k$曲线下的面积(AUC@$K$),同时不牺牲pass@1,同时保持探索并揭示更多多样化的解决方案策略。
Summary / 总结
The paper addresses the issue of exploration collapse in reinforcement learning for large language models, where policies tend to focus on a few dominant strategies. It introduces Uniqueness-Aware Reinforcement Learning, which rewards rare high-level strategies to promote diversity. The method uses an LLM-based judge to cluster rollouts and reweights policy advantages inversely with cluster size, leading to improved pass@k across various reasoning benchmarks without sacrificing pass@1, and enhancing exploration and diversity of solutions.
论文针对强化学习(RL)在大型语言模型(LLMs)中的探索枯竭问题,即政策倾向于集中在少数主导推理模式上。为此,作者提出了基于独特性意识的RL方法,该方法通过LLM判别器对同一问题的策略进行高层面聚类,并重新加权策略优势,以奖励展示罕见高层面策略的正确解决方案。这种方法在各种推理基准测试中提高了pass@$k$,增加了AUC@$K$,同时不牺牲pass@1,从而促进了更广泛的创新性问题解决策略。
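The reweighting step in this abstract (advantages scaled inversely with strategy-cluster size) can be sketched as follows. This is only an illustration under stated assumptions: the cluster labels are taken as given (in the paper they come from an LLM-based judge), the advantage is a simple mean-baseline difference, and the function name `uniqueness_weighted_advantages` is hypothetical. The paper's exact objective may differ.

```python
from collections import Counter
from typing import List


def uniqueness_weighted_advantages(
    rewards: List[float],
    cluster_ids: List[int],
) -> List[float]:
    """Scale each rollout's advantage by 1 / (size of its strategy cluster).

    Assumptions (not from the paper): the advantage is reward minus the
    mean reward over the rollouts of one problem, and the same 1/size
    weight is applied to every rollout, correct or not.
    """
    if len(rewards) != len(cluster_ids):
        raise ValueError("rewards and cluster_ids must have equal length")
    sizes = Counter(cluster_ids)          # rollouts per strategy cluster
    baseline = sum(rewards) / len(rewards)
    return [(r - baseline) / sizes[c] for r, c in zip(rewards, cluster_ids)]


# Hypothetical rollouts for one problem: four correct, two incorrect.
# Three correct rollouts share strategy 0; one correct rollout uses strategy 1.
rewards = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
cluster_ids = [0, 0, 0, 1, 2, 2]

adv = uniqueness_weighted_advantages(rewards, cluster_ids)
print(adv)
```

In this toy case the lone rollout with the rare strategy gets three times the advantage of each rollout in the crowded cluster, so a correct but novel strategy is credited more than a redundant one.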
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models
Authors: Linlu Qiu, Fei Sha, Kelsey Allen, Yoon Kim, Tal Linzen, Sjoerd van Steenkiste
First: 2025-03-21T20:13:04+00:00 · Latest: 2026-01-15T17:21:57+00:00
Comments: Nature Communications
Abstract
Large language models (LLMs) are increasingly used as agents that interact with users and with the world. To do so successfully, LLMs must construct representations of the world and form probabilistic beliefs about them. To provide personalized recommendations, for example, the LLM needs to infer a user's preferences from their behavior over multiple interactions. The Bayesian inference framework lays out the optimal way for an agent to update its beliefs as it receives new information. We first show that LLMs fall far short of the standard defined by the Bayesian framework. We then show that by teaching LLMs to mimic the predictions of the normative Bayesian model, we can dramatically improve their ability to update their beliefs; this ability generalizes to new tasks. We conclude that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
中文标题/摘要
标题:贝叶斯教学使大型语言模型具备概率推理能力
大型语言模型(LLMs)越来越多地被用作与用户和世界互动的代理。为了成功地做到这一点,LLMs 必须构建对世界的表示,并形成关于它们的概率性信念。例如,为了提供个性化的推荐,LLM 需要从用户在多次交互中的行为中推断出用户的偏好。贝叶斯推理框架列出了代理在接收到新信息时更新其信念的最佳方式。我们首先表明,LLMs 在达到贝叶斯框架设定的标准方面远远不够。然后我们表明,通过教导 LLMs 模仿规范性贝叶斯模型的预测,我们可以显著提高它们更新信念的能力;这种能力可以泛化到新任务。我们得出结论,LLMs 可以从示例中有效学习推理技能,并将这些技能泛化到新的领域。
Summary / 总结
The research aims to enhance the probabilistic reasoning capabilities of large language models (LLMs) by aligning them with Bayesian inference principles. The study demonstrates that LLMs currently perform poorly in updating their beliefs based on new information. By training LLMs to mimic the predictions of a normative Bayesian model, the researchers significantly improved the models' ability to update their beliefs, which generalized to new tasks. This suggests that LLMs can learn reasoning skills from examples and apply them to new domains.
研究旨在通过使大型语言模型(LLMs)符合贝叶斯推理原则来增强其概率推理能力。研究人员首先展示了LLMs在根据新信息更新其信念方面表现不佳。然后,他们开发了一种方法来教LLMs模仿规范的贝叶斯模型的预测,显著提高了它们的信念更新能力,并且这种能力可以应用于新任务。这表明LLMs可以从示例中学习推理技能并在不同领域应用这些技能。
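The normative belief update that this abstract refers to is Bayes' rule: the posterior is proportional to the prior times the likelihood of the new observation. A minimal sketch over a discrete set of user-preference hypotheses follows. The hypothesis names and likelihood values are invented for illustration and are not from the paper.

```python
from typing import Dict


def bayes_update(
    prior: Dict[str, float],
    likelihood: Dict[str, float],
) -> Dict[str, float]:
    """Return the posterior P(h | obs), proportional to P(h) * P(obs | h)."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {h: p / evidence for h, p in unnormalized.items()}


# Hypothetical example: two hypotheses about what a user prefers.
belief = {"prefers_cheap": 0.5, "prefers_fast": 0.5}

# Assumed likelihood of observing "user picked the cheaper option"
# under each hypothesis.
picked_cheaper = {"prefers_cheap": 0.8, "prefers_fast": 0.2}

# The same observation is made in two successive interactions.
belief = bayes_update(belief, picked_cheaper)
belief = bayes_update(belief, picked_cheaper)
print(belief)
```

Each interaction sharpens the belief toward the hypothesis that better explains the user's choices. This is the multi-interaction inference the abstract says LLMs should approximate.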
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
Authors: Zhiyuan Hu, Yunhai Hu, Juncheng Liu, Shuyue Stella Li, Yucheng Wang, Zhen Xu, See-Kiong Ng, Anh Tuan Luu, Xinxing Xu, Bryan Hooi, Cynthia Breazeal, Hae Won Park
First: 2026-01-14T17:57:43+00:00 · Latest: 2026-01-15T17:20:36+00:00
Comments: Work in Progress
Abstract
Multi-agent systems have evolved into practical LLM-driven collaborators for many applications, gaining robustness from diversity and cross-checking. However, multi-agent RL (MARL) training is resource-intensive and unstable: co-adapting teammates induce non-stationarity, and rewards are often sparse and high-variance. Therefore, we introduce Multi-Agent Test-Time Reinforcement Learning (MATTRL), a framework that injects structured textual experience into multi-agent deliberation at inference time. MATTRL forms a multi-expert team of specialists for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. We also study credit assignment for constructing a turn-level experience pool, then reinjecting it into the dialogue. Across challenging benchmarks in medicine, math, and education, MATTRL improves accuracy by an average of 3.67% over a multi-agent baseline, and by 8.67% over comparable single-agent baselines. Ablation studies examine different credit-assignment schemes and provide a detailed comparison of how they affect training outcomes. MATTRL offers a stable, effective and efficient path to distribution-shift-robust multi-agent reasoning without tuning.
中文标题/摘要
标题:协作多智能体推理时的强化学习
多智能体系统已成为许多应用中的实用LLM驱动合作者,通过多样性和交叉验证获得鲁棒性。然而,多智能体强化学习(MARL)训练资源密集且不稳定:队友的共同适应导致非平稳性,奖励通常稀疏且高方差。因此,我们引入了**多智能体推理时的强化学习(MATTRL)**框架,在推理时向多智能体讨论注入结构化的文本经验。MATTRL 形成一个由专家组成的多专家团队,进行多轮讨论,检索并整合推理时的经验,并达成共识进行最终决策。我们还研究了信用分配方法以构建轮次级经验池,然后将其重新注入对话中。在医学、数学和教育领域的具有挑战性的基准测试中,MATTRL 在多智能体基线上的准确率平均提高了3.67%,在可比的单智能体基线上的准确率提高了8.67%。消融研究探讨了不同的信用分配方案,并详细比较了它们对训练结果的影响。MATTRL 提供了一条稳定、有效且高效的路径,无需调整即可实现分布转移鲁棒的多智能体推理。
Summary / 总结
The paper introduces Multi-Agent Test-Time Reinforcement Learning (MATTRL), which injects structured textual experience into multi-agent deliberation at inference time to improve robustness and accuracy. MATTRL forms a multi-expert team for multi-turn discussions, retrieves and integrates test-time experiences, and reaches consensus for final decision-making. Experiments across medicine, math, and education show that MATTRL improves accuracy by 3.67% over a multi-agent baseline and by 8.67% over single-agent baselines.
研究引入了Multi-Agent Test-Time Reinforcement Learning (MATTRL),在推理时注入结构化的文本经验,以提高鲁棒性和准确性。MATTRL 形成一个多专家团队进行多轮讨论,检索并整合测试时的经验,并达成共识进行最终决策。在各种基准测试中,MATTRL 的准确率分别比多智能体基线提高了 3.67%,比单智能体基线提高了 8.67%。消融研究探讨了不同的信用分配方案,以增强训练效果。
Procedural Fairness in Multi-Agent Bandits
Authors: Joshua Caiata, Carter Blair, Kate Larson
First: 2026-01-15T17:11:51+00:00 · Latest: 2026-01-15T17:11:51+00:00
Abstract
In the context of multi-agent multi-armed bandits (MA-MAB), fairness is often reduced to outcomes: maximizing welfare, reducing inequality, or balancing utilities. However, evidence in psychology, economics, and Rawlsian theory suggests that fairness is also about process and who gets a say in the decisions being made. We introduce a new fairness objective, procedural fairness, which provides equal decision-making power for all agents, lies in the core, and provides for proportionality in outcomes. Empirical results confirm that fairness notions based on optimizing for outcomes sacrifice equal voice and representation, while the sacrifice in outcome-based fairness objectives (like equality and utilitarianism) is minimal under procedurally fair policies. We further prove that different fairness notions prioritize fundamentally different and incompatible values, highlighting that fairness requires explicit normative choices. This paper argues that procedural legitimacy deserves greater focus as a fairness objective, and provides a framework for putting procedural fairness into practice.
中文标题/摘要
标题:多智能体多臂老虎机中的程序公平性
在多智能体多臂老虎机(MA-MAB)的背景下,公平性通常被简化为结果:最大化福利、减少不平等或平衡效用。然而,心理学、经济学和罗尔斯理论的证据表明,公平性还关乎过程以及谁在决策中拥有发言权。我们引入了一个新的公平性目标——程序公平性,它为所有智能体提供了平等的决策权,处于核心位置,并确保结果的成比例性。实证结果表明,基于优化结果的公平性观念牺牲了平等的声音和代表性,而基于结果的公平性目标(如平等和功利主义)在程序公平政策下的牺牲最小。我们进一步证明,不同的公平性观念优先考虑的是根本不同且不兼容的价值观,突显了公平性需要明确规范性选择。本文认为,程序合法性作为公平性目标应得到更多关注,并提供了一个将程序公平性付诸实践的框架。
Summary / 总结
This paper addresses the concept of fairness in multi-agent multi-armed bandits by focusing on procedural fairness, which ensures equal decision-making power for all agents. The authors introduce a new fairness objective that balances process and outcomes, showing that outcome-based fairness often sacrifices equal voice and representation. Experimental results indicate that procedural fairness minimally impacts outcomes while ensuring all agents have a say in decision-making. The paper also demonstrates that different fairness notions prioritize conflicting values, emphasizing the need for explicit normative choices in fairness objectives.
该论文探讨了多智能体多臂老虎机中的公平性问题,强调了程序公平的重要性,即确保所有智能体拥有平等的决策权。研究对比了程序公平与基于结果的公平性,表明虽然基于结果的公平性会牺牲平等发言权,但程序公平对结果的影响较小。研究结果表明,不同的公平性观念优先考虑的是相互冲突的价值观,强调了明确规范选择的必要性。研究指出,程序公平,即重视平等参与,是多智能体系统中一个有价值的公平性目标。