WorldCam: Interactive Autoregressive 3D Gaming Worlds with Camera Pose as a Unifying Geometric Representation
Authors: Jisu Nam, Yicong Hong, Chun-Hao Paul Huang, Feng Liu, JoungBin Lee, Jiyoung Kim, Siyoon Jin, Yunsung Lee, Jaeyoon Jung, Suhwan Choi, Seungryong Kim, Yang Zhou
First: 2026-03-17T17:59:56+00:00 · Latest: 2026-03-17T17:59:56+00:00
Comments: Project page is available at https://cvlab-kaist.github.io/WorldCam/
Abstract
Recent advances in video diffusion transformers have enabled interactive gaming world models that allow users to explore generated environments over extended horizons. However, existing approaches struggle with precise action control and long-horizon 3D consistency. Most prior works treat user actions as abstract conditioning signals, overlooking the fundamental geometric coupling between actions and the 3D world, whereby actions induce relative camera motions that accumulate into a global camera pose within a 3D world. In this paper, we establish camera pose as a unifying geometric representation to jointly ground immediate action control and long-term 3D consistency. First, we define a physics-based continuous action space and represent user inputs in the Lie algebra to derive precise 6-DoF camera poses, which are injected into the generative model via a camera embedder to ensure accurate action alignment. Second, we use global camera poses as spatial indices to retrieve relevant past observations, enabling geometrically consistent revisiting of locations during long-horizon navigation. To support this research, we introduce a large-scale dataset comprising 3,000 minutes of authentic human gameplay annotated with camera trajectories and textual descriptions. Extensive experiments show that our approach substantially outperforms state-of-the-art interactive gaming world models in action controllability, long-horizon visual quality, and 3D spatial consistency.
中文标题/摘要
标题:WorldCam:以相机姿态为统一几何表示的交互式自回归3D游戏世界
近期视频扩散变换器的发展使交互式游戏世界模型得以实现,用户可以探索生成的环境,但现有方法在精确动作控制和长时3D一致性方面存在困难。大多数先前的工作将用户动作视为抽象的条件信号,忽视了动作与3D世界的几何耦合,动作会诱导相对相机运动,这些运动在3D世界中累积形成全局相机姿态。在本文中,我们确立相机姿态为统一的几何表示,以联合实现即时动作控制和长期3D一致性。首先,我们定义了一个基于物理的连续动作空间,并将用户输入表示为李代数,以推导出精确的6自由度相机姿态,这些姿态通过相机嵌入器注入生成模型,以确保准确的动作对齐。其次,我们使用全局相机姿态作为空间索引来检索相关的历史观察结果,使在长时导航过程中能够几何一致地重新访问位置。为了支持这项研究,我们引入了一个大规模数据集,包含3000分钟的真实人类游戏视频,并标注了相机轨迹和文本描述。广泛的实验表明,我们的方法在动作可控性、长时视觉质量和3D空间一致性方面显著优于最先进的交互式游戏世界模型。
Summary / 总结
This paper introduces WorldCam, which uses camera pose as a unifying geometric representation to improve action control and long-term 3D consistency in interactive gaming worlds. It defines a physics-based continuous action space and uses Lie algebra to derive precise 6-DoF camera poses, which are then injected into the generative model. Additionally, it uses global camera poses as spatial indices to ensure geometric consistency during long-horizon navigation. Experiments show that this approach outperforms existing models in action controllability, long-horizon visual quality, and 3D spatial consistency.
本文介绍了WorldCam,该方法将相机姿态作为统一的几何表示,以提高交互式游戏世界中的动作控制和长期3D一致性。它定义了一个基于物理的连续动作空间,并使用李代数来推导出精确的6-DoF相机姿态,这些姿态被注入生成模型中。该方法还使用全局相机姿态作为空间索引来在长时间导航中实现几何一致性重访。实验表明,WorldCam在动作可控性、长时视觉质量和3D空间一致性方面优于现有模型。
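To make the Lie-algebra step in the WorldCam entry concrete, here is a minimal sketch of mapping a per-step twist in se(3) to a 6-DoF pose and accumulating relative motions into a global camera pose. The function names, the twist ordering (translation first, then rotation), and the right-multiplication convention are assumptions for illustration, not details taken from the paper.

```python
import math

def _matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def se3_exp(twist):
    """Exponential map se(3) -> SE(3) for twist = (vx, vy, vz, wx, wy, wz).

    ASSUMPTION: translation-first ordering; the paper does not specify one.
    """
    vx, vy, vz, wx, wy, wz = twist
    theta = math.sqrt(wx * wx + wy * wy + wz * wz)
    K = [[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]]  # skew matrix of w
    K2 = _matmul(K, K)
    if theta < 1e-8:
        a, b, c = 1.0, 0.5, 1.0 / 6.0  # small-angle limits of the coefficients
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / theta ** 2
        c = (theta - math.sin(theta)) / theta ** 3
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rodrigues formula for the rotation, and the matrix V that maps v to t.
    R = [[eye[i][j] + a * K[i][j] + b * K2[i][j] for j in range(3)] for i in range(3)]
    V = [[eye[i][j] + b * K[i][j] + c * K2[i][j] for j in range(3)] for i in range(3)]
    t = [sum(V[i][j] * v for j, v in enumerate((vx, vy, vz))) for i in range(3)]
    return [R[0] + [t[0]], R[1] + [t[1]], R[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]

def accumulate(twists):
    """Compose relative motions into a global pose.

    ASSUMPTION: each twist is expressed in the current camera frame, so the
    relative pose is right-multiplied onto the global pose.
    """
    pose = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for tw in twists:
        pose = _matmul(pose, se3_exp(tw))
    return pose
```

Under these conventions, four identical steps that each combine a forward motion with a quarter turn bring the camera back to its starting pose, which is the kind of revisit that a global pose index can detect.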
Demystifying Video Reasoning
Authors: Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang
Venue: www
First: 2026-03-17T17:59:55+00:00 · Latest: 2026-03-17T17:59:55+00:00
Comments: Homepage: https://www.wruisi.com/demystifying_video_reasoning
Abstract
Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
中文标题/摘要
标题:揭秘视频推理
近期在视频生成方面的进展揭示了一个意想不到的现象:基于扩散的视频模型表现出非平凡的推理能力。先前的工作将这一现象归因于帧链(CoF)机制,推理被认为在视频帧之间顺序展开。在本项工作中,我们挑战了这一假设并揭示了一种根本不同的机制。我们表明,视频模型中的推理主要在去噪步骤中出现。通过定性分析和有针对性的探针实验,我们发现模型在早期去噪步骤中探索多个候选解决方案,并逐步收敛到最终答案,我们将其称为步骤链(CoS)。除了这一核心机制外,我们还识别出几种对模型性能至关重要的新兴推理行为:(1)工作记忆,使参考持久化;(2)自我纠正和增强,允许从错误的中间解决方案中恢复;(3)先感知后行动,早期步骤建立语义基础,后期步骤执行结构化操作。在去噪步骤中,我们进一步发现扩散变换器内部自我进化出的功能专业化,早期层编码密集的感知结构,中间层执行推理,后期层整合潜在表示。受这些见解的启发,我们提出了一种简单的无训练策略作为概念验证,展示了如何通过从具有不同随机种子的相同模型中组合潜在轨迹来提高推理能力。总体而言,我们的工作为理解视频生成模型中推理的出现提供了系统性的理解,为未来更好地利用视频模型固有的推理动态作为智能新基质的研究奠定了基础。
Summary / 总结
This work challenges the prevailing assumption that video reasoning in diffusion-based models occurs sequentially across frames, instead proposing a Chain-of-Steps (CoS) mechanism where reasoning primarily emerges along denoising steps. Key findings include models exploring multiple candidate solutions early on and progressively converging to a final answer. The study also identifies emergent behaviors such as working memory, self-correction, and perception before action. Additionally, it reveals functional specialization within diffusion transformers, with early layers encoding perceptual structure, middle layers executing reasoning, and later layers consolidating representations. A simple training-free strategy is proposed to improve reasoning by ensembling latent trajectories from identical models with different random seeds.
这项研究挑战了视频模型中的推理过程是在帧间顺序发生的假设,而是提出了一个步骤链(CoS)机制,推理主要在去噪步骤中出现。研究中还识别出了一些关键的推理行为,如工作记忆、自我纠正和感知先于行动,并表明早期去噪步骤会探索多个候选解决方案,最终收敛到一个最终答案。研究还揭示了扩散变换器内部的功能专业化,早期层编码感知结构,中间层执行推理,后期层整合潜在表示。提出了一种无需训练的策略,通过从具有不同随机种子的相同模型中组合潜在轨迹来提高推理能力,展示了更好地利用视频模型中固有的推理动态的潜力。
SegviGen: Repurposing 3D Generative Model for Part Segmentation
Authors: Lin Li, Haoran Feng, Zehuan Huang, Haohua Chen, Wenbo Nie, Shaohua Hou, Keqing Fan, Pan Hu, Sheng Wang, Buyu Li, Lu Sheng
First: 2026-03-17T17:59:51+00:00 · Latest: 2026-03-17T17:59:51+00:00
Comments: Project page: https://fenghora.github.io/SegviGen-Page/
Abstract
We introduce SegviGen, a framework that repurposes native 3D generative models for 3D part segmentation. Existing pipelines either lift strong 2D priors into 3D via distillation or multi-view mask aggregation, often suffering from cross-view inconsistency and blurred boundaries, or explore native 3D discriminative segmentation, which typically requires large-scale annotated 3D data and substantial training resources. In contrast, SegviGen leverages the structured priors encoded in a pretrained 3D generative model to induce segmentation through distinctive part colorization, establishing a novel and efficient framework for part segmentation. Specifically, SegviGen encodes a 3D asset and predicts part-indicative colors on active voxels of a geometry-aligned reconstruction. It supports interactive part segmentation, full segmentation, and full segmentation with 2D guidance in a unified framework. Extensive experiments show that SegviGen improves over the prior state of the art by 40% on interactive part segmentation and by 15% on full segmentation, while using only 0.32% of the labeled training data. It demonstrates that pretrained 3D generative priors transfer effectively to 3D part segmentation, enabling strong performance with limited supervision. See our project page at https://fenghora.github.io/SegviGen-Page/.
中文标题/摘要
标题:SegviGen:重用3D生成模型进行部件分割
我们介绍了SegviGen,这是一种框架,它重新利用了原生的3D生成模型来进行3D部件分割。现有的管道要么通过蒸馏或多视图掩码聚合将强大的2D先验引入3D,这通常会导致跨视图不一致和边界模糊,要么探索原生的3D判别分割,这通常需要大量标注的3D数据和大量的训练资源。相比之下,SegviGen 利用预训练的3D生成模型中编码的结构先验来通过特征部件颜色化诱导分割,从而建立了一种新颖且高效的部件分割框架。具体来说,SegviGen 编码了一个3D资产,并在几何对齐的重建的活动体素上预测部件指示颜色。它支持交互式部件分割、完整分割以及带有2D指导的完整分割。广泛的实验表明,SegviGen 在交互式部件分割上的性能比之前的最佳方法提高了40%,在完整分割上的性能提高了15%,同时仅使用了0.32%的标注训练数据。它证明了预训练的3D生成先验可以有效地转移到3D部件分割中,从而在有限监督的情况下实现强大的性能。请参见我们的项目页面:https://fenghora.github.io/SegviGen-Page/
Summary / 总结
SegviGen is a framework that repurposes 3D generative models for 3D part segmentation, improving over previous methods by 40% on interactive part segmentation and 15% on full segmentation. It uses pretrained 3D generative model priors to colorize parts and requires only 0.32% of the labeled training data. This approach avoids the issues of cross-view inconsistency and blurred boundaries found in other methods and supports interactive and full segmentation in a unified framework.
SegviGen 是一个框架,利用 3D 生成模型的先验知识进行 3D 部件分割,相比之前的方法,在交互式部件分割上提高了 40%,在完整分割上提高了 15%。它通过部件颜色化诱导分割,仅需 0.32% 的标注训练数据。该方法支持在统一框架中进行交互式、完整和引导式完整分割,展示了 3D 生成先验知识在部件分割中的有效转移,即使在有限监督下也能实现良好性能。
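The SegviGen entry describes predicting part-indicative colors on active voxels. One plausible way to turn such colors into discrete part labels is nearest-palette assignment, sketched below. This decoding step, the palette, and the function name are assumptions made for illustration; the abstract does not say how colors are converted to labels.

```python
def colors_to_part_labels(voxel_colors, palette):
    """Assign each voxel the index of its nearest palette color.

    ASSUMPTION: nearest-neighbor decoding in RGB space with a known palette;
    this is a hypothetical post-processing step, not the paper's method.
    """
    labels = []
    for color in voxel_colors:
        # Squared Euclidean distance from this voxel color to every palette entry.
        dists = [sum((c - p) ** 2 for c, p in zip(color, entry)) for entry in palette]
        labels.append(dists.index(min(dists)))
    return labels

# Hypothetical palette: part 0 is red, part 1 is green, part 2 is blue.
palette = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
predicted = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.2), (0.0, 0.2, 0.7)]
labels = colors_to_part_labels(predicted, palette)
```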
Efficient Reasoning on the Edge
Authors: Yelysei Bondarenko, Thomas Hehn, Rob Hesselink, Romain Lepert, Fabio Valerio Massoli, Evgeny Mironov, Leyla Mirvakhabova, Tribhuvanesh Orekondy, Spyridon Stasis, Andrey Kuzmin, Anna Kuzina, Markus Nagel, Ankita Nayak, Corrado Rainone, Ork de Rooij, Paul N Whatmough, Arash Behboodi, Babak Ehteshami Bejnordi
First: 2026-03-17T17:59:51+00:00 · Latest: 2026-03-17T17:59:51+00:00
Comments: Project page: https://qualcomm-ai-research.github.io/llm-reasoning-on-edge/
Abstract
Large language models (LLMs) with chain-of-thought reasoning achieve state-of-the-art performance across complex problem-solving tasks, but their verbose reasoning traces and large context requirements make them impractical for edge deployment. These challenges include high token generation costs, large KV-cache footprints, and inefficiencies when distilling reasoning capabilities into smaller models for mobile devices. Existing approaches often rely on distilling reasoning traces from larger models into smaller models; these traces are verbose and stylistically redundant, which is undesirable for on-device inference. In this work, we propose a lightweight approach to enable reasoning in small LLMs using LoRA adapters combined with supervised fine-tuning. We further introduce budget forcing via reinforcement learning on these adapters, significantly reducing response length with minimal accuracy loss. To address memory-bound decoding, we exploit parallel test-time scaling, improving accuracy at a minor latency increase. Finally, we present a dynamic adapter-switching mechanism that activates reasoning only when needed and a KV-cache sharing strategy during prompt encoding, reducing time-to-first-token for on-device inference. Experiments on Qwen2.5-7B demonstrate that our method achieves efficient, accurate reasoning under strict resource constraints, making LLM reasoning practical for mobile scenarios. Videos demonstrating our solution running on mobile devices are available on our project page.
中文标题/摘要
标题:边缘端高效推理
大型语言模型(LLMs)结合链式推理在复杂问题解决任务中达到最先进的性能,但其冗长的推理过程和大量上下文需求使其不适合边缘部署。这些挑战包括高标记生成成本、大的KV缓存占用空间以及在将推理能力精简到移动设备的小型模型时的效率低下。现有方法通常依赖于从大模型中提取冗长且风格冗余的推理痕迹,这不适合设备端推理。在本工作中,我们提出了一种轻量级方法,通过LoRA适配器结合监督微调来使小型LLM能够进行推理。我们进一步通过强化学习对这些适配器施加预算限制,显著减少了响应长度,同时保持最小的准确率损失。为了解决内存限制下的解码问题,我们利用并行测试时缩放,提高了准确性,同时仅增加轻微的延迟。最后,我们提出了一种动态适配器切换机制,仅在需要时激活推理,并在提示编码期间采用KV缓存共享策略,减少了设备端推理的首词时间。在Qwen2.5-7B上的实验表明,我们的方法在严格的资源限制下实现了高效且准确的推理,使LLM推理在移动场景中成为可能。有关我们解决方案在移动设备上运行的视频可在项目页面上找到。
Summary / 总结
This work addresses the impracticality of deploying large language models with chain-of-thought reasoning on edge devices due to high token generation costs and large memory footprints. The authors propose a lightweight approach using LoRA adapters and supervised fine-tuning, combined with budget forcing via reinforcement learning to reduce response length. They also introduce parallel test-time scaling and dynamic adapter-switching to improve accuracy and reduce latency. Experiments show that their method enables efficient and accurate reasoning in small LLMs, making it practical for mobile scenarios under strict resource constraints.
本文旨在解决具有链式推理能力的大语言模型(LLM)在边缘部署中的不实用性,由于高令牌生成成本和大上下文需求。作者提出了一种轻量级方法,结合了LoRA适配器和监督微调,并通过强化学习进行预算控制,以减少响应长度同时保持准确性。他们还引入了并行测试时缩放和动态适配器切换机制来提高效率。实验表明,他们的方法在严格资源约束下实现了高效且准确的推理,使LLM推理适用于移动场景。
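As a rough illustration of the LoRA adapters and adapter switching mentioned in this entry, the sketch below applies the standard LoRA update, base output plus a scaled low-rank correction, behind an on/off flag. The function names, the `alpha / r` scaling convention, and the `use_adapter` flag are assumptions; the abstract does not describe how the switching decision is made.

```python
def matvec(m, x):
    """Matrix-vector product on nested lists."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0, use_adapter=True):
    """Compute W x, optionally adding the LoRA term (alpha / r) * B (A x).

    Shapes: W is d_out x d_in, A is r x d_in, B is d_out x r.
    ASSUMPTION: use_adapter stands in for a dynamic switch that enables the
    reasoning adapter only when needed; the real trigger is not specified here.
    """
    base = matvec(W, x)
    if not use_adapter:
        return base  # frozen base model, adapter bypassed
    r = len(A)  # adapter rank
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]      # rank-1 down-projection
B = [[1.0], [0.0]]    # rank-1 up-projection
x = [2.0, 3.0]
with_adapter = lora_forward(W, A, B, x, alpha=2.0, use_adapter=True)
without_adapter = lora_forward(W, A, B, x, alpha=2.0, use_adapter=False)
```

Because the base weights stay untouched, switching the adapter off recovers the original model's output exactly.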
Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory
Authors: Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah
First: 2026-03-17T17:59:20+00:00 · Latest: 2026-03-17T17:59:20+00:00
Abstract
Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
中文标题/摘要
标题:Chronos:基于结构化事件检索的时间感知对话代理以支持长期记忆
近年来,大型语言模型(LLMs)的进步使对话AI代理能够进行持续数周或数月的多轮交互。然而,现有的记忆系统在处理长时间交互中不断演变的时间相关事实和偏好时存在困难,并且缺乏有效的多跳、时间敏感查询检索策略。我们引入了Chronos,这是一种新颖的时间感知记忆框架,将原始对话分解为具有解析时间范围和实体别名的主谓宾事件元组,并在事件日历和保留完整对话上下文的回合日历中进行结构化索引。在查询时,Chronos应用动态提示生成针对每个问题的定制检索指导,指导代理检索什么、如何在时间范围内过滤以及如何通过在两个日历上迭代工具调用循环中进行多跳推理来处理多跳推理。我们使用8个LLM,包括开源和闭源模型,在包含500个问题的LongMemEvalS基准测试中评估了Chronos,这些问题涵盖了六类对话历史任务。Chronos Low实现了92.60%的准确率,Chronos High达到了95.60%的准确率,创下了新的最先进的技术水平,比最佳先前系统提高了7.67%。消融结果表明,事件日历在基线上的改进率为58.9%,而其他所有组件的改进率在15.5%到22.3%之间。值得注意的是,仅Chronos Low就超过了在最强模型配置下评估的先前方法。
Summary / 总结
Chronos is a temporal-aware memory framework that decomposes dialogue into subject-verb-object event tuples with resolved datetime ranges and indexes them in a structured event calendar alongside a turn calendar. At query time it applies dynamic prompting to generate tailored retrieval guidance for each question, enabling multi-hop, time-sensitive reasoning. Evaluated with 8 LLMs on the LongMemEvalS benchmark, Chronos Low achieves 92.60% and Chronos High 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system.
Chronos 是一种时间感知的记忆框架,将对话分解为结构化的事件元组并索引在日历中。它使用动态提示来指导时间敏感查询的检索,支持多跳推理。在 LongMemEvalS 基准上的评估显示,Chronos Low 和 Chronos High 分别达到了 92.60% 和 95.60% 的准确率,创下了新的最佳水平,比之前的方法提高了 7.67%。消融研究显示,事件日历对性能提升贡献显著。
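The event calendar described in the Chronos entry can be pictured as a list of subject-verb-object tuples with datetime ranges that is filtered by time overlap at query time. The sketch below is a minimal illustration; the `Event` fields, `events_in_range`, and the sample data are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    # ASSUMPTION: field names are illustrative; the paper's schema may differ.
    subject: str
    verb: str
    obj: str
    start: datetime
    end: datetime
    turn_id: int  # link back to the turn that holds the full context

def events_in_range(calendar, start, end, subject=None):
    """Return events whose [start, end] range overlaps the query window."""
    return [
        e for e in calendar
        if e.start <= end and e.end >= start
        and (subject is None or e.subject == subject)
    ]

calendar = [
    Event("user", "visited", "Lisbon", datetime(2025, 3, 1), datetime(2025, 3, 7), 12),
    Event("user", "started", "new job", datetime(2025, 5, 5), datetime(2025, 5, 5), 40),
    Event("sister", "moved to", "Berlin", datetime(2025, 3, 5), datetime(2025, 3, 5), 18),
]
march = events_in_range(calendar, datetime(2025, 3, 1), datetime(2025, 3, 31))
march_user = events_in_range(calendar, datetime(2025, 3, 1), datetime(2025, 3, 31), "user")
```

Keeping a `turn_id` on each event is one simple way to jump from a matched event back to the surrounding conversation.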
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
Authors: Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji
First: 2026-03-17T17:58:44+00:00 · Latest: 2026-03-17T17:58:44+00:00
Comments: Code is available at https://github.com/MAC-AutoML/SocialOmni and dataset is available at https://huggingface.co/datasets/alexisty/SocialOmni
Abstract
Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. We benchmark 12 leading OLMs, uncovering significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, these diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.
中文标题/摘要
标题:SocialOmni:全方位评估多模态社会互动能力
全方位多模态大型语言模型(OLMs)通过原生整合音频、视觉和文本重新定义了人机交互。然而,现有的OLM基准仍然局限于静态、准确性导向的任务,忽略了评估社会互动能力,这是导航自然对话中动态线索的基本能力。为此,我们提出了SocialOmni,一个全面的基准,用于在三个核心维度上操作化评估这种对话互动:(i)说话者分离和识别(谁在说话),(ii)打断时机控制(何时插话),以及(iii)自然打断生成(如何表达打断)。SocialOmni 包含2000个感知样本和一个包含209个交互生成实例的质量控制诊断集,这些实例具有严格的时序和语境约束,同时还包括受控的音频-视觉不一致性场景以测试模型的鲁棒性。我们对12个领先的OLMs进行了基准测试,揭示了它们在社会互动能力上的显著差异。此外,我们的分析表明,模型的感知准确性与其生成上下文适当打断的能力之间存在明显的脱节,这表明仅依靠理解导向的指标不足以表征对话社会能力。更令人鼓舞的是,SocialOmni的这些诊断为未来OLMs中的感知-互动鸿沟提供了可操作的信号。
Summary / 总结
SocialOmni is a benchmark for evaluating the social interactivity of omni-modal large language models (OLMs) across three dimensions: speaker separation and identification, interruption timing control, and natural interruption generation. It includes 2,000 perception samples and 209 interaction-generation instances with strict constraints, along with audio-visual inconsistency scenarios. Benchmarking 12 leading OLMs revealed significant differences in their social-interaction capabilities, highlighting the need for understanding-centric metrics to be complemented by interaction-centric ones. The diagnostics provide insights for improving future OLMs in handling social interactions.
SocialOmni 是一个用于评估 omni-modal 大型语言模型 (OLMs) 在三个维度上的人际互动能力的基准:说话者分离和识别、打断时机控制和自然打断生成。它包含 2,000 个感知样本和 209 个严格时间与上下文约束的交互生成实例。该基准测试了 12 个领先的 OLMs,并发现它们在人际互动能力上存在显著差异,强调需要将理解为中心的指标与交互为中心的指标相结合。这些诊断提供了改进未来 OLMs 的可操作信号。
Online Experiential Learning for Language Models
Authors: Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
First: 2026-03-17T17:57:49+00:00 · Latest: 2026-03-17T17:57:49+00:00
Abstract
The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.
中文标题/摘要
标题:语言模型的在线体验学习
提高大型语言模型的主流方法依赖于离线训练和人类注释或模拟环境,而忽视了在实际部署过程中积累的丰富经验。我们提出了在线体验学习(OEL),这是一种框架,使语言模型能够从自身的部署经验中持续改进。OEL分为两个阶段:首先,从用户端收集的交互轨迹中提取并积累可转移的经验知识;其次,通过在线策略上下文蒸馏将这些知识整合到模型参数中,无需访问用户端环境。这两个阶段迭代形成一个在线学习循环,在此过程中,改进后的模型收集更高质量的轨迹,从而为后续轮次提供更丰富的经验知识。我们在多个模型规模和思考与非思考变体的基于文本的游戏环境中评估了OEL。OEL在连续迭代中实现了持续改进,提高了任务准确性和标记效率,同时保持了离群分布性能。我们的分析还表明,提取的经验知识比原始轨迹更有效,知识来源与策略模型之间的在线一致性对于有效学习至关重要。
Mediocrity is the key for LLM as a Judge Anchor Selection
Authors: Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend
First: 2026-03-17T17:54:08+00:00 · Latest: 2026-03-17T17:54:08+00:00
Abstract
The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
中文标题/摘要
标题:平庸是作为法官锚点选择的关键
“LLM作为法官”范式已成为评估开放式生成的标准方法。为了解决成对比较的二次可扩展性成本,像Arena-Hard和AlpacaEval这样的流行基准将所有模型与单一锚点进行比较。然而,尽管其广泛应用,锚点选择对结果可靠性的影响尚未得到充分探索。在本文中,我们系统地研究了锚点选择的影响,通过在Arena-Hard-v2.0数据集上评估22种不同的锚点。我们发现,锚点的选择至关重要:一个糟糕的锚点可以显著降低与人类排名的相关性。我们确定,常见的锚点选择(表现最佳和最差的模型)是糟糕的锚点。因为这些极端的锚点始终比其他所有模型表现更好或更差,所以它们很少能反映模型之间的相对排名。我们进一步量化了锚点选择的影响,表明其与法官模型的选择相当。最后,我们提出了一些可操作的建议。首先,我们进行了一次功效分析,计算了基于锚点评估所需的基准规模,发现标准基准规模对于成对评估是不足的,并且无法可靠地区分竞争模型。其次,我们提供了选择信息性锚点的指南,以确保可靠的和高效的评估实践。
Summary / 总结
This study investigates the impact of anchor selection on the reliability of LLM-as-a-judge evaluations by comparing 22 different anchors on the Arena-Hard-v2.0 dataset. It finds that the choice of anchor is crucial, with poor anchors significantly reducing correlation with human rankings. In particular, the common choices of the best-performing or worst-performing model make poor anchors, because they are consistently better or worse than all other models and so say little about relative ranking. The study uses a power analysis to compute sufficient benchmark sizes and provides guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
研究通过使用Arena-Hard-v2.0数据集评估22种不同的锚点,发现锚点的选择显著影响与人类排名的相关性。研究指出,最佳和最差表现的模型作为锚点是不合适的,因为它们不能准确反映模型之间的相对排名。研究建议采用足够的基准大小和选择信息性锚点的指南,以确保评估实践的可靠性和效率。
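Two points from this entry can be illustrated with a few lines of arithmetic: anchoring replaces a quadratic number of pairwise comparisons with a linear one, and an extreme anchor compresses the win rates that are supposed to separate models. The sketch below is a toy model, not the paper's analysis. It assumes a Bradley-Terry win probability and made-up strength values, and all function names are hypothetical.

```python
def num_comparisons(n_models, anchored):
    """Comparisons per prompt: every pair, or each model against one anchor."""
    if anchored:
        return n_models
    return n_models * (n_models - 1) // 2

def win_rates_vs_anchor(strengths, anchor_strength):
    # ASSUMPTION: Bradley-Terry model, P(model beats anchor) = s / (s + s_anchor).
    return [s / (s + anchor_strength) for s in strengths]

def spread(values):
    """Gap between the highest and lowest win rate."""
    return max(values) - min(values)

strengths = [1.0, 2.0, 4.0]  # hypothetical model strengths
mid_spread = spread(win_rates_vs_anchor(strengths, 2.0))        # mid-ranked anchor
extreme_spread = spread(win_rates_vs_anchor(strengths, 100.0))  # dominant anchor
```

In this toy setting the dominant anchor beats every model almost every time, so the win rates bunch together near zero and a fixed number of judged prompts has less power to tell the models apart.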
Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Authors: Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang
First: 2026-02-16T23:45:39+00:00 · Latest: 2026-03-17T17:54:04+00:00
Abstract
Standard transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) -- far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch -- unlike GQA and MLA, which must be designed into the architecture before pretraining. We factorize each key projection $W_K \approx A_{d \times r} B_{r \times d}$ via truncated SVD (where $r = d_{\text{select}}$), set $W_K' = A$ as the new key projection producing compact $r$-dimensional keys for the cache, and absorb $B^\top$ into the query projection ($W_Q' = W_Q B^\top$) at zero cost -- since queries are never cached. At 7B scale, training from scratch with $r = d_{\text{model}}/4$ matches full-attention perplexity (9.2 vs 9.3 PPL after 20B tokens) while using 12% fewer parameters and training 8% faster. For existing models, SVD + QK fine-tuning (3 epochs, less than 1% of pretraining data) achieves 75% key cache savings at approximately 2% quality cost on both GPT-2 and Mistral-7B. The approach composes with GQA and quantization for up to $16\times$ combined key cache compression. For a 7B model serving 128K context, factored keys save 25 GB of KV cache per user, enabling approximately 60% more concurrent users on identical hardware.
Summary / 总结
The paper aims to reduce the key-value (KV) cache size in transformer models by exploiting the different roles of queries, keys, and values. It introduces factored keys, which use low-dimensional attention selection to shrink the key cache of a pretrained model without retraining from scratch. At 7B scale, training from scratch with $r = d_{\text{model}}/4$ matches full-attention perplexity while using 12% fewer parameters and training 8% faster; for existing models, SVD plus QK fine-tuning achieves 75% key cache savings at approximately 2% quality cost.
本文旨在解决标准变压器注意力机制的低效问题,通过提出因子键来减少键的维度同时保持全维度的值。作者证明,选择只需要对数维度,而值传输需要更多维度。对于一个7B模型,使用因子键训练可以减少12%的参数量,加快8%的训练速度,同时达到与全注意力模型相当的困惑度。对于现有模型,通过SVD和QK微调可以实现75%的键缓存节省,同时质量损失极小。
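The absorption step in this entry rests on a simple identity: if $W_K = A B$, then the attention logit computed from full keys equals the logit computed from thin keys $x A$ and queries projected by $W_Q B^\top$. The sketch below checks that identity on small integer matrices. It assumes row-vector inputs and builds $W_K$ as an exactly rank-$r$ product, whereas the paper obtains $A$ and $B$ from a truncated SVD, where the equality is only approximate. All names and values are illustrative.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention_logit(x_q, x_k, W_Q, W_K):
    """Unscaled dot product between the projected query and key (row vectors)."""
    q = matmul([x_q], W_Q)[0]
    k = matmul([x_k], W_K)[0]
    return sum(qi * ki for qi, ki in zip(q, k))

# Toy sizes: d = 4, r = 2. ASSUMPTION: W_K is exactly A @ B (no SVD truncation error).
A = [[1, 0], [2, 1], [0, 3], [1, 1]]      # d x r, becomes the new key projection
B = [[1, 2, 0, 1], [0, 1, 1, 2]]          # r x d, absorbed into the query side
W_K = matmul(A, B)                        # d x d, rank r
W_Q = [[1, 0, 2, 0], [0, 1, 0, 1], [3, 0, 1, 0], [0, 2, 0, 1]]
x_q, x_k = [1, 2, 0, 1], [0, 1, 3, 2]

full = attention_logit(x_q, x_k, W_Q, W_K)                        # d-dim keys
thin = attention_logit(x_q, x_k, matmul(W_Q, transpose(B)), A)    # r-dim keys
```

Only the $r$-dimensional keys need to be cached in the thin form, which is where the key cache saving comes from.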
M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM
Authors: Kerui Ren, Guanghao Li, Changjian Jiang, Yingxiang Xu, Tao Lu, Linning Xu, Junting Dong, Jiangmiao Pang, Mulin Yu, Bo Dai
First: 2026-03-17T17:52:37+00:00 · Latest: 2026-03-17T17:52:37+00:00
Comments: Project page: https://city-super.github.io/M3/
Abstract
Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
中文标题/摘要
标题:M^3:多视图基础模型与单目高斯点云SLAM的密集匹配结合
从未标定的单目视频流式重建仍然具有挑战性,因为它需要在动态环境中同时实现高精度的位姿估计和计算高效的在线优化。虽然将3D基础模型与SLAM框架结合是一种有前景的范式,但一个关键瓶颈仍然存在:大多数多视图基础模型以前馈方式估计位姿,产生像素级对应关系,缺乏进行严格几何优化所需的精度。为了解决这个问题,我们提出了M^3,它通过添加一个专门的匹配头来增强多视图基础模型,以促进细粒度的密集对应关系,并将其集成到鲁棒的单目高斯点云SLAM中。M^3进一步通过引入动态区域抑制和跨推理固有对齐来增强跟踪稳定性。在多种室内外基准上的广泛实验表明,M^3在位姿估计和场景重建方面的准确性达到了最先进的水平。值得注意的是,M^3将ATE RMSE降低了64.3%,与VGGT-SLAM 2.0相比,并且在ScanNet++数据集上的PSNR上比ARTDECO高出2.11 dB。
Summary / 总结
M^3 addresses the challenge of monocular SLAM in dynamic environments by integrating a dense matching head with a multi-view foundation model, leading to improved pose estimation and scene reconstruction accuracy. Experiments show a 64.3% reduction in ATE RMSE compared to VGGT-SLAM 2.0 and better PSNR than ARTDECO on ScanNet++.
M^3通过将多视图基础模型与专门的匹配头结合,以实现精确的密集对应关系,解决了动态环境下的单目SLAM挑战。该方法与单目高斯散射SLAM结合,提高了姿态估计和场景重建的准确性。实验结果显示,M^3在ATE RMSE上相比VGGT-SLAM 2.0降低了64.3%,在PSNR上相比ARTDECO提高了2.11 dB。
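For readers unfamiliar with the two metrics quoted in this entry, the sketch below shows how ATE RMSE and PSNR are commonly computed. It assumes the estimated and ground-truth trajectories are already aligned and time-synchronized (benchmark ATE normally includes an alignment step that is omitted here), and the function names are illustrative.

```python
import math

def ate_rmse(estimated, ground_truth):
    """Root-mean-square translation error between matched camera positions.

    ASSUMPTION: both trajectories are already aligned and have equal length.
    """
    squared = [
        sum((e - g) ** 2 for e, g in zip(est, gt))
        for est, gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

est = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
error = ate_rmse(est, gt)
quality = psnr(0.01)
```

Lower ATE RMSE means a more accurate trajectory, while higher PSNR means renderings closer to the reference images.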
Internalizing Agency from Reflective Experience
Authors: Rui Ge, Yichao Fu, Yuyang Qian, Junda Su, Yiming Zhao, Peng Zhao, Hao Zhang
Venue: ICML 2026
First: 2026-03-17T17:50:47+00:00 · Latest: 2026-03-17T17:50:47+00:00
Comments: 17 pages, 5 figures; Submitted to ICML 2026
Abstract
Large language models are increasingly deployed as autonomous agents that must plan, act, and recover from mistakes through long-horizon interaction with environments that provide rich feedback. However, prevailing outcome-driven post-training methods (e.g., RL with verifiable rewards) primarily optimize final success signals, leaving rich environment feedback underutilized. Consequently, they often lead to distribution sharpening: the policy becomes better at reproducing a narrow set of already-successful behaviors, while failing to improve the feedback-grounded agency needed to expand problem-solving capacity (e.g., Pass@k) in long-horizon settings.
To address this, we propose LEAFE (Learning Feedback-Grounded Agency from Reflective Experience), a framework that internalizes recovery agency from reflective experience. Specifically, during exploration, the agent summarizes environment feedback into actionable experience, backtracks to earlier decision points, and explores alternative branches with revised actions. We then distill these experience-guided corrections into the model through supervised fine-tuning, enabling the policy to recover more effectively in future interactions. Across a diverse set of interactive coding and agentic tasks under fixed interaction budgets, LEAFE consistently improves Pass@1 over the base model and achieves higher Pass@k than outcome-driven baselines (GRPO) and experience-based methods such as Early Experience, with gains of up to 14% on Pass@128.
中文标题/摘要
标题:从反思经验中内化代理权
大型语言模型越来越多地被部署为自主代理,必须通过与提供丰富反馈的长期环境互动来进行规划、行动并从错误中恢复。然而,当前主要依赖于结果驱动的后训练方法(例如,具有可验证奖励的强化学习)主要优化最终的成功信号,而未能充分利用丰富的环境反馈。因此,这些方法往往导致分布锐化:策略变得更好于重现一组已经成功的特定行为,而未能提高基于反馈的代理权,以在长期环境中扩展问题解决能力(例如,Pass@k)。
为了解决这个问题,我们提出了LEAFE(从反思经验中学习基于反馈的代理权),这是一种框架,可以从反思经验中内化恢复代理权。具体来说,在探索过程中,代理将环境反馈总结为可操作的经验,回溯到早期的决策点,并探索带有修订行动的替代分支。然后,我们通过监督微调将这些经验指导的修正提炼到模型中,使策略在未来互动中能够更有效地恢复。在固定互动预算下的各种互动编程和代理任务中,LEAFE在Pass@1上始终优于基线模型,并且在Pass@k上优于结果驱动的基线(GRPO)和基于经验的方法(如早期经验),在Pass@128上的收益高达14%。
Summary / 总结
The research aims to enhance the agency of large language models by utilizing rich environment feedback during exploration. LEAFE (Learning Feedback-Grounded Agency from Reflective Experience) is proposed, which involves summarizing feedback, backtracking to earlier decision points, and exploring alternative actions. This method improves the model's ability to solve problems in long-horizon settings, achieving up to 14% better performance on Pass@128 compared to outcome-driven and experience-based baselines.
研究旨在通过利用丰富的环境反馈来增强大型语言模型在长时间交互中的自主性。提出了LEAFE(从反思经验中学习反馈导向的自主性)框架,该框架使代理能够反思过去的经历,回溯到早期的决策点,并探索不同的行动路径。这种方法提高了模型在未来交互中的恢复能力和扩展问题解决能力,使得Pass@1和Pass@k等性能指标优于结果导向和经验导向的方法,最高可提升14%的Pass@128。
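The Pass@1 and Pass@k figures above can be made concrete with a small sketch. The abstract does not say how LEAFE computes Pass@k, so the code below assumes the widely used unbiased estimator (draw n samples per task, count c correct, estimate the chance that at least one of k samples is correct). The function name `pass_at_k` is illustrative and not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: total samples drawn for the task
    c: number of those samples that were correct
    k: budget being evaluated (must satisfy k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must lie in [0, n]")
    if not 1 <= k <= n:
        raise ValueError("k must lie in [1, n]")
    # If fewer than k samples are wrong, every size-k subset contains a success.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn samples are wrong), sampling without replacement.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 3 correct out of 128 samples.
    for k in (1, 8, 128):
        print(k, pass_at_k(128, 3, k))
```

Under this estimator, a policy that concentrates probability on already-solved tasks can raise Pass@1 while leaving Pass@128 flat, which is the distribution-sharpening effect the abstract describes.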
Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning
Authors: Jello Zhou, Vudtiwat Ngampruetikorn, David J. Schwab
First: 2026-03-17T17:50:32+00:00 · Latest: 2026-03-17T17:50:32+00:00
Comments: 18 pages, 17 figures
Abstract
Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.
中文标题/摘要
标题:随机重置加速强化学习策略收敛
随机重置,即动态过程间歇性地返回到固定参考状态,已成为优化首次通过性质的强大机制。现有理论主要处理静态、非学习过程。本文探讨了随机重置与强化学习的相互作用,其中底层动力学通过经验进行适应。在表格网格环境中,我们发现即使随机重置不减少纯扩散代理的搜索时间,它也能加速策略收敛,表明一种超越经典首次通过优化的新机制。在具有神经网络价值近似的连续控制任务中,我们展示了随机重置在探索困难和奖励稀疏时如何改善深度强化学习。与时间折扣不同,随机重置保留了最优策略,通过截断无信息的长轨迹来加速价值传播,从而提高收敛性。我们的结果将统计力学中的一个典型现象转化为强化学习的优化原则,确立了随机重置作为加速学习的简单可调机制。
Summary / 总结
The study investigates how stochastic resetting, a process that intermittently returns a dynamical system to a fixed state, affects reinforcement learning. In tabular grid environments, resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, suggesting a new mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation.
研究探讨了随机重置(将动态过程间歇性地恢复到固定状态)如何影响强化学习。在表格网格环境中,随机重置加速了策略收敛,而无需减少搜索时间,表明超越经典第一通过优化的新机制。在连续控制任务中,使用神经网络价值近似的随机重置通过截断无信息轨迹来改善学习,保持最优策略并增强价值传播,特别是在稀疏奖励场景中。
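As a rough illustration of the classical first-passage setting that the paper starts from, the sketch below simulates a one-dimensional random walker that is returned to the origin with a fixed probability at each step. It is a toy under stated assumptions: the 1D lattice, the reset rule, and the names `first_passage_steps` and `mean_first_passage` are illustrative choices, not the paper's grid environments or its learning agent.

```python
import random


def first_passage_steps(target: int, reset_prob: float,
                        rng: random.Random, max_steps: int = 10_000) -> int:
    """Steps until an unbiased walker starting at 0 first reaches `target`.

    At each step the walker resets to 0 with probability `reset_prob`,
    otherwise it moves one site left or right. Returns `max_steps` if the
    target is not reached within the step cap.
    """
    pos = 0
    for step in range(1, max_steps + 1):
        if rng.random() < reset_prob:
            pos = 0  # stochastic reset to the reference state
        else:
            pos += rng.choice((-1, 1))
        if pos == target:
            return step
    return max_steps


def mean_first_passage(target: int, reset_prob: float, trials: int = 200,
                       seed: int = 0, max_steps: int = 10_000) -> float:
    """Average (capped) first-passage time over independent trials."""
    rng = random.Random(seed)
    total = sum(first_passage_steps(target, reset_prob, rng, max_steps)
                for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for r in (0.0, 0.01, 0.05, 0.2):
        print(f"reset_prob={r}: mean steps = {mean_first_passage(5, r)}")
```

Sweeping `reset_prob` shows how the mean search time depends on the reset rate. The paper's claim concerns a different quantity, policy convergence of a learning agent, which this diffusive toy does not model.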
Learning to Present: Inverse Specification Rewards for Agentic Slide Generation
Authors: Karthik Ragunath Ananda Kumar, Subrahmanyam Arunachalam
First: 2026-03-17T17:45:53+00:00 · Latest: 2026-03-17T17:45:53+00:00
Comments: 12 pages, 11 figures, 13 tables, 26 references. Code: https://github.com/pushing-the-frontier/slide-forge-llm Dataset: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts
Abstract
Automated presentation generation remains a challenging task requiring coherent content creation, visual design, and audience-aware communication. This work proposes an OpenEnv-compatible reinforcement learning environment where LLM agents learn to research topics, plan content, and generate professional HTML slide presentations through tool use. We introduce a multi-component reward system combining structural validation, render quality assessment, LLM-based aesthetic scoring, content quality metrics, and an inverse specification reward that measures how faithfully generated slides convey their intended purpose. The inverse specification reward, an "inverse task" where an LLM attempts to recover the original specification from generated slides, provides a holistic quality signal. Our approach fine-tunes Qwen2.5-Coder-7B via GRPO, training only 0.5% of parameters on prompts derived from expert demonstrations collected using Claude Opus 4.6. Experiments on 48 diverse business briefs across six models demonstrate that our fine-tuned 7B model achieves 91.2% of Claude Opus 4.6's quality while improving 33.1% over the base model. The six-model comparison reveals that instruction adherence and tool-use compliance, rather than raw parameter count, determine agentic task performance. We contribute SlideRL, an open-source dataset of 288 multi-turn rollout trajectories across all six models: https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts Code: https://github.com/pushing-the-frontier/slide-forge-llm
中文标题/摘要
标题:学习展示:逆向规范奖励的代理幻灯片生成
自动演示文稿生成仍然是一个具有挑战性的任务,需要连贯的内容创作、视觉设计和面向观众的沟通。本研究提出了一种与OpenEnv兼容的强化学习环境,其中LLM代理通过工具使用学习研究主题、规划内容并生成专业的HTML幻灯片演示文稿。我们引入了一个多组件奖励系统,结合结构验证、渲染质量评估、基于LLM的美学评分、内容质量指标以及一个逆向规范奖励,该奖励衡量生成的幻灯片在多大程度上忠实传达了其预期目的。逆向规范奖励是一种“逆向任务”,其中LLM尝试从生成的幻灯片中恢复原始规范,提供了一个全面的质量信号。我们的方法通过GRPO对Qwen2.5-Coder-7B进行微调,仅在来自使用Claude Opus 4.6收集的专家演示的提示上训练0.5%的参数。在六个模型对48个不同业务简报的实验中,我们的微调7B模型达到了Claude Opus 4.6质量的91.2%,并且比基模型提高了33.1%。六模型比较表明,指令遵循和工具使用合规性,而不是参数数量,决定了代理任务的性能。我们贡献了SlideRL,一个包含288个多轮展开轨迹的开源数据集,涵盖了所有六个模型:https://huggingface.co/datasets/KarthikRagunathAnandaKumar/sliderl-multi-turn-rollouts 代码:https://github.com/pushing-the-frontier/slide-forge-llm
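One plausible way to combine the five reward components named in the abstract is a normalized weighted sum, sketched below. The abstract does not give the aggregation rule or any weights, so the component keys, the weight values, and the function `combined_reward` are all hypothetical.

```python
# Hypothetical weights; the paper's actual values are not stated in the abstract.
DEFAULT_WEIGHTS = {
    "structure": 1.0,      # structural validation
    "render": 1.0,         # render quality assessment
    "aesthetic": 1.0,      # LLM-based aesthetic scoring
    "content": 1.0,        # content quality metrics
    "inverse_spec": 2.0,   # inverse specification reward
}


def combined_reward(scores: dict[str, float],
                    weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-component scores, each clipped to [0, 1]."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = 0.0
    for name, weight in weights.items():
        score = min(1.0, max(0.0, scores[name]))  # clip out-of-range scores
        weighted += weight * score
    return weighted / total_weight


if __name__ == "__main__":
    example = {"structure": 1.0, "render": 0.8, "aesthetic": 0.6,
               "content": 0.7, "inverse_spec": 0.5}
    print(combined_reward(example))
```

Normalizing by the total weight keeps the result in [0, 1] regardless of how the components are weighted, which makes rewards comparable across configurations.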
OrigamiBench: An Interactive Environment to Synthesize Flat-Foldable Origamis
Authors: Naaisha Agarwal, Yihan Wu, Yichang Jian, Yikuan Hu, Nishad Mansoor, Mohan Li, Yifei Peng, Wang-Zhou Dai, Yao-Xiang Ding, Emanuele Sansone
First: 2026-03-14T09:33:29+00:00 · Latest: 2026-03-17T17:36:55+00:00
Abstract
Building AI systems that can plan, act, and create in the physical world requires more than pattern recognition. Such systems must understand the causal mechanisms and constraints governing physical processes in order to guide sequential decisions. This capability relies on internal representations, analogous to an internal language model, that relate observations, actions, and resulting environmental changes. However, many existing benchmarks treat visual perception and programmatic reasoning as separate problems, focusing either on visual recognition or on symbolic tasks. The domain of origami provides a natural testbed that integrates these modalities. Constructing shapes through folding operations requires visual perception, reasoning about geometric and physical constraints, and sequential planning, while remaining sufficiently structured for systematic evaluation. We introduce OrigamiBench, an interactive benchmark in which models iteratively propose folds and receive feedback on physical validity and similarity to a target configuration. Experiments with modern vision-language models show that scaling model size alone does not reliably produce causal reasoning about physical transformations. Models fail to generate coherent multi-step folding strategies, suggesting that visual and language representations remain weakly integrated.
中文标题/摘要
标题:OrigamiBench:一种交互式环境以合成可平面折叠的纸艺作品
构建能够在物理世界中规划、行动和创造的AI系统需要的不仅仅是模式识别。这样的系统必须理解物理过程中的因果机制和约束,以便指导顺序决策。这种能力依赖于类似于内部语言模型的内部表示,将观察、行动和环境变化的结果联系起来。然而,许多现有的基准将视觉感知和程序化推理视为两个独立的问题,要么专注于视觉识别,要么专注于符号任务。折纸领域提供了一个自然的测试平台,可以整合这些模态。通过折叠操作构建形状需要视觉感知、几何和物理约束的推理以及顺序规划,同时保持足够的结构化以便系统性评估。我们介绍了OrigamiBench,这是一个交互式基准,在其中模型迭代地提出折叠并接收关于物理有效性和与目标配置相似性的反馈。现代视觉-语言模型的实验表明,仅扩大模型规模并不能可靠地产生关于物理变换的因果推理。模型无法生成连贯的多步折叠策略,这表明视觉和语言表示仍然结合得不够紧密。
Summary / 总结
The research aims to develop AI systems capable of understanding and manipulating the physical world, focusing on the domain of origami to integrate visual perception and symbolic reasoning. The main method involves creating OrigamiBench, an interactive environment where models propose folding operations and receive feedback on their validity and similarity to a target configuration. Key findings indicate that increasing model size does not necessarily lead to causal reasoning about physical transformations, and models struggle to generate coherent multi-step folding strategies, highlighting the need for better integration of visual and language representations.
研究旨在开发能够理解和操作物理世界的AI系统,通过因果推理和内部表示。主要方法是创建OrigamiBench,一个交互式环境,模型提出折叠动作并接收其有效性及与目标相似度的反馈。关键发现表明,仅扩大模型规模并不能提高对物理变换的因果推理能力,模型在生成连贯的多步折叠策略方面存在困难,这表明视觉和语言表示需要更好地整合。
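The abstract does not specify how OrigamiBench judges physical validity. As an illustration of the kind of geometric constraint flat-foldable origami must satisfy, the sketch below checks two classical necessary conditions at a single interior vertex: Kawasaki's theorem (alternating sector angles sum to 180 degrees) and Maekawa's theorem (mountain and valley crease counts differ by two). The function names are illustrative and are not the benchmark's API.

```python
def kawasaki_holds(angles_deg: list[float], tol: float = 1e-9) -> bool:
    """Kawasaki's theorem at one interior vertex.

    `angles_deg` lists the consecutive sector angles between creases,
    in order around the vertex; they must total 360 degrees.
    """
    if len(angles_deg) % 2 != 0:
        return False  # a flat-foldable vertex has an even number of creases
    if abs(sum(angles_deg) - 360.0) > tol:
        return False
    # The alternating sum vanishes iff every other angle sums to 180 degrees.
    return abs(sum(angles_deg[0::2]) - 180.0) <= tol


def maekawa_holds(creases: str) -> bool:
    """Maekawa's theorem: |#mountain - #valley| == 2 for creases given as 'M'/'V'."""
    return abs(creases.count("M") - creases.count("V")) == 2


if __name__ == "__main__":
    print(kawasaki_holds([90, 90, 90, 90]), maekawa_holds("MMMV"))
    print(kawasaki_holds([120, 100, 80, 60]), maekawa_holds("MMVV"))
```

These are local, single-vertex conditions. Global flat-foldability of a full crease pattern is a harder problem, so a check like this could at most serve as a fast rejection test.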
Prompt Programming for Cultural Bias and Alignment of Large Language Models
Authors: Maksim Eren, Eric Michalak, Brian Cook, Johnny Seales
First: 2026-03-17T17:34:40+00:00 · Latest: 2026-03-17T17:34:40+00:00
Comments: 10 pages, pre-print
Abstract
Culture shapes reasoning, values, prioritization, and strategic decision-making, yet large language models (LLMs) often exhibit cultural biases that misalign with target populations. As LLMs are increasingly used for strategic decision-making, policy support, and document engineering tasks such as summarization, categorization, and compliance-oriented auditing, improving cultural alignment is important for ensuring that downstream analyses and recommendations reflect target-population value profiles rather than default model priors. Previous work introduced a survey-grounded cultural alignment framework and showed that culture-specific prompting can reduce misalignment, but it primarily evaluated proprietary models and relied on manual prompt engineering. In this paper, we validate and extend that framework by reproducing its social-science survey-based projection and distance metrics on open-weight LLMs, testing whether the same cultural skew and benefits of culture conditioning persist outside closed LLM systems. Building on this foundation, we introduce prompt programming with DSPy for this problem, treating prompts as modular, optimizable programs, to systematically tune cultural conditioning by optimizing against cultural-distance objectives. In our experiments, we show that prompt optimization often improves upon cultural prompt engineering, suggesting that prompt compilation with DSPy can provide a more stable and transferable route to culturally aligned LLM responses.
中文标题/摘要
标题:提示编程以消除大型语言模型的文化偏见和实现文化对齐
文化塑造了推理、价值观、优先级和战略决策,然而,大型语言模型(LLMs)常常表现出文化偏见,这与目标人群的价值观不一致。随着LLMs在战略决策、政策支持和文档工程任务(如总结、分类和合规审计)中的应用越来越广泛,提高文化对齐变得尤为重要,以确保下游分析和建议反映目标人群的价值观,而不是默认模型的先验。先前的工作引入了一种基于调查的文化对齐框架,并表明文化特定的提示可以减少偏见,但主要评估了专有模型,并依赖于手动提示工程。在本文中,我们通过在开放权重LLMs上重现其社会学调查基于的投影和距离度量,验证并扩展了该框架,测试这种文化偏差和文化条件的好处是否在封闭LLM系统之外仍然存在。在此基础上,我们介绍了使用DSPy进行提示编程的方法,将提示视为可模块化、可优化的程序,以系统地通过优化文化距离目标来调整文化条件。在我们的实验中,我们展示了提示优化通常优于文化提示工程,表明使用DSPy进行提示编译可以提供一种更稳定和可转移的途径,以实现文化对齐的LLM响应。
Summary / 总结
The paper addresses the issue of cultural biases in large language models (LLMs) and introduces a framework for cultural alignment through survey-grounded cultural conditioning. By using prompt programming with DSPy, the authors systematically optimize prompts to reduce cultural misalignment. Experiments show that prompt optimization often outperforms manual prompt engineering, indicating that prompt compilation with DSPy can provide a more stable and transferable solution for culturally aligned LLM responses.
本文旨在解决大型语言模型(LLMs)中的文化偏见问题,确保它们与目标人群相匹配,特别是在战略决策任务方面。作者通过将其应用于开放权重LLMs来验证基于调查的文化对齐框架,并引入了使用DSPy的提示编程,以系统地调整文化对齐。实验表明,提示优化通常优于手动提示工程,表明了一种更稳定和可转移的方法来实现文化对齐的LLM响应。
Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers
Authors: Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon
First: 2025-10-15T17:51:58+00:00 · Latest: 2026-03-17T17:33:11+00:00
Comments: Preprint Under Review
Abstract
The use of copyrighted books for training AI has sparked lawsuits from authors concerned about AI generating derivative content. Yet whether these models can produce high-quality literary text emulating authors' voices remains unclear. We conducted a preregistered study comparing MFA-trained writers with three frontier models (ChatGPT, Claude, Gemini) writing up to 450-word excerpts emulating 50 award-winning authors' styles. In blind pairwise evaluations by 28 MFA-trained readers and 516 college-educated general readers, AI text from in-context prompting was strongly disfavored by MFA readers for stylistic fidelity (OR=0.16) and quality (OR=0.13), while general readers showed no fidelity preference (OR=1.06) but favored AI for quality (OR=1.82). Fine-tuning ChatGPT on authors' complete works reversed these results: MFA readers favored AI for fidelity (OR=8.16) and quality (OR=1.87), with general readers showing even stronger preference (fidelity OR=16.65; quality OR=5.42). Both groups preferred fine-tuned AI, but the writer-type X reader-type interaction remained significant (p=0.021 for fidelity; p<10^-4 for quality), indicating general readers favored AI by a wider margin. Effects are robust under cluster-robust inference and generalize across authors in heterogeneity analyses. Fine-tuned outputs were rarely flagged as AI-generated (3% vs. 97% for prompting) by leading detectors. Mediation analysis shows fine-tuning eliminates detectable AI quirks that penalize in-context outputs, altering the nexus between detectability and preference. While not accounting for effort to transform AI output into publishable prose, the median fine-tuning cost of $81 per author represents a 99.7% reduction versus typical writer compensation. Author-specific fine-tuning enables non-verbatim AI writing preferred over expert human writing, providing evidence relevant to copyright's fourth fair-use factor.
中文标题/摘要
标题:读者更偏好基于版权书籍训练的AI输出而非专家人类作家
使用版权书籍训练AI引发了作者的诉讼,他们担心AI生成衍生内容。然而,这些模型是否能产生高质量的文学文本,模仿作者的声音仍不清楚。我们进行了一项预先注册的研究,比较了MFA训练的作家与三个前沿模型(ChatGPT、Claude、Gemini)写作的450字左右的片段,模仿50位获奖作者的风格。在28名MFA训练的读者和516名受过高等教育的普通读者的盲法成对评估中,MFA读者对AI文本在风格忠实度(OR=0.16)和质量(OR=0.13)方面表现出强烈偏好,而普通读者在风格忠实度方面没有偏好(OR=1.06),但在质量方面更偏好AI(OR=1.82)。对ChatGPT进行作者完整作品的微调逆转了这些结果:MFA读者更偏好AI在忠实度(OR=8.16)和质量(OR=1.87)方面,普通读者的偏好更为强烈(忠实度OR=16.65;质量OR=5.42)。两组都更偏好微调后的AI,但作家类型与读者类型之间的交互作用仍然显著(忠实度p=0.021;质量p<10^-4),表明普通读者更强烈地偏好AI。在聚类稳健推断下,效果稳健,并在异质性分析中泛化。微调后的输出很少被领先检测器标记为AI生成(3% vs. 97%)。中介分析表明,微调消除了可检测到的AI怪癖,这些怪癖惩罚了上下文输出,改变了可检测性和偏好的关系。虽然没有考虑将AI输出转化为可发表文本所需的努力,但每位作者的微调成本中位数为81美元,与普通作家的报酬相比,减少了99.7%。针对作者的微调使非逐字AI写作更受欢迎,超过了专家人类写作,为版权第四项公平使用因素提供了相关证据。
Summary / 总结
The study evaluates whether AI trained on copyrighted books can produce literary text that readers prefer over expert human writing. In a preregistered study, three frontier models (ChatGPT, Claude, Gemini) were compared with MFA-trained writers on excerpts of up to 450 words emulating 50 award-winning authors' styles. With in-context prompting, MFA-trained readers strongly disfavored AI text for both stylistic fidelity and quality, while general readers showed no fidelity preference but favored AI for quality. Fine-tuning ChatGPT on authors' complete works reversed these results: both MFA-trained and general readers preferred the fine-tuned AI, general readers by a wider margin. Fine-tuned outputs were rarely flagged as AI-generated by leading detectors (3% vs. 97% for prompting), and the median fine-tuning cost of $81 per author represents a 99.7% reduction versus typical writer compensation.
该研究比较了AI生成的文本与专业作家的手稿,使用了三个先进模型(ChatGPT、Claude、Gemini)和MFA训练的作家。在MFA读者和普通读者的评估中,从上下文提示生成的AI文本在风格和质量上都不如人工生成的文本。然而,对ChatGPT进行作者完整作品的微调后,MFA读者和普通读者都更偏好微调后的AI文本。研究还发现,微调后的AI输出很少被检测为AI生成的文本,微调的成本也远低于人类作家的费用。
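Two of the reported figures can be unpacked with simple arithmetic, sketched below. The first helper backs out the baseline compensation implied by "$81 is a 99.7% reduction". The second converts odds into a probability, which matches the reported odds ratios only under the simplifying assumption that an OR can be read directly as the odds of the AI excerpt being chosen in a pair; the paper's regression specification may define its ORs differently, so treat those values as indicative at most. Both function names are illustrative.

```python
def implied_baseline_cost(new_cost: float, reduction: float) -> float:
    """Baseline cost implied by a fractional reduction (0 < reduction < 1)."""
    if not 0.0 < reduction < 1.0:
        raise ValueError("reduction must be a fraction strictly between 0 and 1")
    return new_cost / (1.0 - reduction)


def odds_to_probability(odds: float) -> float:
    """Convert odds in favor of an outcome into a probability."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Figures quoted in the abstract: $81 median cost, 99.7% reduction.
    print(round(implied_baseline_cost(81.0, 0.997)))
    # Assumption: OR read as choice odds (see the note above).
    for label, value in (("MFA fidelity, prompting", 0.16),
                         ("MFA fidelity, fine-tuned", 8.16)):
        print(label, odds_to_probability(value))
```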
Real-Time Decoding of Movement Onset and Offset for Brain-Controlled Rehabilitation Exoskeleton
Authors: Kanishka Mitra, Satyam Kumar, Frigyes Samuel Racz, Deland Liu, Ashish D. Deshpande, José del R. Millán
Venue: ICRA 2026
First: 2026-03-17T17:32:43+00:00 · Latest: 2026-03-17T17:32:43+00:00
Comments: Accepted to ICRA 2026. 8 pages, 5 figures. Project page available at https://mitrakanishka.github.io/projects/startstop-bci/
Abstract
Robot-assisted therapy can deliver high-dose, task-specific training after neurologic injury, but most systems act primarily at the limb level, engaging the impaired neural circuits only indirectly, which remains a key barrier to truly contingent, neuroplasticity-targeted rehabilitation. We address this gap by implementing online, dual-state motor imagery control of an upper-limb exoskeleton, enabling goal-directed reaches to be both initiated and terminated directly from non-invasive EEG. Eight participants used EEG to initiate assistance and then volitionally halt the robot mid-trajectory. Across two online sessions, group-mean hit rates were 61.5% for onset and 64.5% for offset, demonstrating reliable start-stop command delivery despite instrumental noise and passive arm motion. Methodologically, we reveal a systematic, class-driven bias induced by common task-based recentering using an asymmetric margin diagnostic, and we introduce a class-agnostic fixation-based recentering method that tracks drift without sampling command classes while preserving class geometry. This substantially improves threshold-free separability (AUC gains: onset +56%, p = 0.0117; offset +34%, p = 0.0251) and reduces bias within and across days. Together, these results help bridge offline decoding and practical, intention-driven start-stop control of a rehabilitation exoskeleton, enabling precisely timed, contingent assistance aligned with neuroplasticity goals while supporting future clinical translation.
中文标题/摘要
标题:实时解码运动开始和结束以实现脑控康复外骨骼
机器人辅助疗法可以在神经损伤后提供高剂量、任务特异性的训练,但大多数系统主要在肢体层面起作用,仅间接激活受损的神经回路,这仍然是实现真正条件反射、针对神经可塑性的康复的关键障碍。我们通过实现在线的双状态运动想象控制上肢外骨骼,使参与者能够直接从非侵入性EEG启动并自愿在轨迹中停止机器人的帮助。在两次在线会话中,组平均击中率分别为61.5%的开始和64.5%的结束,尽管存在工具噪声和被动手臂运动,仍能可靠地传递启动-停止命令。方法上,我们揭示了一种由常见基于任务的重新中心化引起的系统性、类别驱动的偏差,并引入了一种无类别固定点重新中心化方法,该方法跟踪漂移而不采样命令类别,同时保持类别几何结构。这显著提高了阈值自由可分性(AUC增益:开始+56%,p=0.0117;结束+34%,p=0.0251),并在一天内和跨天内减少了偏差。这些结果有助于将离线解码与实际、意图驱动的启动-停止控制康复外骨骼相结合,实现与神经可塑性目标相一致的精确、条件反射性辅助,同时支持未来的临床转化。
Summary / 总结
The study aims to enhance brain-controlled rehabilitation exoskeletons by enabling direct control of movement onset and offset using non-invasive EEG. The researchers implemented an online system for dual-state motor imagery control of an upper-limb exoskeleton, allowing participants to initiate and halt robot-assisted reaches. Across two sessions, the group achieved hit rates of 61.5% for onset and 64.5% for offset, demonstrating reliable command delivery. Methodologically, they introduced a class-agnostic recentering method that improved separability and reduced bias, facilitating more precise and intention-driven control of the exoskeleton for neuroplasticity-targeted rehabilitation.
研究旨在通过使用非侵入性EEG直接控制运动的开始和结束,提升康复外骨骼的脑控能力。研究人员实现了一个在线系统,用于上肢外骨骼的双状态运动意象控制,允许参与者启动和停止机器人辅助的运动。在两个会话中,参与者达到了61.5%的开始准确率和64.5%的结束准确率,展示了可靠的命令传递能力。方法上,他们引入了一种无类别重中心化方法,提高了区分度并减少了偏差,从而实现了更精确和意图驱动的外骨骼控制,以实现神经可塑性目标。
Surg$Σ$: A Spectrum of Large-Scale Multimodal Data and Foundation Models for Surgical Intelligence
Authors: Zhitao Zeng, Mengya Xu, Jian Jiang, Pengfei Guo, Yunqiu Xu, Zhu Zhuo, Chang Han Low, Yufan He, Dong Yang, Chenxi Lin, Yiming Gu, Jiaxin Guo, Yutong Ban, Daguang Xu, Qi Dou, Yueming Jin
First: 2026-03-17T17:27:32+00:00 · Latest: 2026-03-17T17:27:32+00:00
Abstract
Surgical intelligence has the potential to improve the safety and consistency of surgical care, yet most existing surgical AI frameworks remain task-specific and struggle to generalize across procedures and institutions. Although multimodal foundation models, particularly multimodal large language models, have demonstrated strong cross-task capabilities across various medical domains, their advancement in surgery remains constrained by the lack of large-scale, systematically curated multimodal data. To address this challenge, we introduce Surg$Σ$, a spectrum of large-scale multimodal data and foundation models for surgical intelligence. At the core of this framework lies Surg$Σ$-DB, a large-scale multimodal data foundation designed to support diverse surgical tasks. Surg$Σ$-DB consolidates heterogeneous surgical data sources (including open-source datasets, curated in-house clinical collections and web-source data) into a unified schema, aiming to improve label consistency and data standardization across heterogeneous datasets. Surg$Σ$-DB spans 6 clinical specialties and diverse surgical types, providing rich image- and video-level annotations across 18 practical surgical tasks covering understanding, reasoning, planning, and generation, at an unprecedented scale (over 5.98M conversations). Beyond conventional multimodal conversations, Surg$Σ$-DB incorporates hierarchical reasoning annotations, providing richer semantic cues to support deeper contextual understanding in complex surgical scenarios. We further provide empirical evidence through recently developed surgical foundation models built upon Surg$Σ$-DB, illustrating the practical benefits of large-scale multimodal annotations, unified semantic design, and structured reasoning annotations for improving cross-task generalization and interpretability.
中文标题/摘要
标题:Surg$Σ$: 外科智能的多模态数据谱系与基础模型
外科智能有潜力提高外科护理的安全性和一致性,但大多数现有的外科人工智能框架仍局限于特定任务,难以在不同手术和机构之间泛化。尽管多模态基础模型,尤其是多模态大型语言模型,在各种医学领域展示了强大的跨任务能力,但在外科领域的进展受限于缺乏大规模、系统整理的多模态数据。为了解决这一挑战,我们引入了 Surg$Σ$,一种支持外科智能的大规模多模态数据和基础模型谱系。该框架的核心是 Surg$Σ$-DB,一种大规模多模态数据基础,旨在支持多种外科任务。Surg$Σ$-DB 将异构的外科数据源(包括开源数据集、内部临床收集和网络数据源)统一到一个模式中,旨在提高异构数据集之间的标签一致性和数据标准化。Surg$Σ$-DB 涵盖了 6 个临床专科和多种手术类型,提供了 18 个实用外科任务的丰富图像和视频级注释,覆盖了理解、推理、规划和生成,规模空前(超过 598 万次对话)。除了常规的多模态对话,Surg$Σ$-DB 还包含了层次推理注释,提供了更丰富的语义线索,以支持复杂外科场景中的深层次上下文理解。我们还通过基于 Surg$Σ$-DB 的新型外科基础模型提供了实证证据,展示了大规模多模态注释、统一语义设计和结构化推理注释对提高跨任务泛化能力和可解释性的实际益处。
Summary / 总结
Surg$Σ$ addresses the challenge of task-specific surgical AI by introducing a large-scale multimodal data foundation, Surg$Σ$-DB, which consolidates heterogeneous surgical data sources into a unified schema. This framework supports diverse surgical tasks across 6 clinical specialties and 18 practical surgical tasks, providing rich annotations. Empirical evidence from recently developed surgical foundation models built on Surg$Σ$-DB demonstrates improved cross-task generalization and interpretability due to large-scale multimodal annotations and structured reasoning annotations.
Surg$Σ$通过引入大规模多模态数据基础 Surg$Σ$-DB,整合了异构的手术数据源,形成了统一的模式,支持6个临床专科和18项实际手术任务,提供了丰富的注释。基于 Surg$Σ$-DB 开发的新型手术基础模型的实证研究表明,大规模多模态注释和结构化推理注释有助于提高跨任务泛化能力和可解释性。
Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models
Authors: Jihoon Jeong
First: 2026-03-05T01:49:29+00:00 · Latest: 2026-03-17T17:25:58+00:00
Comments: 56 pages, 7 figures. Project page: https://jihoonjeong.github.io/model-medicine/
Abstract
Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models -- like biological organisms -- have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions -- Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, explaining how model behavior emerges from Core--Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis -- a biologically-inspired three-layer parameter architecture -- and a therapeutic framework connecting diagnosis to treatment.
中文标题/摘要
标题:模型医学:理解、诊断和治疗AI模型的临床框架
模型医学是理解、诊断、治疗和预防AI模型紊乱的科学,基于AI模型——就像生物有机体一样——具有内部结构、动态过程、可遗传特征、可观察症状、可分类状况和可治疗状态的原则。本文介绍了模型医学作为研究计划,填补了当前AI可解释性研究(解剖观察)与复杂AI系统日益需要的系统临床实践之间的空白。我们提出了五项贡献:(1)一个学科分类,组织了四个部门下的15个子学科——基础模型科学、临床模型科学、模型公共卫生和模型建筑医学;(2)四层壳模型(v3.3),一个行为遗传框架,基于Agora-12计划中的720个代理和24,923个决策经验地建立,解释了模型行为如何从核心-壳相互作用中产生;(3)神经MRI(模型共振成像),一个工作中的开源诊断工具,将五种医学神经影像学模态映射到AI可解释性技术,通过四个临床案例验证了成像、比较、定位和预测能力;(4)一个五层诊断框架,用于全面的模型评估;(5)临床模型科学,包括模型气质指数进行行为特征分析、模型体征描述症状和M-CARE标准化病例报告。我们还提出了分层核心假设——一个生物启发的三层参数架构——以及将诊断与治疗连接起来的治疗框架。
Summary / 总结
Model Medicine is a research program that aims to understand, diagnose, and treat AI models by comparing them to biological organisms. It introduces a taxonomy of 15 subdisciplines and a behavioral genetics framework called the Four Shell Model. Key contributions include Neural MRI, a diagnostic tool, and a five-layer diagnostic framework. The research also proposes the Layered Core Hypothesis and clinical model sciences such as the Model Temperament Index and M-CARE for standardized reporting.
Model Medicine旨在通过将AI模型与生物体类比来理解、诊断和治疗它们。研究引入了15个子学科的分类体系和一个行为遗传框架,称为四壳模型。关键贡献包括Neural MRI,一个诊断工具,以及一个五层诊断框架。该研究还提出了层内核假设和临床模型科学,如模型气质指数和M-CARE,用于标准化报告。
Diffusion-DRF: Free, Rich, and Differentiable Reward for Video Diffusion Fine-Tuning
Authors: Yifan Wang, Yanyu Li, Gordon Guocheng Qian, Sergey Tulyakov, Yun Fu, Anil Kag
First: 2026-01-07T18:05:08+00:00 · Latest: 2026-03-17T17:21:55+00:00
Comments: Webpage: https://snap-research.github.io/diffusion-drf/
Abstract
Video diffusion alignment has relied heavily on scalar rewards. These rewards are typically derived from reward models learned on human preference datasets, requiring additional training and extensive data collection. Moreover, scalar rewards provide coarse, global supervision, offering limited credit assignment for prompt-generation mismatches and making models prone to reward exploitation and unstable optimization. We propose Diffusion-DRF, a free, rich, and differentiable reward framework for video diffusion fine-tuning. Diffusion-DRF employs a frozen, off-the-shelf Vision-Language Model (VLM) as the critic, eliminating the need for reward model training. Instead of relying on a single scalar reward, it decomposes each user prompt into multi-dimensional questions with freeform dense VQA explanation queries, yielding information-rich feedback. By direct differentiable optimization over this rich feedback, Diffusion-DRF achieves stable reward-based tuning without collecting preference datasets. Diffusion-DRF achieves significant gains both quantitatively and qualitatively, outperforming state-of-the-art Flow-GRPO by 4.74% in overall performance on unseen VBench-2.0.
中文标题/摘要
标题:扩散-DRF:免费、丰富且可微的奖励框架用于视频扩散微调
视频扩散对齐主要依赖于标量奖励。这些奖励通常来自人类偏好数据集中的学习奖励模型,需要额外的训练和广泛的收集。此外,标量奖励提供粗略的全局监督,对提示生成不匹配的信用分配有限,使模型容易受到奖励利用和不稳定优化的影响。我们提出了扩散-DRF,一种用于视频扩散微调的免费、丰富且可微的奖励框架。扩散-DRF 使用一个冻结的现成视觉-语言模型(VLM)作为批评者,消除了奖励模型训练的需要。它不依赖于单一的标量奖励,而是将每个用户提示分解为多维问题,并生成自由形式的密集VQA解释查询,提供丰富的反馈信息。通过直接对这种丰富反馈的可微优化,扩散-DRF 实现了稳定的基于奖励的微调,无需收集偏好数据集。扩散-DRF 在定量和定性方面均取得了显著的改进,在未见过的VBench-2.0 上的整体性能上优于最先进的Flow-GRPO 4.74%。
Summary / 总结
The paper proposes Diffusion-DRF, a reward framework for video diffusion fine-tuning that uses a frozen Vision-Language Model as a critic, avoiding the need for additional training. It decomposes user prompts into multi-dimensional questions with dense VQA feedback, providing rich and differentiable rewards. This approach leads to more stable optimization and outperforms state-of-the-art methods by 4.74% on unseen VBench-2.0.
Diffusion-DRF 是一种用于视频扩散微调的奖励框架,使用冻结的 Vision-Language 模型(VLM)作为批评者,避免了额外训练奖励模型的需要。它将用户提示分解为多维度的问题,并带有密集的 VQA 解释查询,提供丰富的反馈进行直接的可微优化。这种方法使得无需收集人类偏好数据即可实现稳定的奖励导向调优,并在 VBench-2.0 基准测试中比 Flow-GRPO 高出 4.74% 的性能。
Exploring Collatz Dynamics with Human-LLM Collaboration
Authors: Edward Y. Chang
First: 2026-03-10T02:07:00+00:00 · Latest: 2026-03-17T17:21:23+00:00
Comments: 127 pages, 11 figures, 13 tables
Abstract
We develop a quantitative framework for the Collatz conjecture through a human-LLM collaboration, combining exact arithmetic structure, cycle-level probabilistic laws, and a conditional convergence reduction. The central quantitative result is the Per-Orbit Gain Rate theorem, which proves R <= 0.0893 < epsilon = 2 - log_2 3 ~= 0.415, leaving a safety margin of at least 4.65x. A robustness corollary shows that exact equidistribution is unnecessary: it suffices that sum_K delta_K < 0.557. This promotes the Weak Mixing Hypothesis (WMH) to the primary open condition. On the arithmetic side, we refine modular crossing methods and prove that by depth 13 about 91 percent of odd residue classes are already forced to descend below their start. On the odd skeleton, we prove the exact run-length identity L(n) = v_2(n+1) - 1, derive an exact one-cycle crossing criterion, and compute the exact one-cycle crossing density P_1cyc = 0.713725498.... A major breakthrough is that the odd-skeleton valuation process satisfies an exact finite-block law: every prescribed valuation block occurs on a single odd residue class with the expected density. Hence the valuation process is exactly i.i.d. geometric in the natural-density ensemble, and the induced run-compensate cycle types are exactly i.i.d. This yields an exact cycle-level large-deviation theory and an unconditional almost-all crossing theorem in cycle language. We also prove substantial classwise deterministic crossing: about 41.9 percent of odd starts lie in one-cycle residue classes where every representative crosses below its start, and about 50.4 percent lie in two-cycle residue classes with the same universal crossing property. The framework does not yet prove Collatz. The remaining gap is now sharply isolated as a pointwise problem: proving that every deterministic orbit realizes enough of the exact negative cycle drift to cross below its start.
中文标题/摘要
标题:人类-LLM协作探索Collatz动态
我们通过人类-LLM协作开发了Collatz猜想的定量框架,结合精确算术结构、循环级概率定律和条件收敛简化。核心定量结果是每次轨道收益率定理,证明R <= 0.0893 < ε = 2 - log₂3 ≈ 0.415,留有至少4.65倍的安全余量。稳健性推论表明,精确等分布不是必需的:只要∑K δK < 0.557就足够了。这将弱混合假设(WMH)提升为主要的开放条件。在算术方面,我们细化了模交叉方法,并证明在深度13时约91%的奇数同余类已经被迫下降到其起点以下。在奇数骨架上,我们证明了精确的运行长度恒等式L(n) = v₂(n+1) - 1,推导出精确的一循环交叉准则,并计算出精确的一循环交叉密度P₁cyc = 0.713725498……一个重大突破是奇数骨架估值过程满足精确的有限块定律:每个指定的估值块在单一奇数同余类中以预期密度出现。因此,估值过程在自然密度集合中是精确的独立同分布几何过程,诱导的运行补偿循环类型也是精确的独立同分布。这产生了精确的循环级大偏差理论和循环语言中的几乎全部交叉定理。我们还证明了显著的分类确定性交叉:约41.9%的奇数起点位于每个代表都交叉到其起点以下的一循环同余类中,约50.4%位于具有相同普遍交叉性质的两循环同余类中。该框架尚未证明Collatz猜想。剩余的缺口现在被明确地隔离为一个点问题:证明每个确定性轨道实现了足够的精确负循环漂移以交叉到其起点以下。
Summary / 总结
The research aims to develop a quantitative framework for the Collatz conjecture through human-LLM collaboration, focusing on exact arithmetic structure, cycle-level probabilistic laws, and a conditional convergence reduction. Key findings include the Per-Orbit Gain Rate theorem, which proves R <= 0.0893 < epsilon = 2 - log_2 3 ~= 0.415; a refinement of modular crossing methods showing that by depth 13 about 91 percent of odd residue classes are forced to descend below their start; and the exact run-length identity L(n) = v_2(n+1) - 1 on the odd skeleton. The framework also proves that the valuation process is exactly i.i.d. geometric in the natural-density ensemble, yielding an exact cycle-level large-deviation theory and an unconditional almost-all crossing theorem in cycle language. The framework does not yet prove the Collatz conjecture; the remaining gap is a pointwise problem about individual deterministic orbits.
该研究通过人类-LLM协作开发了Collatz猜想的定量框架,重点关注精确的算术结构、循环级概率定律和条件收敛减少。关键发现包括Per-Orbit Gain Rate定理,证明R <= 0.0893 < epsilon,以及精确的运行长度公式L(n) = v_2(n+1) - 1,显示大约91%的奇数残差类在深度13时会下降到其起始值以下。该框架还证明了奇骨架上的估值过程是精确的i.i.d.几何分布,从而获得精确的循环级大偏差理论和无条件的几乎全部穿越定理。
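The run-length identity L(n) = v_2(n+1) - 1 can be checked numerically under one reading of L(n). The abstract does not define L(n), so the sketch below assumes it counts how many consecutive applications of x -> (3x+1)/2, starting from odd n, produce an odd result; `v2` is the 2-adic valuation. Both the definition and the function names are assumptions made for illustration.

```python
import math

# Drift threshold quoted in the abstract: epsilon = 2 - log_2 3 ~= 0.415.
EPSILON = 2 - math.log2(3)


def v2(n: int) -> int:
    """2-adic valuation: the exponent of the largest power of 2 dividing n > 0."""
    if n <= 0:
        raise ValueError("n must be positive")
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count


def run_length(n: int) -> int:
    """Assumed L(n): consecutive steps x -> (3x+1)/2 from odd n that yield an odd value."""
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    count = 0
    x = (3 * n + 1) // 2
    while x % 2 == 1:
        count += 1
        x = (3 * x + 1) // 2
    return count


if __name__ == "__main__":
    # Brute-force comparison of the two sides for small odd n.
    mismatches = [n for n in range(1, 10_001, 2)
                  if run_length(n) != v2(n + 1) - 1]
    print("mismatches:", mismatches)
    print("epsilon:", EPSILON)
```

The identity follows from writing n = 2^k m - 1 with m odd: each step maps 2^j m' - 1 to 3 * 2^(j-1) m' - 1, which stays odd until the power of two is exhausted.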
Is Conformal Factuality for RAG-based LLMs Robust? Novel Metrics and Systematic Insights
Authors: Yi Chen, Daiwei Chen, Sukrut Madhav Chikodikar, Caitlyn Heqi Yin, Ramya Korlakai Vinayak
First: 2026-03-17T17:20:08+00:00 · Latest: 2026-03-17T17:20:08+00:00
Comments: 56 pages
Abstract
Large language models (LLMs) frequently hallucinate, limiting their reliability in knowledge-intensive applications. Retrieval-augmented generation (RAG) and conformal factuality have emerged as potential ways to address this limitation. While RAG aims to ground responses in retrieved evidence, it provides no statistical guarantee that the final output is correct. Conformal factuality filtering offers distribution-free statistical reliability by scoring and filtering atomic claims using a threshold calibrated on held-out data; however, the informativeness of the final output is not guaranteed. We systematically analyze the reliability and usefulness of conformal factuality for RAG-based LLMs across generation, scoring, calibration, robustness, and efficiency. We propose novel informativeness-aware metrics that better reflect task utility under conformal filtering. Across three benchmarks and multiple model families, we find that (i) conformal filtering suffers from low usefulness at high factuality levels due to vacuous outputs, (ii) the conformal factuality guarantee is not robust to distribution shifts and distractors, a limitation that arises because calibration data must closely match deployment conditions, and (iii) lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over $100\times$ fewer FLOPs. Overall, our results expose factuality-informativeness trade-offs and the fragility of the conformal filtering framework under distribution shifts and distractors, highlighting the need for new approaches to reliability with robustness and usefulness as key metrics, and provide actionable guidance for building RAG pipelines that are both reliable and computationally efficient.
中文标题/摘要
标题:基于RAG的LLM的共形事实性稳健吗?新颖的度量标准与系统见解
大型语言模型(LLMs)经常产生幻觉,限制了它们在知识密集型应用中的可靠性。检索增强生成(RAG)和共形事实性已经作为解决这一限制的潜在方法出现。虽然RAG旨在使响应基于检索到的证据,但它无法为最终输出的正确性提供统计保证。共形事实性过滤通过在保留数据上校准阈值来对原子声明进行评分和过滤,从而提供无分布统计可靠性,然而,最终输出的信息量没有得到保证。我们系统地分析了共形事实性在生成、评分、校准、稳健性和效率方面的可靠性和有用性。我们提出了新的信息量感知度量标准,更好地反映了共形过滤下的任务实用性。在三个基准和多个模型家族中,我们发现:(i) 共形过滤在高事实性水平下由于空洞输出而具有低有用性;(ii) 共形事实性保证对分布偏移和干扰项不稳健,突显了校准数据必须与部署条件紧密匹配的局限性;(iii) 轻量级的基于蕴含的验证器与基于LLM的模型置信度评分器表现相当或更优,同时所需FLOPs减少100倍以上。总体而言,我们的结果揭示了事实性-信息量权衡以及共形过滤框架在分布偏移和干扰项下的脆弱性,强调了需要以稳健性和有用性为关键指标的新的可靠性方法,并为构建既可靠又计算高效的RAG管道提供了可操作的指导。
Summary / 总结
This study evaluates the robustness of conformal factuality for RAG-based LLMs by proposing new informativeness-aware metrics and conducting a systematic analysis. It finds that conformal filtering often produces vacuous outputs at high factuality levels, that its guarantee is not robust to distribution shifts and distractors, and that lightweight entailment-based verifiers match or outperform LLM-based model confidence scorers while requiring over 100× fewer FLOPs.
该研究通过提出新的信息量感知度量标准并进行系统分析,评估了基于检索增强生成(RAG)的大语言模型(LLMs)中共形事实性的稳健性。研究发现,共形过滤在高事实性水平时经常产生空洞的输出,其保证对分布偏移和干扰项缺乏稳健性,并且轻量级的基于蕴含的验证器与基于LLM的模型置信度评分器表现相当或更优,同时所需FLOPs减少100倍以上。
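The threshold calibration described in the abstract can be sketched in a few lines. This is a minimal illustration of split-conformal calibration, assuming each calibration response is a list of (score, is_factual) pairs and that a higher score means a claim is more likely factual. The names `calibrate_threshold` and `filter_claims` are hypothetical, and the paper's exact scoring and calibration procedure may differ.

```python
import math


def calibrate_threshold(calibration_sets, alpha):
    """Return a score threshold from held-out data.

    calibration_sets: one list per response, holding (score, is_factual) pairs.
    alpha: tolerated probability that a filtered output keeps a false claim.
    """
    # Per-response nonconformity: the highest score given to any false claim.
    # Responses with no false claims contribute -inf (any threshold is safe).
    worst_false = sorted(
        max((s for s, ok in claims if not ok), default=float("-inf"))
        for claims in calibration_sets
    )
    n = len(worst_false)
    k = math.ceil((n + 1) * (1 - alpha))  # conformal rank with finite-sample correction
    if k > n:
        # Too few calibration points for this alpha: nothing can be kept safely.
        return float("inf")
    return worst_false[k - 1]


def filter_claims(claims, threshold):
    """Keep the text of claims whose score strictly exceeds the threshold."""
    return [text for text, score in claims if score > threshold]


calibration = [
    [(0.9, True), (0.2, False)],
    [(0.8, True), (0.6, False)],
    [(0.7, True)],
    [(0.95, True), (0.4, False)],
]
threshold = calibrate_threshold(calibration, alpha=0.5)
kept = filter_claims([("claim a", 0.9), ("claim b", 0.4), ("claim c", 0.1)], threshold)
```

With a small `alpha` and little calibration data, the threshold becomes infinite and every claim is dropped, which is one simple way the vacuous outputs mentioned in the abstract can arise.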
WildDepth: A Multimodal Dataset for 3D Wildlife Perception and Depth Estimation
Authors: Muhammad Aamir, Naoya Muramatsu, Sangyun Shin, Matthew Wijers, Jiaxing Jhong, Xinyu Hou, Amir Patel, Andrew Markham
First: 2026-03-17T17:19:43+00:00 · Latest: 2026-03-17T17:19:43+00:00
Abstract
Depth estimation and 3D reconstruction have been extensively studied as core topics in computer vision. Starting from rigid objects with relatively simple geometric shapes, such as vehicles, research has expanded to general objects, including challenging deformable objects such as humans and animals. However, for animals in particular, most existing models are trained on datasets that lack metric scale, even though metric scale can help validate image-only models. To address this limitation, we present WildDepth, a multimodal dataset and benchmark suite for depth estimation, behavior detection, and 3D reconstruction across diverse categories of animals, in environments ranging from domestic to wild, with synchronized RGB and LiDAR data. Experimental results show that the use of multimodal data improves depth reliability by up to 10% in RMSE, while RGB-LiDAR fusion enhances 3D reconstruction fidelity by 12% in Chamfer distance. By releasing WildDepth and its benchmarks, we aim to foster robust multimodal perception systems that generalize across domains.
中文标题/摘要
标题:WildDepth:一种用于野生动物感知和深度估计的多模态数据集
深度估计和三维重建一直是计算机视觉中的核心研究课题。从具有相对简单几何形状的刚性物体,如车辆,研究已经扩展到处理一般物体,包括具有挑战性的可变形物体,如人类和动物。然而,对于动物而言,现有的大多数模型都是基于没有度量尺度的数据集进行训练的,这有助于验证仅基于图像的模型。为了解决这一局限性,我们提出了WildDepth,这是一个多模态数据集和基准套件,用于从不同种类的动物中进行深度估计、行为检测和三维重建,这些动物包括从家养到野生环境,具有同步的RGB和LiDAR数据。实验结果表明,使用多模态数据可以将RMSE的深度可靠性提高多达10%,而RGB-LiDAR融合可以将Chamfer距离的三维重建精度提高12%。通过发布WildDepth及其基准测试,我们旨在促进跨领域的稳健多模态感知系统。
Summary / 总结
The work aims to improve depth estimation and 3D reconstruction for animals, which are deformable and for which most existing models are trained on datasets without metric scale. It introduces WildDepth, a multimodal dataset and benchmark suite with synchronized RGB and LiDAR data across diverse animal categories. Experiments show that multimodal data improves depth reliability by up to 10% in RMSE and that RGB-LiDAR fusion improves 3D reconstruction fidelity by 12% in Chamfer distance.
研究旨在通过解决现有模型基于无度量尺度数据集训练的局限性,提高动物的深度估计和3D重建。该研究引入了WildDepth,这是一个包含同步RGB和LiDAR数据的多模态数据集,涵盖了各种动物类别。结果显示,使用多模态数据可将RMSE降低最多10%,并通过RGB-LiDAR融合提高3D重建精度12%。
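For readers unfamiliar with the two metrics quoted above, the following sketch shows one common way to compute them. It is an illustration only: Chamfer distance conventions vary (squared or unsquared distances, sums or means), and the abstract does not say which variant the WildDepth benchmark uses. The function names are hypothetical.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length lists of depths (same units)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets.

    Assumed convention: mean nearest-neighbour Euclidean distance from a to b,
    plus the same quantity from b to a. Brute force, O(len(a) * len(b)).
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


depth_error = rmse([1.0, 2.0], [1.0, 4.0])
shape_error = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

Lower is better for both metrics, so an "improvement" in either corresponds to a reduction in the computed value.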
Test-Time Adaptation via Many-Shot Prompting: Benefits, Limits, and Pitfalls
Authors: Shubhangi Upasani, Chen Wu, Jay Rainton, Bo Li, Urmish Thakker, Changran Hu, Qizheng Zhang
First: 2026-03-06T02:25:02+00:00 · Latest: 2026-03-17T17:10:32+00:00
Abstract
Test-time adaptation enables large language models (LLMs) to modify their behavior at inference without updating model parameters. A common approach is many-shot prompting, where large numbers of in-context learning (ICL) examples are injected as an input-space test-time update. Although performance can improve as more demonstrations are added, the reliability and limits of this update mechanism remain poorly understood, particularly for open-source models. We present an empirical study of many-shot prompting across tasks and model backbones, analyzing how performance varies with update magnitude, example ordering, and selection policy. We further study Dynamic and Reinforced ICL as alternative test-time update strategies that control which information is injected and how it constrains model behavior. We find that many-shot prompting is effective for structured tasks where demonstrations provide high information gain, but is highly sensitive to selection strategy and often shows limited benefits for open-ended generation tasks. Overall, we characterize the practical limits of prompt-based test-time adaptation and outline when input-space updates are beneficial versus harmful.
中文标题/摘要
标题:测试时适应通过多示例提示:优势、局限与风险
测试时适应使大型语言模型(LLMs)在推理时能够修改其行为而不更新模型参数。一种常见方法是多示例提示,即将大量上下文学习(ICL)示例作为输入空间的测试时更新注入。尽管随着示例数量的增加性能可以提高,但这种更新机制的可靠性和局限性仍然知之甚少,尤其是对于开源模型。我们对多示例提示在不同任务和模型架构上的进行了实证研究,分析了性能随更新幅度、示例排序和选择策略的变化。我们还研究了动态ICL和强化ICL作为替代的测试时更新策略,这些策略控制了注入哪些信息以及如何约束模型行为。我们发现,多示例提示对于结构化任务有效,其中示例提供了高信息增益,但对于开放生成任务则高度敏感且通常显示出有限的好处。总体而言,我们界定了基于提示的测试时适应的实用局限,并指出了输入空间更新是有益还是有害的情况。
Summary / 总结
The study explores test-time adaptation via many-shot prompting in large language models (LLMs), where numerous in-context learning examples are used to update model behavior without changing parameters. It finds that many-shot prompting is effective for structured tasks but is highly sensitive to the selection strategy and shows limited benefits for open-ended generation tasks. The research also compares it with Dynamic and Reinforced ICL to understand the limits and practicality of prompt-based test-time adaptation.
研究探讨了大规模语言模型(LLM)在不同任务和模型架构下的many-shot提示作为测试时适应技术的有效性。研究发现,many-shot提示对于提供显著信息增益的结构化任务是有效的,但对于开放生成任务则显示出有限的好处。研究还强调了many-shot提示对选择策略的敏感性和其对开源模型的局限性。
DexGrasp-Zero: A Morphology-Aligned Policy for Zero-Shot Cross-Embodiment Dexterous Grasping
Authors: Yuliang Wu, Yanhan Lin, WengKit Lao, Yuhao Lin, Yi-Lin Wei, Wei-Shi Zheng, Ancong Wu
First: 2026-03-17T17:10:29+00:00 · Latest: 2026-03-17T17:10:29+00:00
Abstract
To meet the demands of increasingly diverse dexterous hand hardware, it is crucial to develop a policy that enables zero-shot cross-embodiment grasping without redundant re-learning. Cross-embodiment alignment is challenging due to heterogeneous hand kinematics and physical constraints. Existing approaches typically predict intermediate motion targets and retarget them to each embodiment, which may introduce errors and violate embodiment-specific limits, hindering transfer across diverse hands. To overcome these limitations, we propose DexGrasp-Zero, a policy that learns universal grasping skills from diverse embodiments, enabling zero-shot transfer to unseen hands. We first introduce a morphology-aligned graph representation that maps each hand's kinematic keypoints to anatomically grounded nodes and equips each node with tri-axial orthogonal motion primitives, enabling structural and semantic alignment across different morphologies. Relying on this graph-based representation, we design a Morphology-Aligned Graph Convolutional Network (MAGCN) to encode the graph for policy learning. MAGCN incorporates a Physical Property Injection mechanism that fuses hand-specific physical constraints into the graph features, enabling adaptive compensation for varying link lengths and actuation limits for precise and stable grasping. Our extensive simulation evaluations on the YCB dataset demonstrate that our policy, jointly trained on four heterogeneous hands (Allegro, Shadow, Schunk, Ability), achieves an 85% zero-shot success rate on unseen hardware (LEAP, Inspire), outperforming the state-of-the-art method by 59.5%. Real-world experiments further evaluate our policy on three robot platforms (LEAP, Inspire, Revo2), achieving an 82% average success rate on unseen objects.
中文标题/摘要
标题:DexGrasp-Zero:一种形态对齐的零样本跨体态灵巧抓取策略
为了满足日益多样化的灵巧手硬件需求,开发一种无需冗余重新学习即可实现零样本跨体态抓取的策略至关重要。跨体态对齐由于手部运动学和物理约束的异质性而具有挑战性。现有方法通常预测中间运动目标并将其重新定向到每个体态,这可能会引入错误并违反特定体态的限制,阻碍在不同手部之间的转移。为克服这些限制,我们提出了一种名为DexGrasp-Zero的策略,该策略从多种体态中学习通用的抓取技能,从而实现对未见过的手部的零样本转移。我们首先引入了一种形态对齐的图表示方法,将每个手的运动学关键点映射到解剖学基础的节点上,并为每个节点配备三轴正交运动基元,从而在不同形态之间实现结构和语义对齐。基于这种图表示方法,我们设计了一种形态对齐图卷积网络(MAGCN)来编码图以供策略学习。MAGCN 包含一种物理属性注入机制,将手部特定的物理约束融合到图特征中,从而实现对不同连杆长度和驱动限制的自适应补偿,以实现精确和稳定的抓取。我们在 YCB 数据集上的广泛仿真评估表明,我们的策略在四个异质手(Allegro、Shadow、Schunk、Ability)上联合训练后,在未见过的硬件(LEAP、Inspire)上实现了 85% 的零样本成功率,优于最先进的方法 59.5%。进一步的实地实验在三个机器人平台上(LEAP、Inspire、Revo2)评估了我们的策略,在未见过的物体上平均成功率为 82%。
Summary / 总结
DexGrasp-Zero is a policy designed for zero-shot cross-embodiment dexterous grasping by learning universal skills from diverse hand morphologies. It uses a morphology-aligned graph representation and a Morphology-Aligned Graph Convolutional Network (MAGCN) to encode the graph for policy learning, incorporating physical property injection to adapt to varying hand constraints. The policy achieves an 85% zero-shot success rate on unseen hardware in simulations and an 82% average success rate in real-world experiments, outperforming existing methods.
DexGrasp-Zero 是一种用于从多种手部形态中学习通用技能的策略,以实现零样本跨体态灵巧抓取。它使用形态对齐的图表示和形态对齐的图卷积网络(MAGCN)来编码图以进行策略学习,并结合物理属性注入来处理不同的手部约束。实验表明,DexGrasp-Zero 在未见过的硬件上实现了 85% 的零样本成功率,比现有方法高出 59.5%。
Gym-V: A Unified Vision Environment System for Agentic Vision Research
Authors: Fanqing Meng, Lingxiao Du, Jiawei Gu, Jiaqi Liao, Linjie Li, Zijian Wu, Xiangyan Liu, Ziqi Zhao, Mengkang Hu, Yue Zhang, Zichen Liu, Jiaheng Zhang, Michael Qizhe Shieh
First: 2026-03-16T15:37:07+00:00 · Latest: 2026-03-17T17:07:16+00:00
Abstract
As agentic systems increasingly rely on reinforcement learning from verifiable rewards, standardized "gym" infrastructure has become essential for rapid iteration, reproducibility, and fair comparison. Vision agents lack such infrastructure, limiting systematic study of what drives their learning and where current models fall short. We introduce Gym-V, a unified platform of 179 procedurally generated visual environments across 10 domains with controllable difficulty, enabling controlled experiments that were previously infeasible across fragmented toolkits. Using it, we find that observation scaffolding is more decisive for training success than the choice of RL algorithm, with captions and game rules determining whether learning succeeds at all. Cross-domain transfer experiments further show that training on diverse task categories generalizes broadly while narrow training can cause negative transfer, with multi-turn interaction amplifying all of these effects. Gym-V is released as a convenient foundation for training environments and evaluation toolkits, aiming to accelerate future research on agentic VLMs.
中文标题/摘要
标题:Gym-V:一种统一的视觉环境系统,用于自主视觉研究
随着自主系统越来越多地依赖于可验证奖励的强化学习,标准化的“gym”基础设施已成为快速迭代、可重复性和公平比较的必要条件。视觉代理缺乏这样的基础设施,限制了对其学习驱动因素的系统研究以及当前模型的不足之处。我们引入了**Gym-V**,这是一种包含10个领域179个可程序生成的视觉环境的统一平台,具有可控难度,使以前在分散的工具包中不可行的受控实验成为可能。使用它,我们发现观察支架比选择RL算法对训练成功更为关键,其中说明和游戏规则决定了学习是否成功。跨领域迁移实验进一步表明,针对多样任务类别进行训练可以广泛泛化,而狭窄的训练可能导致负迁移,多轮交互进一步放大了这些效果。Gym-V 作为训练环境和评估工具的基础发布,旨在加速对自主VLMs 的未来研究。
Summary / 总结
The research aims to provide a standardized environment for vision agents to facilitate the study of their learning processes and to enable fair comparisons. The main method involves creating a unified platform, Gym-V, with 179 procedurally generated visual environments across 10 domains. Key findings include that observation scaffolding is more critical for training success than the choice of reinforcement learning algorithm, and that cross-domain transfer experiments show diverse training generalizes better than narrow training, with multi-turn interaction amplifying these effects.
研究引入了Gym-V,这是一个包含10个领域共179个程序生成的视觉环境的统一平台,旨在促进对视觉代理学习过程的系统研究。研究发现,观察辅助比选择强化学习算法对训练成功更为关键,而描述和游戏规则显著影响学习结果。跨领域的迁移实验表明,对多样任务类别的训练能够很好地泛化,而狭窄的训练可能导致负迁移,多轮交互会放大这些效果。Gym-V旨在作为未来研究中训练环境和评估工具的基础,以加速对代理视觉学习模型的研究。
RaDAR: Relation-aware Diffusion-Asymmetric Graph Contrastive Learning for Recommendation
Authors: Yixuan Huang, Jiawei Chen, Shengfan Zhang, Zongsheng Cao
Venue: WWW 2026
First: 2026-03-17T17:05:23+00:00 · Latest: 2026-03-17T17:05:23+00:00
Comments: 12 pages, 5 figures. Accepted at WWW 2026
Abstract
Collaborative filtering (CF) recommendation has been significantly advanced by integrating Graph Neural Networks (GNNs) and Graph Contrastive Learning (GCL). However, (i) random edge perturbations often distort critical structural signals and degrade semantic consistency across augmented views, and (ii) data sparsity hampers the propagation of collaborative signals, limiting generalization.
To tackle these challenges, we propose RaDAR (Relation-aware Diffusion-Asymmetric Graph Contrastive Learning Framework for Recommendation Systems), a novel framework that combines two complementary view generation mechanisms: a graph generative model to capture global structure and a relation-aware denoising model to refine noisy edges.
RaDAR introduces three key innovations: (1) asymmetric contrastive learning with global negative sampling to maintain semantic alignment while suppressing noise; (2) diffusion-guided augmentation, which employs progressive noise injection and denoising for enhanced robustness; and (3) relation-aware edge refinement, dynamically adjusting edge weights based on latent node semantics.
Extensive experiments on three public benchmarks demonstrate that RaDAR consistently outperforms state-of-the-art methods, particularly under noisy and sparse conditions.
中文标题/摘要
标题:RaDAR:基于关系的扩散不对称图对比学习推荐
通过结合图神经网络(GNN)和图对比学习(GCL),协作过滤(CF)推荐得到了显著提升。然而,(i)随机边扰动往往扭曲了关键的结构信号,降低了不同增强视图之间的语义一致性;(ii)数据稀疏性限制了协作信号的传播,影响了泛化能力。
为了解决这些挑战,我们提出了RaDAR(推荐系统中基于关系的扩散不对称图对比学习框架),这是一种结合了两种互补视图生成机制的新框架:图生成模型以捕捉全局结构,以及关系感知的去噪模型以细化噪声边。
RaDAR 引入了三项关键创新:(1)具有全局负样本的不对称对比学习,以保持语义对齐并抑制噪声;(2)扩散引导增强,通过渐进式噪声注入和去噪增强鲁棒性;(3)关系感知边细化,根据潜在节点语义动态调整边权重。
在三个公开基准上的广泛实验表明,RaDAR 始终优于最先进的方法,尤其是在噪声和稀疏条件下。
Summary / 总结
RaDAR is a novel framework for recommendation systems that addresses the challenges of random edge perturbations and data sparsity by integrating a graph generative model and a relation-aware denoising model. It introduces asymmetric contrastive learning, diffusion-guided augmentation, and relation-aware edge refinement. Experimental results show that RaDAR outperforms existing methods, especially in noisy and sparse conditions.
RaDAR 是一种新型推荐系统框架,旨在解决传统图神经网络和图对比学习的局限性。它结合了图生成模型以捕捉全局结构和关系感知的去噪模型以精炼噪声边。关键创新包括不对称对比学习、全局负采样,扩散引导增强和关系感知边精炼。实验表明,RaDAR 在嘈杂和稀疏条件下优于现有方法。
Adaptive Moments are Surprisingly Effective for Plug-and-Play Diffusion Sampling
Authors: Christian Belardi, Justin Lovelace, Kilian Q. Weinberger, Carla P. Gomes
First: 2026-03-17T17:04:07+00:00 · Latest: 2026-03-17T17:04:07+00:00
Abstract
Guided diffusion sampling relies on approximating often intractable likelihood scores, which introduces significant noise into the sampling dynamics. We propose using adaptive moment estimation to stabilize these noisy likelihood scores during sampling. Despite its simplicity, our approach achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complicated methods, which are often computationally more expensive. We provide empirical analysis of our method on both synthetic and real data, demonstrating that mitigating gradient noise through adaptive moments offers an effective way to improve alignment.
中文标题/摘要
标题:自适应矩在即插即用扩散采样中出人意料的有效性
引导扩散采样依赖于近似难以计算的似然分数,这会显著增加采样动态中的噪声。我们提出使用自适应矩估计来稳定采样过程中的这些噪声似然分数。尽管方法简单,但我们的方法在图像恢复和条件生成任务上达到了最先进的效果,优于更复杂的方法,后者通常计算成本更高。我们在合成和真实数据上提供了我们的方法的实证分析,证明通过自适应矩减轻梯度噪声是一种有效的改进对齐方式。
Summary / 总结
The paper addresses noise in guided diffusion sampling by using adaptive moment estimation to stabilize the approximated likelihood scores. This simple method achieves state-of-the-art results on image restoration and class-conditional generation tasks, outperforming more complex alternatives that are often more computationally expensive. Empirical analysis on both synthetic and real data shows that mitigating gradient noise through adaptive moments improves alignment.
论文通过提出使用自适应矩估计来稳定似然分数中的噪声,解决了引导扩散采样中的噪声问题。尽管方法简单,但在图像恢复和条件生成任务中仍达到了最先进的效果,优于更复杂的、通常计算成本更高的方法。在合成和真实数据上的实证分析表明,通过自适应矩估计减少梯度噪声可以提高对齐效果。
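Adaptive moment estimation is the running-average scheme familiar from the Adam optimizer. The sketch below applies it to a stream of noisy gradient vectors to show the smoothing effect. It is an assumption-laden illustration: the abstract does not specify how the moments are integrated into the sampler, and the class name `AdaptiveMoments` and its default hyperparameters are hypothetical.

```python
import math


class AdaptiveMoments:
    """Adam-style first and second moment estimates for a noisy gradient signal."""

    def __init__(self, dim, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = [0.0] * dim  # running mean of gradients
        self.v = [0.0] * dim  # running mean of squared gradients
        self.t = 0            # step counter for bias correction

    def step(self, grad):
        """Update the moments with one noisy gradient and return the stabilized direction."""
        self.t += 1
        b1, b2 = self.beta1, self.beta2
        self.m = [b1 * m + (1 - b1) * g for m, g in zip(self.m, grad)]
        self.v = [b2 * v + (1 - b2) * g * g for v, g in zip(self.v, grad)]
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = [m / (1 - b1 ** self.t) for m in self.m]
        v_hat = [v / (1 - b2 ** self.t) for v in self.v]
        return [m / (math.sqrt(v) + self.eps) for m, v in zip(m_hat, v_hat)]


moments = AdaptiveMoments(dim=2)
direction = moments.step([2.0, -4.0])  # one noisy guidance gradient
```

In a guided sampler, the returned direction would stand in for the raw likelihood-score estimate at each step; how it is scaled and combined with the denoiser output is left open here.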
Efficient Continual Learning in Language Models via Thalamically Routed Cortical Columns
Authors: Afshin Khadangi
First: 2026-02-25T23:38:16+00:00 · Latest: 2026-03-17T17:02:11+00:00
Abstract
Large language models deployed in the wild must adapt to evolving data, user behavior, and task mixtures without erasing previously acquired capabilities. In practice, this remains difficult: sequential updates induce catastrophic forgetting, while many stabilization methods rely on external procedures that are costly, brittle, or difficult to scale. We present TRC$^{2}$ (Thalamically Routed Cortical Columns), a decoder-only architecture that makes continual adaptation a property of the backbone itself. TRC$^{2}$ combines stacked cortical columns with a thalamic modulatory pathway for selective inter-column communication and a hippocampal pathway for event-selective retrieval, delayed surprise-based writing, and replay-driven consolidation. This design localizes fast plasticity while preserving a slower stable computation pathway. We further introduce a causal memory-update scheme and an online replay controller that adjusts consolidation strength from measured forgetting. Across a task-sequential language-modeling stream over C4, WikiText-103, and GSM8K, TRC$^{2}$ consistently improves task-boundary modeling quality and substantially reduces cumulative forgetting relative to Transformer, Mamba, MoE, and DeepSeek baselines trained under the same pipeline. Ablations show that the thalamic and hippocampal components are central to the retention gains, while the full model remains competitive in throughput and training cost.
中文标题/摘要
标题:通过丘脑路由皮层柱实现语言模型的高效持续学习
部署在野外的大型语言模型必须适应不断变化的数据、用户行为和任务混合,而不抹去之前获得的能力。实际上,这仍然很困难:连续更新会导致灾难性遗忘,而许多稳定方法依赖于外部程序,这些程序成本高、脆弱或难以扩展。我们提出了TRC$^{2}$(丘脑路由皮层柱),这是一种仅解码器架构,使持续适应成为其核心结构的属性。TRC$^{2}$结合了堆叠的皮层柱与丘脑调节路径进行选择性跨柱通信,以及海马路径进行事件选择性检索、延迟惊讶驱动的写入和回放驱动的巩固。此设计局部化快速可塑性,同时保留一个较慢的稳定计算路径。我们还引入了一种因果记忆更新方案和一个在线回放控制器,根据测量的遗忘调整巩固强度。在C4、WikiText-103和GSM8K的任务序列语言建模流中,TRC$^{2}$在任务边界建模质量上始终优于Transformer、Mamba、MoE和DeepSeek基线,并且在相同管道下训练时显著减少了累积遗忘。消融实验表明,丘脑和海马组件对于保持收益至关重要,而完整模型在吞吐量和训练成本方面仍具有竞争力。
Summary / 总结
The paper addresses the challenge of continual learning in language models by proposing TRC$^{2}$ (Thalamically Routed Cortical Columns), which integrates stacked cortical columns with thalamic and hippocampal pathways for selective communication and event-based memory updates. This architecture improves task-boundary modeling quality and reduces cumulative forgetting compared to several baselines like Transformer and Mamba. Ablation studies confirm the importance of the thalamic and hippocampal components in achieving these improvements while maintaining competitive throughput and training costs.
论文提出TRC$^{2}$(Thalamically Routed Cortical Columns)架构,通过结合堆叠的皮层柱与thalamic和hippocampal路径,实现选择性通信和基于事件的记忆更新,以解决语言模型的持续学习问题。该架构在任务边界建模质量上有所提升,并且相对于Transformer、Mamba等基线模型,显著减少了累积遗忘。消融实验表明thalamic和hippocampal组件对于保持这些改进至关重要,同时模型在吞吐量和训练成本方面仍具有竞争力。
V-Co: A Closer Look at Visual Representation Alignment via Co-Denoising
Authors: Han Lin, Xichen Pan, Zun Wang, Yue Zhang, Chu Wang, Jaemin Cho, Mohit Bansal
First: 2026-03-17T17:01:54+00:00 · Latest: 2026-03-17T17:01:54+00:00
Comments: code: https://github.com/HL-hanlin/V-Co
Abstract
Pixel-space diffusion has recently re-emerged as a strong alternative to latent diffusion, enabling high-quality generation without pretrained autoencoders. However, standard pixel-space diffusion models receive relatively weak semantic supervision and are not explicitly designed to capture high-level visual structure. Recent representation-alignment methods (e.g., REPA) suggest that pretrained visual features can substantially improve diffusion training, and visual co-denoising has emerged as a promising direction for incorporating such features into the generative process. However, existing co-denoising approaches often entangle multiple design choices, making it unclear which design choices are truly essential. Therefore, we present V-Co, a systematic study of visual co-denoising in a unified JiT-based framework. This controlled setting allows us to isolate the ingredients that make visual co-denoising effective. Our study reveals four key ingredients for effective visual co-denoising. First, preserving feature-specific computation while enabling flexible cross-stream interaction motivates a fully dual-stream architecture. Second, effective classifier-free guidance (CFG) requires a structurally defined unconditional prediction. Third, stronger semantic supervision is best provided by a perceptual-drifting hybrid loss. Fourth, stable co-denoising further requires proper cross-stream calibration, which we realize through RMS-based feature rescaling. Together, these findings yield a simple recipe for visual co-denoising. Experiments on ImageNet-256 show that, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs, offering practical guidance for future representation-aligned generative models.
中文标题/摘要
标题:V-Co:视觉表示对齐的共去噪方法研究
像素空间扩散最近重新成为潜在扩散的有力替代方案,能够在不依赖预训练自编码器的情况下实现高质量生成。然而,标准的像素空间扩散模型接受的语义监督相对较弱,也并未明确设计用于捕捉高层次的视觉结构。最近的表示对齐方法(例如REPA)表明,预训练的视觉特征可以显著改善扩散训练,而视觉共去噪已成为将这些特征融入生成过程的一个有前途的方向。然而,现有的共去噪方法往往将多种设计选择交织在一起,使得难以判断哪些设计选择是真正必要的。因此,我们提出了V-Co,在统一的基于JiT的框架下对视觉共去噪进行系统研究。这种受控设置使我们能够分离出使视觉共去噪有效的要素。我们的研究揭示了有效视觉共去噪的四个关键要素。首先,在保留特征特定计算的同时实现灵活的跨流交互,这促使我们采用完全双流架构。其次,有效的无分类器引导(CFG)需要结构上定义的无条件预测。第三,更强的语义监督最好由感知漂移混合损失提供。第四,稳定的共去噪还需要适当的跨流校准,我们通过基于RMS的特征重缩放来实现。这些发现共同给出了一个简单的视觉共去噪方案。在ImageNet-256上的实验表明,在模型规模相当的情况下,V-Co以更少的训练轮数优于其所基于的像素空间扩散基线和强大的既有像素扩散方法,为未来的表示对齐生成模型提供了实用指导。
Summary / 总结
The research systematically studies visual co-denoising in pixel-space diffusion models to identify which design choices are essential. It introduces V-Co, a unified JiT-based framework that isolates these choices. Key findings are the need for a fully dual-stream architecture, a structurally defined unconditional prediction for classifier-free guidance, a perceptual-drifting hybrid loss for semantic supervision, and RMS-based feature rescaling for cross-stream calibration. On ImageNet-256, at comparable model sizes, V-Co outperforms the underlying pixel-space diffusion baseline and strong prior pixel-diffusion methods while using fewer training epochs.
研究旨在通过系统研究视觉共去噪的组件来提高像素空间扩散模型的效果。研究引入了V-Co,一个统一框架,以隔离视觉共去噪的关键设计选择。主要发现包括双流架构的重要性、结构定义的无条件预测、感知漂移混合损失以及基于RMS的特征缩放以实现稳定的共去噪。实验表明,V-Co在更少的训练周期内优于现有方法,提供了未来表征对齐生成模型的实用指导。
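The fourth ingredient, RMS-based feature rescaling, can be illustrated with a short sketch. The abstract names the technique but not its exact form, so this example assumes the simplest reading: scale one feature stream so that its root-mean-square magnitude matches that of a reference stream. The function names `rms` and `rescale_to_match` are hypothetical.

```python
import math


def rms(x):
    """Root-mean-square magnitude of a feature vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rescale_to_match(features, reference, eps=1e-8):
    """Scale `features` so that its RMS matches the RMS of `reference`."""
    scale = rms(reference) / (rms(features) + eps)  # eps guards against all-zero input
    return [v * scale for v in features]


pixel_stream = [3.0, 4.0]       # hypothetical pixel-stream activations
semantic_stream = [30.0, 40.0]  # hypothetical feature-stream activations, 10x larger
calibrated = rescale_to_match(semantic_stream, pixel_stream)
```

Matching magnitudes in this way keeps one stream from dominating the other when the two interact, which is the calibration role the abstract attributes to the rescaling step.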