arXiv Paper Digest

Latest digest

ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

Authors: Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue

First: 2026-03-26T17:59:59+00:00 · Latest: 2026-03-26T17:59:59+00:00

Comments: Project Page: https://luo0207.github.io/ShotStream/ Code: https://github.com/KlingAIResearch/ShotStream

Abstract

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame generation. By reformulating the task as next-shot generation conditioned on historical context, ShotStream allows users to dynamically instruct ongoing narratives via streaming prompts. We achieve this by first fine-tuning a text-to-video model into a bidirectional next-shot generator, which is then distilled into a causal student via Distribution Matching Distillation. To overcome the challenges of inter-shot consistency and error accumulation inherent in autoregressive generation, we introduce two key innovations. First, a dual-cache memory mechanism preserves visual coherence: a global context cache retains conditional frames for inter-shot consistency, while a local context cache holds generated frames within the current shot for intra-shot consistency. A RoPE discontinuity indicator explicitly distinguishes the two caches to eliminate ambiguity. Second, to mitigate error accumulation, we propose a two-stage distillation strategy. This begins with intra-shot self-forcing conditioned on ground-truth historical shots and progressively extends to inter-shot self-forcing using self-generated histories, effectively bridging the train-test gap. Extensive experiments demonstrate that ShotStream generates coherent multi-shot videos with sub-second latency, achieving 16 FPS on a single GPU. It matches or exceeds the quality of slower bidirectional models, paving the way for real-time interactive storytelling. Training and inference code, as well as the models, are available on our project page.

Summary

ShotStream is a causal multi-shot architecture for interactive storytelling that targets the limited interactivity and high latency of current bidirectional models. It reformulates the task as next-shot generation conditioned on historical context, so users can steer an ongoing narrative through streaming prompts. Its two key components are a dual-cache memory mechanism for visual coherence and a two-stage distillation strategy that mitigates error accumulation. The result is coherent multi-shot video at sub-second latency and 16 FPS on a single GPU, with quality that matches or exceeds slower bidirectional models.
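
A minimal sketch of how the dual-cache memory described above could be organized. Everything here is illustrative and not taken from the ShotStream code: the class name `DualCache`, the `max_global` cap, and the rule of promoting the last frame of each finished shot to the global cache are assumptions, and the string tag on each entry only stands in for the RoPE discontinuity indicator.

```python
from collections import deque


class DualCache:
    """Toy dual-cache memory (hypothetical layout, not the paper's implementation)."""

    def __init__(self, max_global=4):
        # Global cache: conditional frames kept across shots (inter-shot consistency).
        self.global_frames = deque(maxlen=max_global)
        # Local cache: frames generated so far in the current shot (intra-shot consistency).
        self.local_frames = []

    def add_generated(self, frame):
        self.local_frames.append(frame)

    def end_shot(self, keep=1):
        # Assumed rule: promote the last `keep` frames of the finished shot, then reset.
        start = max(0, len(self.local_frames) - keep)
        self.global_frames.extend(self.local_frames[start:])
        self.local_frames.clear()

    def context(self):
        # The tag marks which cache an entry came from, so the two are never confused.
        return ([("global", f) for f in self.global_frames]
                + [("local", f) for f in self.local_frames])


cache = DualCache(max_global=2)
for shot in range(3):
    for i in range(3):
        cache.add_generated(f"s{shot}f{i}")
    if shot < 2:
        cache.end_shot(keep=1)  # the third shot is still in progress

print(cache.context())
```

While the third shot is being generated, the context holds one promoted frame from each earlier shot plus the three frames of the current shot.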

Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

Authors: Yixing Lao, Xuyang Bai, Xiaoyang Wu, Nuoyuan Yan, Zixin Luo, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Shiwei Li, Hengshuang Zhao

First: 2026-03-26T17:59:59+00:00 · Latest: 2026-03-26T17:59:59+00:00

Abstract

Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Texture More), a feed-forward framework that overcomes this resolution scaling barrier. By predicting compact Gaussian primitives coupled with per-primitive textures, LGTM decouples geometric complexity from rendering resolution. This approach enables high-fidelity 4K novel view synthesis without per-scene optimization, a capability previously out of reach for feed-forward methods, all while using significantly fewer Gaussian primitives. Project page: https://yxlao.github.io/lgtm/
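
The scaling argument in the abstract can be checked with simple arithmetic. The sketch below assumes one primitive per pixel per input view, which is what "pixel-aligned" suggests; the function name and the resolution table are illustrative, and 4K is taken to mean 3840x2160.

```python
def pixel_aligned_count(width, height, views=1):
    """Number of primitives if every pixel of every view emits one primitive."""
    return width * height * views


resolutions = {"540p": (960, 540), "1080p": (1920, 1080), "4K": (3840, 2160)}
base = pixel_aligned_count(*resolutions["540p"])

for name, (w, h) in resolutions.items():
    count = pixel_aligned_count(w, h)
    # Doubling each side of the image multiplies the primitive count by four.
    print(f"{name}: {count} primitives ({count // base}x the 540p count)")
```

Each doubling of the side length quadruples the count, so a single 4K view already implies about 8.3 million pixel-aligned primitives. This is the growth that per-primitive textures are meant to avoid.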

RefAlign: Representation Alignment for Reference-to-Video Generation

Authors: Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang

First: 2026-03-26T17:59:57+00:00 · Latest: 2026-03-26T17:59:57+00:00

Comments: 17 pages, 11 figures

Abstract

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additional high-level semantic or cross-modal features alongside the VAE latent representation of the reference image and jointly feed them into the diffusion Transformer (DiT). These auxiliary representations provide semantic guidance and act as implicit alignment signals, which can partially alleviate pixel-level information leakage in the VAE latent space. However, they may still struggle to address copy-paste artifacts and multi-subject confusion caused by modality mismatch across heterogeneous encoder features. In this paper, we propose RefAlign, a representation alignment framework that explicitly aligns DiT reference-branch features to the semantic space of a visual foundation model (VFM). The core of RefAlign is a reference alignment loss that pulls the reference features and VFM features of the same subject closer to improve identity consistency, while pushing apart the corresponding features of different subjects to enhance semantic discriminability. This simple yet effective strategy is applied only during training, incurring no inference-time overhead, and achieves a better balance between text controllability and reference fidelity. Extensive experiments on the OpenS2V-Eval benchmark demonstrate that RefAlign outperforms current state-of-the-art methods in TotalScore, validating the effectiveness of explicit reference alignment for R2V tasks.

Summary

RefAlign targets copy-paste artifacts and multi-subject confusion in reference-to-video generation. It is a representation alignment framework that aligns DiT reference-branch features to the semantic space of a visual foundation model, using a reference alignment loss to improve identity consistency and semantic discriminability. The method achieves a better balance between text controllability and reference fidelity, and outperforms current state-of-the-art methods in TotalScore on the OpenS2V-Eval benchmark.
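
One way to picture a pull/push alignment loss of the kind described above is the toy objective below. It is only a sketch under stated assumptions: the cosine-similarity form, the `margin` parameter, and the equal weighting of the two terms are hypothetical choices, since the abstract does not give the exact formulation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(ref_feats, vfm_feats, margin=0.0):
    """ref_feats[i] and vfm_feats[i] are assumed to describe the same subject i."""
    n = len(ref_feats)
    # Pull term: same-subject pairs should have cosine similarity close to 1.
    pull = sum(1.0 - cosine(ref_feats[i], vfm_feats[i]) for i in range(n)) / n
    # Push term: different-subject pairs are penalized when similarity exceeds the margin.
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    push = sum(max(0.0, cosine(ref_feats[i], vfm_feats[j]) - margin) for i, j in pairs)
    push = push / len(pairs) if pairs else 0.0
    return pull + push


aligned = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = alignment_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is zero when each reference feature matches its own subject and is orthogonal to the other subject, and it is largest when the subjects are swapped, which is the multi-subject confusion case.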

Vega: Learning to Drive with Natural Language Instructions

Authors: Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu

First: 2026-03-26T17:59:56+00:00 · Latest: 2026-03-26T17:59:56+00:00

Comments: Code is available at https://github.com/zuosc19/Vega

Abstract

Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personalized driving. To address this, we first construct a large-scale driving dataset (InstructScene) containing around 100,000 scenes annotated with diverse driving instructions with the corresponding trajectories. We then propose a unified Vision-Language-World-Action model, Vega, for instruction-based generation and planning. We employ the autoregressive paradigm to process visual inputs (vision) and language instructions (language) and the diffusion paradigm to generate future predictions (world modeling) and trajectories (action). We perform joint attention to enable interactions between the modalities and use individual projection layers for different modalities for more capabilities. Extensive experiments demonstrate that our method not only achieves superior planning performance but also exhibits strong instruction-following abilities, paving the way for more intelligent and personalized driving systems.

Summary

The work aims to make autonomous driving follow natural language instructions. The authors construct InstructScene, a large-scale dataset of around 100,000 scenes annotated with diverse driving instructions and corresponding trajectories, and propose Vega, a unified Vision-Language-World-Action model. Vega processes visual and language inputs autoregressively and generates future predictions and trajectories with diffusion. Experiments show superior planning performance and strong instruction-following ability.

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

Authors: Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li

Venue: CVPR 2026

First: 2026-03-26T17:59:54+00:00 · Latest: 2026-03-26T17:59:54+00:00

Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026); Project website: https://dmw-cvpr.github.io/

Abstract

Human driving behavior is inherently personal, which is shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize for generic objectives or rely on fixed driving modes, lacking the ability to adapt to individual preferences or interpret natural language intent. To address this gap, we propose Drive My Way (DMW), a personalized Vision-Language-Action (VLA) driving framework that aligns with users' long-term driving habits and adapts to real-time user instructions. DMW learns a user embedding from our personalized driving dataset collected across multiple real drivers and conditions the policy on this embedding during planning, while natural language instructions provide additional short-term guidance. Closed-loop evaluation on the Bench2Drive benchmark demonstrates that DMW improves style instruction adaptation, and user studies show that its generated behaviors are recognizable as each driver's own style, highlighting personalization as a key capability for human-centered autonomous driving. Our data and code are available at https://dmw-cvpr.github.io/.

Summary

Drive My Way (DMW) addresses the lack of personalization in existing end-to-end autonomous driving systems. It is a Vision-Language-Action framework that learns a user embedding from a personalized driving dataset and conditions the policy on that embedding during planning, while natural language instructions provide real-time, short-term guidance. Experiments show that DMW improves adaptation to style instructions and generates behaviors that are recognizable as each driver's own style.

MegaFlow: Zero-Shot Large Displacement Optical Flow

Authors: Dingxi Zhang, Fangjinhua Wang, Marc Pollefeys, Haofei Xu

First: 2026-03-26T17:59:51+00:00 · Latest: 2026-03-26T17:59:51+00:00

Comments: Project Page: https://kristen-z.github.io/projects/megaflow Code: https://github.com/cvg/megaflow

Abstract

Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, we introduce MegaFlow, a simple yet powerful model for zero-shot large displacement optical flow. Rather than relying on highly complex, task-specific architectural designs, MegaFlow adapts powerful pre-trained vision priors to produce temporally consistent motion fields. In particular, we formulate flow estimation as a global matching problem by leveraging pre-trained global Vision Transformer features, which naturally capture large displacements. This is followed by a few lightweight iterative refinements to further improve the sub-pixel accuracy. Extensive experiments demonstrate that MegaFlow achieves state-of-the-art zero-shot performance across multiple optical flow benchmarks. Moreover, our model also delivers highly competitive zero-shot performance on long-range point tracking benchmarks, demonstrating its robust transferability and suggesting a unified paradigm for generalizable motion estimation. Our project page is at: https://kristen-z.github.io/projects/megaflow.

Summary

MegaFlow is a model for zero-shot large displacement optical flow that avoids the iterative local search and domain-specific fine-tuning that existing methods rely on. It formulates flow estimation as a global matching problem over pre-trained global Vision Transformer features, then applies a few lightweight iterative refinements to improve sub-pixel accuracy. It achieves state-of-the-art zero-shot performance on multiple optical flow benchmarks and highly competitive zero-shot performance on long-range point tracking benchmarks, which suggests robust transferability.
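
The idea of treating flow as global matching can be illustrated with a brute-force toy. This is a sketch, not MegaFlow's method: it uses a hard argmax over dot-product scores on tiny hand-made feature grids, omits the refinement stage, and the function name `global_match_flow` is hypothetical.

```python
def global_match_flow(feat1, feat2):
    """feat1, feat2: H x W grids (nested lists) of feature vectors.

    Returns an H x W grid of (dx, dy) displacements, one per source location.
    """
    h, w = len(feat1), len(feat1[0])
    flow = []
    for y in range(h):
        row = []
        for x in range(w):
            source = feat1[y][x]
            best_score, best_pos = None, (x, y)
            # Search every target location, so large displacements cost nothing extra.
            for yy in range(h):
                for xx in range(w):
                    score = sum(a * b for a, b in zip(source, feat2[yy][xx]))
                    if best_score is None or score > best_score:
                        best_score, best_pos = score, (xx, yy)
            row.append((best_pos[0] - x, best_pos[1] - y))
        flow.append(row)
    return flow


def one_hot(i, n=4):
    return [1.0 if k == i else 0.0 for k in range(n)]


# A 1 x 4 "image" whose content moves two pixels to the right (wrapping around).
frame1 = [[one_hot(x) for x in range(4)]]
frame2 = [[one_hot((x - 2) % 4) for x in range(4)]]
flow = global_match_flow(frame1, frame2)
print(flow)
```

Because every source location is compared against every target location, a displacement of half the image width is found as easily as a one-pixel shift, at the price of quadratic cost in the number of locations.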

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

Authors: Yuxing Lu, Xukai Zhao, Wei Wu, Jinzhuo Wang

First: 2026-03-26T17:59:49+00:00 · Latest: 2026-03-26T17:59:49+00:00

Comments: 15 pages

Abstract

The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable component and propose WriteBack-RAG, a framework that uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. Because the method modifies only the corpus, it can be applied once as an offline preprocessing step and combined with any RAG pipeline. Across four RAG methods, six benchmarks, and two LLM backbones, WriteBack-RAG improves every evaluated setting, with gains averaging +2.14%. Cross-method transfer experiments further show that the distilled knowledge benefits RAG pipelines other than the one used to produce it, confirming that the improvement resides in the corpus itself.

Summary

The paper addresses static knowledge bases in RAG systems, which are typically assembled once and never revised even though the facts a query needs are often fragmented across documents and buried in irrelevant content. WriteBack-RAG uses labeled examples to identify where retrieval succeeds, isolate the relevant documents, and distill them into compact knowledge units that are indexed alongside the original corpus. The method improves every evaluated setting across four RAG methods, six benchmarks, and two LLM backbones, with an average gain of +2.14%. The distilled knowledge also benefits RAG pipelines other than the one used to produce it, indicating that the improvement resides in the corpus itself.
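
The write-back loop described above can be sketched end to end with toy components. All of the pieces are stand-ins chosen for illustration: the word-overlap retriever, the success test (the gold answer string appears in a retrieved document), and the sentence-filter `distill` function are assumptions, and the abstract does not specify how the actual system implements any of them.

```python
def retrieve(corpus, query, k=2):
    """Toy lexical retriever: rank documents by shared lowercase words."""
    query_words = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(query_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]


def distill(docs, answer):
    """Stand-in for distillation: keep only sentences that mention the answer."""
    kept = []
    for doc in docs:
        for sentence in doc.split("."):
            if answer.lower() in sentence.lower():
                kept.append(sentence.strip() + ".")
    return " ".join(kept)


def write_back(corpus, labeled_examples, k=2):
    """Return the corpus plus compact knowledge units; the input list is not modified."""
    enriched = list(corpus)
    for question, answer in labeled_examples:
        hits = retrieve(corpus, question, k)
        # Assumed success test: a retrieved document contains the gold answer.
        relevant = [doc for doc in hits if answer.lower() in doc.lower()]
        if relevant:
            unit = distill(relevant, answer)
            if unit and unit not in enriched:
                enriched.append(unit)
    return enriched


corpus = [
    "Paris is the capital of France. It hosts many museums. The weather is mild.",
    "Berlin is the capital of Germany. It has many lakes.",
]
examples = [("What is the capital of France?", "Paris")]
enriched = write_back(corpus, examples)
print(enriched[-1])
```

Since only the corpus changes, the enriched list can be indexed by any retriever afterwards, which mirrors the claim that the method works as an offline preprocessing step.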

SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

Authors: Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi

Venue: CVPR 2026

First: 2026-03-26T17:59:31+00:00 · Latest: 2026-03-26T17:59:31+00:00

Comments: Accepted to GRAIL-V workshop at CVPR 2026

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memorize dataset-specific shortcuts rather than faithfully grounding in the actual visual content, leading to poor Out-of-Domain (OOD) generalization. Object-centric learning offers a promising remedy by decomposing scenes into entity-level representations, but existing approaches require re-running the entire multi-stage training pipeline from scratch. We propose SlotVTG, a framework that steers MLLMs toward object-centric, input-grounded visual reasoning at minimal cost. SlotVTG introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, where objectness priors from a self-supervised vision model encourage semantically coherent slot formation. Cross-domain evaluation on standard VTG benchmarks demonstrates that our approach significantly improves OOD robustness while maintaining competitive In-Domain (ID) performance with minimal overhead.

Summary

SlotVTG addresses a limitation of Multimodal Large Language Models (MLLMs) on Video Temporal Grounding (VTG): task-specific fine-tuning leads models to memorize dataset-specific shortcuts rather than ground predictions in the visual content. It introduces a lightweight slot adapter that decomposes visual tokens into abstract slots via slot attention and reconstructs the original sequence, with objectness priors from a self-supervised vision model encouraging semantically coherent slots. Cross-domain experiments show significantly improved Out-of-Domain (OOD) robustness with competitive In-Domain (ID) performance and minimal overhead.
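
For readers unfamiliar with slot attention, the core update can be sketched in a few lines. This is a simplified single step, not the SlotVTG adapter: it has no learned projections or recurrent update, it omits the reconstruction of the token sequence and the objectness prior, and the function names are hypothetical.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention_step(slots, tokens, eps=1e-8):
    """One simplified slot-attention update over plain Python lists."""
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[t][s]: the softmax runs over slots, so slots compete for each token.
    attn = [
        softmax([scale * sum(a * b for a, b in zip(token, slot)) for slot in slots])
        for token in tokens
    ]
    new_slots = []
    for s in range(len(slots)):
        weights = [attn[t][s] for t in range(len(tokens))]
        total = sum(weights) + eps
        # Each slot becomes the attention-weighted mean of the tokens it claimed.
        new_slots.append([
            sum(w * token[j] for w, token in zip(weights, tokens)) / total
            for j in range(dim)
        ])
    return new_slots, attn


tokens = [[4.0, 0.0], [5.0, 0.0], [0.0, 4.0], [0.0, 5.0]]
slots = [[1.0, 0.0], [0.0, 1.0]]
new_slots, attn = slot_attention_step(slots, tokens)
print(new_slots)
```

After one step the first slot is dominated by the two tokens along the first axis and the second slot by the other two, which is the entity-level grouping that object-centric learning aims for.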

BizGenEval: A Systematic Benchmark for Commercial Visual Content Generation

Authors: Yan Li, Zezi Zeng, Ziwei Zhou, Xin Gao, Muzhao Tian, Yifan Yang, Mingxi Cheng, Qi Dai, Yuqing Yang, Lili Qiu, Zhendong Wang, Zhengyuan Yang, Xue Yang, Lijuan Wang, Ji Li, Chong Luo

First: 2026-03-26T17:59:16+00:00 · Latest: 2026-03-26T17:59:16+00:00

Abstract

Recent advances in image generation models have expanded their applications beyond aesthetic imagery toward practical visual content creation. However, existing benchmarks mainly focus on natural image synthesis and fail to systematically evaluate models under the structured and multi-constraint requirements of real-world commercial design tasks. In this work, we introduce BizGenEval, a systematic benchmark for commercial visual content generation. The benchmark spans five representative document types: slides, charts, webpages, posters, and scientific figures, and evaluates four key capability dimensions: text rendering, layout control, attribute binding, and knowledge-based reasoning, forming 20 diverse evaluation tasks. BizGenEval contains 400 carefully curated prompts and 8000 human-verified checklist questions to rigorously assess whether generated images satisfy complex visual and semantic constraints. We conduct large-scale benchmarking on 26 popular image generation systems, including state-of-the-art commercial APIs and leading open-source models. The results reveal substantial capability gaps between current generative models and the requirements of professional visual content creation. We hope BizGenEval serves as a standardized benchmark for real-world commercial visual content generation.

Summary

BizGenEval is a systematic benchmark for commercial visual content generation. It covers five document types and four capability dimensions, forming 20 evaluation tasks, and includes 400 prompts and 8000 checklist questions that test whether generated images satisfy complex visual and semantic constraints. Benchmarking 26 popular image generation systems reveals substantial gaps between current models and professional design requirements.
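
The benchmark's task grid follows directly from the numbers in the abstract. The short sketch below enumerates it; the per-task and per-prompt figures are averages derived by division, since the abstract does not state that prompts or questions are evenly distributed.

```python
from itertools import product

DOC_TYPES = ["slides", "charts", "webpages", "posters", "scientific figures"]
CAPABILITIES = ["text rendering", "layout control", "attribute binding",
                "knowledge-based reasoning"]
NUM_PROMPTS = 400
NUM_QUESTIONS = 8000

# Every (document type, capability) pair is one evaluation task.
tasks = list(product(DOC_TYPES, CAPABILITIES))

prompts_per_task = NUM_PROMPTS / len(tasks)          # average, not a stated split
questions_per_prompt = NUM_QUESTIONS / NUM_PROMPTS   # average, not a stated split

print(len(tasks), prompts_per_task, questions_per_prompt)
```

Five document types times four capabilities gives the 20 tasks, with an average of 20 prompts per task and 20 checklist questions per prompt.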

PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang

First: 2026-03-26T17:59:05+00:00 · Latest: 2026-03-26T17:59:05+00:00

Abstract

Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve a massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-k context selection mechanism for the mid tokens, coupled with a continuous Temporal RoPE Adjustment that seamlessly re-aligns position gaps caused by dropped tokens with negligible overhead. Empowered by this principled hierarchical context compression, PackForcing can generate coherent 2-minute, 832x480 videos at 16 FPS on a single H200 GPU. It achieves a bounded KV cache of just 4 GB and enables a remarkable 24x temporal extrapolation (5s to 120s), operating effectively either zero-shot or trained on merely 5-second clips. Extensive results on VBench demonstrate state-of-the-art temporal consistency (26.07) and dynamic degree (56.25), proving that short-video supervision is sufficient for high-quality, long-video synthesis. Code: https://github.com/ShandaAI/PackForcing

Summary

PackForcing addresses long-video generation in autoregressive video diffusion models with a three-partition KV-cache strategy. It divides the historical context into sink, mid, and recent tokens to bound memory while maintaining quality. On a single H200 GPU it generates 2-minute videos at 16 FPS with a 4 GB KV cache and achieves 24x temporal extrapolation, indicating that short-video training is sufficient for high-quality long-video synthesis.
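
The three-partition idea can be sketched on a list of frame indices. This is an illustration under assumptions, not the PackForcing implementation: the partition sizes, the relevance scores, and the plain re-indexing used in place of the Temporal RoPE Adjustment are hypothetical, and the 32x compression of mid tokens is omitted.

```python
def partition_history(frames, n_sink=2, n_recent=3):
    """Split history into sink (earliest), mid (middle), and recent (latest) parts."""
    split = max(n_sink, len(frames) - n_recent)
    return frames[:n_sink], frames[n_sink:split], frames[split:]


def select_mid(mid, scores, k):
    """Keep the k highest-scoring mid entries while preserving temporal order."""
    ranked = sorted(range(len(mid)), key=lambda i: scores[i], reverse=True)[:k]
    return [mid[i] for i in sorted(ranked)]


frames = list(range(10))                  # ten past frames, oldest first
sink, mid, recent = partition_history(frames)
mid_scores = [0.1, 0.9, 0.3, 0.8, 0.2]    # hypothetical relevance of each mid frame
context = sink + select_mid(mid, mid_scores, k=2) + recent

# Stand-in for position re-alignment: kept entries get contiguous positions.
positions = list(range(len(context)))

extrapolation = 120 / 5                   # 5-second training clips, 120-second output
print(context, positions, extrapolation)
```

The context size stays fixed at `n_sink + k + n_recent` however long the history grows, which is what bounds the cache; the last line reproduces the 24x extrapolation figure quoted in the abstract.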

Back to Basics: Revisiting ASR in the Age of Voice Agents

Authors: Geeyang Tay, Wentao Ma, Jaewon Lee, Yuzhi Tang, Daniel Lee, Weisu Yin, Dongming Shen, Silin Meng, Yi Zhu, Mu Li, Alex Smola

First: 2026-03-26T17:59:03+00:00 · Latest: 2026-03-26T17:59:03+00:00

Comments: 10 pages, 5 figures

Abstract

Automatic speech recognition (ASR) systems have achieved near-human accuracy on curated benchmarks, yet still fail in real-world voice agents under conditions that current evaluations do not systematically cover. Without diagnostic tools that isolate specific failure factors, practitioners cannot anticipate which conditions, in which languages, will cause what degree of degradation. We introduce WildASR, a multilingual (four-language) diagnostic benchmark sourced entirely from real human speech that factorizes ASR robustness along three axes: environmental degradation, demographic shift, and linguistic diversity. Evaluating seven widely used ASR systems, we find severe and uneven performance degradation, and model robustness does not transfer across languages or conditions. Critically, models often hallucinate plausible but unspoken content under partial or degraded inputs, creating concrete safety risks for downstream agent behavior. Our results demonstrate that targeted, factor-isolated evaluation is essential for understanding and improving ASR reliability in production systems. Besides the benchmark itself, we also present three analytical tools that practitioners can use to guide deployment decisions.

Summary

The study examines why ASR systems fail in real-world voice agents and introduces WildASR, a multilingual diagnostic benchmark. Seven ASR systems are evaluated along three axes: environmental degradation, demographic shift, and linguistic diversity. Performance degrades severely and unevenly across conditions and languages, and models often hallucinate plausible but unspoken content under partial or degraded inputs, which poses safety risks for downstream agents. The study argues that targeted, factor-isolated evaluation is needed to improve ASR reliability in production systems.

AnyHand: A Large-Scale Synthetic Dataset for RGB(-D) Hand Pose Estimation

Authors: Chen Si, Yulin Liu, Bo Ai, Jianwen Xie, Rolandos Alexandros Potamias, Chuanxia Zheng, Hao Su

First: 2026-03-26T17:58:54+00:00 · Latest: 2026-03-26T17:58:54+00:00

Abstract

We present AnyHand, a large-scale synthetic dataset designed to advance the state of the art in 3D hand pose estimation from both RGB-only and RGB-D inputs. While recent works with foundation approaches have shown that an increase in the quantity and diversity of training data can markedly improve performance and robustness in hand pose estimation, existing real-world-collected datasets on this task are limited in coverage, and prior synthetic datasets rarely provide occlusions, arm details, and aligned depth together at scale. To address this bottleneck, our AnyHand contains 2.5M single-hand and 4.1M hand-object interaction RGB-D images, with rich geometric annotations. In the RGB-only setting, we show that extending the original training sets of existing baselines with AnyHand yields significant gains on multiple benchmarks (FreiHAND and HO-3D), even when keeping the architecture and training scheme fixed. More impressively, the model trained with AnyHand shows stronger generalization to the out-of-domain HO-Cap dataset, without any fine-tuning. We also contribute a lightweight depth fusion module that can be easily integrated into existing RGB-based models. Trained with AnyHand, the resulting RGB-D model achieves superior performance on the HO-3D benchmark, showing the benefits of depth integration and the effectiveness of our synthetic data.

中文标题/摘要

标题：AnyHand：用于RGB(-D)手部姿态估计的大规模合成数据集

我们提出了AnyHand，一个大规模合成数据集，旨在推动基于RGB和RGB-D输入的手部姿态估计技术的发展。尽管最近使用基础方法的研究表明，训练数据的数量和多样性增加可以显著提高手部姿态估计的性能和鲁棒性，但现有的真实世界收集的数据集在覆盖范围上有限，而先前的合成数据集很少能同时提供遮挡、手臂细节和对齐的深度信息。为了解决这一瓶颈，我们的AnyHand包含250万张单手和410万张手物交互的RGB-D图像，具有丰富的几何注释。在仅RGB设置中，我们展示了将现有基线的原始训练集扩展到AnyHand，即使保持架构和训练方案不变，也能在多个基准（FreiHAND和HO-3D）上获得显著的性能提升。更令人印象深刻的是，使用AnyHand训练的模型在未进行任何微调的情况下，对HO-Cap数据集的泛化能力更强。我们还贡献了一个轻量级的深度融合模块，可以轻松集成到现有的RGB模型中。使用AnyHand训练的RGB-D模型在HO-3D基准上表现出色，展示了深度集成的好处以及我们合成数据的有效性。

Summary / 总结

AnyHand is a large-scale synthetic dataset for RGB(-D) hand pose estimation that addresses the limited coverage of existing datasets by providing occlusions, arm details, and aligned depth together at scale. It contains 2.5 million single-hand and 4.1 million hand-object interaction RGB-D images with rich geometric annotations. Adding AnyHand to the training sets of existing baselines yields significant gains on FreiHAND and HO-3D with the architecture and training scheme unchanged, and the resulting models generalize better to the out-of-domain HO-Cap dataset without any fine-tuning. A lightweight depth fusion module, trained with AnyHand, gives an RGB-D model with superior performance on HO-3D.

AnyHand 是一个大规模合成数据集，用于从 RGB 和 RGB-D 输入中进行 3D 手部姿态估计。它通过提供大量包含遮挡和手臂细节的图像来解决现有数据集的限制。该数据集在 FreiHAND 和 HO-3D 等基准测试中显著提高了性能，并且使用 AnyHand 训练的模型在 HO-Cap 数据集上表现出强大的泛化能力，无需微调。此外，还提出了一种轻量级的深度融合模块，可以轻松集成到现有的 RGB 模型中，从而在 HO-3D 上取得了更好的性能。

Natural-Language Agent Harnesses

Authors: Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

First: 2026-03-26T17:58:15+00:00 · Latest: 2026-03-26T17:58:15+00:00

Comments: under review

Abs · PDF · Code1 · Code2

Abstract

Agent performance increasingly depends on harness engineering, yet harness design is usually buried in controller code and runtime-specific conventions, making it hard to transfer, compare, and study as a scientific object. We ask whether the high-level control logic of an agent harness can instead be externalized as a portable executable artifact. We introduce Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes these harnesses through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, we conduct controlled evaluations of operational viability, module ablation, and code-to-text harness migration.

中文标题/摘要

标题：自然语言代理 harness

代理性能越来越多地取决于 harness 工程设计，然而 harness 设计通常埋藏在控制器代码和特定运行时约定中，使其难以转移、比较和作为科学对象进行研究。我们询问是否可以将代理 harness 的高级控制逻辑外部化为可移植的可执行构件。我们引入了自然语言代理 harness（NLAHs），它用可编辑的自然语言表达 harness 行为，并引入了智能 harness 运行时（IHR），这是一种共享运行时，通过明确的合同、持久的构件和轻量级适配器来执行这些 harness。在编程和计算机使用基准测试中，我们对操作可行性、模块消融和代码到文本 harness 迁移进行了受控评估。

Summary / 总结

The research asks whether the high-level control logic of an agent harness can be externalized as a portable, executable natural-language artifact, so that harness design becomes easier to transfer, compare, and study. It introduces Natural-Language Agent Harnesses (NLAHs), which express harness behavior in editable natural language, and Intelligent Harness Runtime (IHR), a shared runtime that executes them through explicit contracts, durable artifacts, and lightweight adapters. Across coding and computer-use benchmarks, the authors run controlled evaluations of operational viability, module ablation, and code-to-text harness migration.

研究旨在将代理控制逻辑外部化为可编辑的自然语言制品，以促进科学上的研究和比较。研究引入了自然语言代理控制（NLAH）和智能控制运行时（IHR），使控制逻辑能够以可编辑的自然语言形式表达并通过明确的合同、持久的制品和轻量级适配器执行。跨编码和计算机使用基准的评估显示了NLAH的可操作性、模块的重要性以及代码到文本控制逻辑迁移的可行性。

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Authors: Hai X. Pham, David T. Hoffmann, Ricardo Guerrero, Brais Martinez

Venue: CVPR 2026

First: 2026-03-26T17:58:04+00:00 · Latest: 2026-03-26T17:58:04+00:00

Comments: Accepted at CVPR 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/SamsungLabs/concept_centric_clip.
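The parameter-free cross-modal attention-pooling mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the concept's text embedding acts as the query and the image patch embeddings act as both keys and values, with plain dot-product scores and no learned projections. The function name `attention_pool` and the `temperature` argument are hypothetical.

```python
import math

def attention_pool(query, patches, temperature=1.0):
    """Pool patch embeddings using a text-concept embedding as the query.

    Assumption: no learned weights; scores are raw dot products.
    Returns the pooled embedding and the attention weights.
    """
    scores = [sum(q * p for q, p in zip(query, patch)) / temperature
              for patch in patches]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    pooled = [sum(w * patch[i] for w, patch in zip(weights, patches))
              for i in range(dim)]
    return pooled, weights

# Toy example: three 2-D patch embeddings, one of which matches the concept.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]
pooled, weights = attention_pool(query, patches)
```

In this toy case the pooled vector is dominated by the first patch, whereas a global mean over the patches would be dominated by the other two. That contrast is the intuition behind replacing global pooling with concept-conditioned pooling.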

中文标题/摘要

标题：无需硬负样本：概念中心学习促进组合性而不损害对比模型的零样本能力

对比视觉-语言（V&L）模型仍然是各种应用的热门选择。然而，这些模型在学习组合性表示方面的能力有限。先前的方法通常通过生成自定义训练数据来获得硬负样本来解决这一限制。硬负样本已被证明在组合性任务上可以提高性能，但它们往往是针对单一基准的，不具有一般性，并且可能导致基本的V&L能力（如零样本或检索性能）大幅下降，使其变得不切实际。在本工作中，我们采取了不同的方法。我们识别出限制V&L组合性性能的两个根本原因：1）长训练描述词不需要组合性表示；2）文本和图像编码器中的最终全局池化导致完全丢失了学习绑定所需的信息。作为补救措施，我们提出了两个简单的解决方案：1）我们使用标准NLP软件获得短的概念中心描述词部分，并将其与图像对齐；2）我们引入了一个无参数的跨模态注意力池化，从图像编码器中获得概念中心的视觉嵌入。通过这两个改变和简单的辅助对比损失，我们在标准组合性基准上获得了SOTA性能，同时保持或提高了强大的零样本和检索能力。这不会增加推理成本。我们将在https://github.com/SamsungLabs/concept_centric_clip/发布此工作的代码。

Summary / 总结

This work addresses the limitation of contrastive vision-language models in learning compositional representations by proposing a concept-centric learning approach. Instead of relying on hard negative samples, the authors identify two root causes and provide two simple solutions: short concept-centric caption parts and a parameter-free cross-modal attention-pooling. This approach achieves state-of-the-art performance on compositionality benchmarks while maintaining strong zero-shot and retrieval capabilities without increasing inference cost.

该研究通过提出概念中心化的学习方法来解决对比视觉-语言模型在学习组合表示方面的局限性。方法包括使用短的概念中心化标题片段和无参数的跨模态注意力池化来获取概念中心化的视觉嵌入，而不生成硬负样本。该模型在组合性基准测试上取得了最先进的性能，同时保持了强大的零样本和检索能力，且未增加推理成本。

R-C2: Cycle-Consistent Reinforcement Learning Improves Multimodal Reasoning

Authors: Zirui Zhang, Haoyu Dong, Kexin Pei, Chengzhi Mao

First: 2026-03-26T17:58:04+00:00 · Latest: 2026-03-26T17:58:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Robust perception and reasoning require consistency across sensory modalities. Yet current multimodal models often violate this principle, yielding contradictory predictions for visual and textual representations of the same concept. Rather than masking these failures with standard voting mechanisms, which can amplify systematic biases, we show that cross-modal inconsistency provides a rich and natural signal for learning. We introduce RC2, a reinforcement learning framework that resolves internal conflicts by enforcing cross-modal cycle consistency. By requiring a model to perform backward inference, switch modalities, and reliably reconstruct the answer through forward inference, we obtain a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously. Optimizing for this structure mitigates modality-specific errors and improves reasoning accuracy by up to 7.6 points. Our results suggest that advanced reasoning emerges not only from scaling data, but also from enforcing a structurally consistent understanding of the world.

中文标题/摘要

标题：R-C2：循环一致强化学习提高多模态推理

稳健的感知和推理需要跨感官模态的一致性。然而，当前的多模态模型往往违反这一原则，在同一概念的视觉和文本表示之间产生矛盾的预测。我们不通过标准的投票机制掩盖这些失败，因为这可能会放大系统性偏差，而是展示了跨模态不一致为学习提供了丰富而自然的信号。我们引入了RC2，这是一种通过强制执行跨模态循环一致性来解决内部冲突的强化学习框架。通过要求模型进行反向推理、切换模态，并可靠地通过正向推理重建答案，我们获得了一个密集的、无需标签的奖励。这种循环约束促使模型自主对齐其内部表示。优化这种结构可以减轻模态特定的错误，并将推理准确性提高多达7.6个百分点。我们的结果表明，高级推理不仅来自于数据的扩展，还来自于对世界的一致性结构理解的强制执行。

Summary / 总结

The research aims to improve multimodal reasoning by enforcing consistency across sensory modalities. It introduces R-C2, a reinforcement learning framework that resolves internal conflicts through cross-modal cycle consistency: the model performs backward inference, switches modalities, and must reconstruct the answer through forward inference, which yields a dense, label-free reward. This cyclic constraint encourages the model to align its internal representations autonomously and improves reasoning accuracy by up to 7.6 points.

研究旨在通过确保不同感官模态的一致性来提高多模态推理能力。方法是使用一种名为R-C2的强化学习框架，通过强制执行循环一致性来解决内部冲突并对齐内部表示。关键实验发现表明，这种方法可以将推理准确性提高多达7.6个百分点，表明强制执行结构一致性对于高级推理至关重要。

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

Authors: Abhishek Bhandwaldar, Mihir Choudhury, Ruchir Puri, Akash Srivastava

First: 2026-03-26T17:57:50+00:00 · Latest: 2026-03-26T17:57:50+00:00

Abs · PDF · Code1 · Code2

Abstract

We present an empirical study of how far general-purpose coding agents, without hardware-specific training, can optimize hardware designs from high-level algorithmic specifications. We introduce an agent factory, a two-stage pipeline that constructs and coordinates multiple autonomous optimization agents. In Stage 1, the pipeline decomposes a design into sub-kernels, independently optimizes each using pragma and code-level transformations, and formulates an Integer Linear Program (ILP) to assemble globally promising configurations under an area constraint. In Stage 2, it launches N expert agents over the top ILP solutions, each exploring cross-function optimizations such as pragma recombination, loop fusion, and memory restructuring that are not captured by sub-kernel decomposition. We evaluate the approach on 12 kernels from HLS-Eval and Rodinia-HLS using Claude Code (Opus 4.5/4.6) with AMD Vitis HLS. Scaling from 1 to 10 agents yields a mean 8.27× speedup over baseline, with larger gains on harder benchmarks: streamcluster exceeds 20× and kmeans reaches approximately 10×. Across benchmarks, agents consistently rediscover known hardware optimization patterns without domain-specific training, and the best designs often do not originate from top-ranked ILP candidates, indicating that global optimization exposes improvements missed by sub-kernel search. These results establish agent scaling as a practical and effective axis for HLS optimization.
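The Stage 1 assembly step (choosing one configuration per sub-kernel under an area constraint) can be sketched as a small selection problem. The sketch below is an illustration only: it uses exhaustive search rather than an ILP solver, assumes that total latency is the sum of sub-kernel latencies, and uses made-up latency and area numbers. The function name `best_assembly` is hypothetical.

```python
from itertools import product

def best_assembly(options, area_budget):
    """Pick one (latency, area) option per sub-kernel.

    Minimizes total latency subject to total area <= area_budget.
    Exhaustive search stands in for an ILP solver; returns None if
    no combination fits the budget.
    """
    names = list(options)
    best = None
    for combo in product(*(options[name] for name in names)):
        latency = sum(lat for lat, _ in combo)
        area = sum(ar for _, ar in combo)
        if area <= area_budget and (best is None or latency < best[0]):
            best = (latency, dict(zip(names, combo)))
    return best

# Hypothetical per-sub-kernel candidates as (latency, area) pairs.
options = {
    "load":    [(100, 10), (60, 25)],
    "compute": [(400, 20), (150, 60), (90, 120)],
    "store":   [(80, 10), (50, 30)],
}
result = best_assembly(options, area_budget=100)
```

With a budget of 100, the fastest `compute` option does not fit, so the search settles on the middle one and spends the remaining area on `load`. A real ILP formulation would express the same choice with one binary variable per option, and the abstract indicates that several top solutions, not only the best one, are handed to Stage 2.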

中文标题/摘要

标题：用于高级综合的代理工厂：通用编码代理在硬件优化中的潜力有多远？

我们对通用编码代理（无需硬件特定训练）如何从高级算法规范优化硬件设计进行了实证研究。我们引入了一个代理工厂，这是一个两阶段流水线，用于构建和协调多个自主优化代理。在第一阶段，流水线将设计分解为子内核，独立地使用pragma和代码级变换优化每个子内核，并通过面积约束下的整数线性规划（ILP）来组装具有全局潜力的配置。在第二阶段，它在ILP解决方案上启动N个专家代理，每个代理探索子内核分解无法捕捉的跨功能优化，如pragma重组、循环融合和内存重构。我们使用Claude Code（Opus 4.5/4.6）和AMD Vitis HLS在HLS-Eval和Rodinia-HLS的12个内核上评估了该方法。从1个代理扩展到10个代理，相对于基线的平均加速比为8.27倍，在更难的基准上提升更大：streamcluster超过20倍，kmeans达到约10倍。在所有基准中，代理一致地重新发现已知的硬件优化模式，而无需特定领域的训练，最佳设计往往不是来自ILP候选的顶级选项，这表明全局优化揭示了子内核搜索遗漏的改进。这些结果确立了代理扩展作为高级综合优化的实际和有效轴心。

Summary / 总结

This study investigates how far general-purpose coding agents can optimize hardware designs from high-level specifications without hardware-specific training. The approach uses a two-stage agent factory: the first stage decomposes the design into sub-kernels, optimizes each independently, and formulates an ILP to assemble promising configurations under an area constraint; the second stage launches multiple expert agents to explore cross-function optimizations. On 12 kernels, scaling from 1 to 10 agents yields a mean 8.27× speedup over the baseline, with larger gains on harder benchmarks, and the agents rediscover known hardware optimization patterns without domain-specific training.

研究探讨了通用编码代理在无需特定硬件训练的情况下，如何从高层次规格优化硬件设计的能力。方法使用一个包含两阶段的代理工厂：第一阶段将设计分解为子内核，分别优化并形成整数线性规划（ILP）进行全局优化，第二阶段则启动多个专家代理探索跨功能优化。在12个内核上的评估显示，该方法相对于基线方法平均提高了8.27倍的速度，对于更难的基准，提升更大，并且能够在无需领域特定训练的情况下重新发现已知的硬件优化模式。

The Landscape of AI in Science Education: What is Changing and How to Respond

Authors: Xiaoming Zhai, Kent Crippen

First: 2026-02-08T23:54:23+00:00 · Latest: 2026-03-26T17:55:56+00:00

Abs · PDF · Code1 · Code2

Abstract

This introductory chapter explores the transformative role of artificial intelligence (AI) in reshaping the landscape of science education. Positioned at the intersection of tradition and innovation, AI is altering educational goals, procedures, learning materials, assessment practices, and desired outcomes. We highlight how AI-supported tools, such as intelligent tutoring systems, adaptive learning platforms, automated feedback, and generative content creation, enhance personalization, efficiency, and equity while fostering competencies essential for an AI-driven society, including critical thinking, creativity, and interdisciplinary collaboration. At the same time, this chapter examines the ethical, social, and pedagogical challenges that arise, particularly issues of fairness, transparency, accountability, privacy, and human oversight. To address these tensions, we argue that a Responsible and Ethical Principles (REP) framework is needed to offer guidance for aligning AI integration with values of fairness, scientific integrity, and democratic participation. Through this lens, we synthesize the changes brought to each of the five transformative aspects and the approaches introduced to meet the changes according to the REP framework. We argue that AI should be viewed not as a replacement for human teachers and learners but as a partner that supports inquiry, enriches assessment, and expands access to authentic scientific practices. Beyond what is changing, we conclude by exploring the roles that remain uniquely human: serving as moral and relational anchors in classrooms, bringing interpretive and ethical judgement, fostering creativity, imagination, and curiosity, and co-constructing meaning through dialogue and community. We assert that these qualities must remain central if AI is to advance equity, integrity, and human flourishing in science education.

中文标题/摘要

标题：人工智能在科学教育中的景观：变化与应对

本章探讨了人工智能（AI）在重塑科学教育景观中的变革性作用。AI 位于传统与创新的交汇处，正在改变教育目标、程序、学习材料、评估实践和期望成果。我们强调了 AI 支持的工具，如智能辅导系统、自适应学习平台、自动化反馈和生成内容的增强作用，这些工具提高了个性化、效率和公平性，同时培养了对于人工智能驱动社会至关重要的批判性思维、创造力和跨学科合作能力。同时，本章还探讨了由此产生的伦理、社会和教学挑战，特别是公平性、透明度、问责制、隐私和人类监督等问题。为应对这些矛盾，我们认为需要一个负责任和伦理原则（REP）框架来指导 AI 集成与公平、科学诚信和民主参与价值观的契合。通过这一视角，我们综合了对每个五个变革方面的变化及其根据 REP 框架提出的方法的分析。我们认为，AI 不应被视为人类教师和学习者的替代品，而应被视为支持探究、丰富评估并扩大对真实科学实践访问的伙伴。除了变化之外，我们还探讨了人类角色的独特性，这些角色作为道德和关系的锚点，在课堂上发挥着作用，提供解释和伦理判断，激发创造力、想象力和好奇心，并通过对话和社区共同构建意义，我们认为这些品质必须保持核心地位，以便 AI 能够促进科学教育中的公平、诚信和人类繁荣。

Summary / 总结

This chapter examines how artificial intelligence (AI) is transforming science education by enhancing personalization, efficiency, and equity through tools like intelligent tutoring systems and adaptive learning platforms. It also addresses ethical challenges such as fairness and transparency, proposing a Responsible and Ethical Principles (REP) framework to guide AI integration. The study argues that AI should complement rather than replace human teachers and learners, supporting inquiry and expanding access to scientific practices while preserving uniquely human roles such as moral and relational anchoring and fostering creativity.

本章探讨了人工智能（AI）如何通过改变教育目标、流程和材料来重塑科学教育，并通过智能辅导系统和自适应学习平台等工具增强个性化和公平性。同时，它还讨论了公平性和透明度等伦理挑战，并提出了一种负责任和伦理原则（REP）框架来指导AI的整合。本章认为，AI应与人类教师互补而非替代，强调创造力和伦理判断等人类特质在促进科学教育中的重要性，以推动教育公平、诚信和人类福祉的发展。

Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Authors: Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

First: 2026-03-26T17:53:49+00:00 · Latest: 2026-03-26T17:53:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet the architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the intended multimodal task. Consequently, we reinterpret hallucination as a localized optimization error, a phenomenon where the decoder exploits language shortcuts to maximize a proxy score at the expense of visual grounding. To address this objective mismatch, we introduce VISAGE, a training-free decoding framework that calibrates the objective at inference time. VISAGE estimates the proxy discrepancy by quantifying the spatial entropy of cross-attention distributions. By enforcing a localization consensus across attention heads, the method penalizes spatially uniform distributions and re-ranks token commitments to favor visually grounded outcomes. We provide an analytical stability guarantee establishing that VISAGE maintains a bounded objective loss under estimation error. Evaluations across hallucination-sensitive and general-purpose benchmarks demonstrate the robustness of the framework, yielding relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.
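The idea of penalizing spatially uniform cross-attention when ranking candidate tokens can be sketched as follows. This is a hedged illustration, not VISAGE itself: it assumes each candidate comes with one attention distribution over image patches per head, averages a normalized Shannon entropy across heads as a stand-in for the localization consensus, and subtracts it from the language log-probability with a hypothetical weight `lam`.

```python
import math

def normalized_entropy(dist):
    """Shannon entropy of a distribution, scaled to [0, 1]."""
    if len(dist) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in dist if p > 0.0)
    return h / math.log(len(dist))

def rerank(candidates, lam=1.0):
    """Sort candidates by log-probability minus an entropy penalty.

    Assumption: each candidate is a dict with "log_prob" and
    "attention_heads" (one distribution over image patches per head).
    """
    def score(cand):
        heads = cand["attention_heads"]
        mean_entropy = sum(normalized_entropy(h) for h in heads) / len(heads)
        return cand["log_prob"] - lam * mean_entropy
    return sorted(candidates, key=score, reverse=True)

# Toy candidates: "dog" is likelier under the language prior but its
# attention is uniform; "cat" is less likely but sharply localized.
candidates = [
    {"token": "dog", "log_prob": -0.5,
     "attention_heads": [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]},
    {"token": "cat", "log_prob": -0.9,
     "attention_heads": [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]},
]
ranked = [c["token"] for c in rerank(candidates, lam=1.0)]
```

With `lam=0.0` the ranking reduces to the language-only ordering, which is the behavior the abstract identifies as the source of the objective mismatch.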

中文标题/摘要

标题：视觉关注以确保真实：视觉注意在幻觉鲁棒性MDLLMs中的应用

多模态扩散大型语言模型（MDLLMs）通过并行遮蔽解码实现高并发生成，但其架构仍易受到多模态幻觉的影响。这种结构上的脆弱性源于算法缺陷：解码器根据文本可能性对候选词进行排序，而未验证局部视觉支持。我们证明这种仅语言的排序导致了目标不匹配，其中语言概率质量充当了对多模态任务的不恰当代理。因此，我们将幻觉重新解释为局部优化错误，即解码器利用语言捷径以最大化代理分数，而牺牲了视觉接地。为解决这种目标不匹配，我们引入了VISAGE，这是一种无需训练的解码框架，在推理时校准目标。VISAGE通过量化跨注意力分布的空间熵来估计代理差异。通过在注意力头之间强制执行定位共识，该方法惩罚空间均匀分布并重新排序词元承诺，以有利于视觉接地的结果。我们提供了一个分析性稳定性保证，表明在估计误差下VISAGE保持有界的目标损失。在幻觉敏感和通用基准上的评估表明该框架的鲁棒性，分别在MMMU-val上获得8.59%的相对增益，在HallusionBench上获得7.75%的相对增益。

Summary / 总结

The research addresses the issue of multimodal hallucinations in MDLLMs by introducing VISAGE, a decoding framework that calibrates the objective at inference time to better align with visual grounding. VISAGE estimates the proxy discrepancy using spatial entropy of cross-attention distributions and penalizes spatially uniform distributions to re-rank token commitments. Experiments show that VISAGE improves robustness, achieving relative gains of 8.59% on MMMU-val and 7.75% on HallusionBench.

论文通过引入训练免费的解码框架VISAGE来解决Multimodal Diffusion Large Language Models (MDLLMs)中的多模态幻觉问题。VISAGE通过估计交叉注意力分布的空间熵来校准目标，在推理时倾向于视觉接地的结果。评估结果显示，VISAGE提高了鲁棒性，在MMMU-val上取得了8.59%的相对增益，在HallusionBench上取得了7.75%的相对增益。

TRACE: Object Motion Editing in Videos with First-Frame Trajectory Guidance

Authors: Quynh Phung, Long Mai, Cusuh Ham, Feng Liu, Jia-Bin Huang, Aniruddha Mahapatra

First: 2026-03-26T17:50:42+00:00 · Latest: 2026-03-26T17:50:42+00:00

Comments: webpage: https://trace-motion.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

We study object motion path editing in videos, where the goal is to alter a target object's trajectory while preserving the original scene content. Unlike prior video editing methods that primarily manipulate appearance or rely on point-track-based trajectory control, which is often challenging for users to provide during inference, especially in videos with camera motion, we offer a practical, easy-to-use approach to controllable object-centric motion editing. We present Trace, a framework that enables users to design the desired trajectory in a single anchor frame and then synthesizes a temporally consistent edited video. Our approach addresses this task with a two-stage pipeline: a cross-view motion transformation module that maps first-frame path design to frame-aligned box trajectories under camera motion, and a motion-conditioned video re-synthesis module that follows these trajectories to regenerate the object while preserving the remaining content of the input video. Experiments on diverse real-world videos show that our method produces more coherent, realistic, and controllable motion edits than recent image-to-video and video-to-video methods.

中文标题/摘要

标题：TRACE: 视频中物体运动路径编辑的第一帧轨迹引导方法

我们研究视频中物体运动路径编辑的问题，目标是改变目标物体的轨迹同时保留原始场景内容。与之前主要操作外观或依赖基于点轨迹的路径控制的方法不同，这些方法在推断过程中用户往往难以提供，尤其是在有摄像机运动的视频中，我们提供了一种实用且易于使用的可控物体中心运动编辑方法。我们提出了Trace框架，允许用户在单个锚定帧中设计所需的轨迹，然后合成出时间上一致的编辑视频。我们的方法通过两阶段管道来解决此任务：一个跨视图运动变换模块，将第一帧路径设计映射到摄像机运动下的帧对齐的框轨迹，以及一个基于运动条件的视频重新合成模块，遵循这些轨迹再生物体同时保留输入视频的其余内容。在多种真实世界视频上的实验表明，我们的方法比最近的图像到视频和视频到视频方法生成了更连贯、更真实且更可控的运动编辑。

Summary / 总结

The research aims to edit the motion path of objects in videos while maintaining the original scene content. The proposed method, Trace, allows users to design the desired trajectory in a single anchor frame and then synthesize a temporally consistent edited video. The approach uses a two-stage pipeline: a cross-view motion transformation module and a motion-conditioned video re-synthesis module. Experiments demonstrate that Trace produces more coherent, realistic, and controllable motion edits compared to recent methods for image-to-video and video-to-video editing.

研究旨在编辑视频中物体的运动路径，同时保持原始场景内容不变。提出的Trace方法允许用户在单个锚定帧中设计所需的轨迹，然后合成一个时间上一致的编辑视频。该方法使用两阶段管道：一个跨视图运动转换模块，将第一帧路径设计映射到摄像机运动下的帧对齐的盒状轨迹，以及一个根据这些轨迹再生物体的运动条件视频重合成模块，同时保留输入视频的其余内容。实验表明，Trace生成的运动编辑更加连贯、真实且可控，优于最近的图像到视频和视频到视频方法。

Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training

Authors: Jinbo Xing, Zeyinzi Jiang, Yuxiang Tuo, Chaojie Mao, Xiaotang Gai, Xi Chen, Jingfeng Zhang, Yulin Pan, Zhen Han, Jie Xiao, Keyu Yan, Chenwei Xie, Chongyang Zhong, Kai Zhu, Tong Shen, Lianghua Huang, Yu Liu, Yujiu Yang

Venue: CVPR 2026

First: 2026-03-26T17:50:37+00:00 · Latest: 2026-03-26T17:50:37+00:00

Comments: CVPR 2026 Camera-ready, Webpage: https://doubiiu.github.io/projects/WanWeaver

Abs · PDF · Code1 · Code2 · Project1

Abstract

Recent unified models have made unprecedented progress in both understanding and generation. However, while most of them accept multi-modal inputs, they typically produce only single-modality outputs. This challenge of producing interleaved content is mainly due to training data scarcity and the difficulty of modeling long-range cross-modal context. To address this issue, we decompose interleaved generation into textual planning and visual consistency modeling, and introduce a framework consisting of a planner and a visualizer. The planner produces dense textual descriptions for visual content, while the visualizer synthesizes images accordingly. Under this guidance, we construct large-scale textual-proxy interleaved data (where visual content is represented in text) to train the planner, and curate reference-guided image data to train the visualizer. These designs give rise to Wan-Weaver, which exhibits emergent interleaved generation ability with long-range textual coherence and visual consistency. Meanwhile, the integration of diverse understanding and generation data into planner training enables Wan-Weaver to achieve robust task reasoning and generation proficiency. To assess the model's capability in interleaved generation, we further construct a benchmark that spans a wide range of use cases across multiple dimensions. Extensive experiments demonstrate that, even without access to any real interleaved data, Wan-Weaver achieves superior performance over existing methods.

中文标题/摘要

标题：Wan-Weaver: 通过解耦训练实现交错多模态生成

近期统一模型在理解和生成方面取得了前所未有的进展。然而，尽管大多数模型接受多模态输入，它们通常只生成单一模态的输出。这种交错内容生成的挑战主要源于训练数据稀缺和长距离跨模态上下文建模的难度。为了解决这一问题，我们将交错生成分解为文本规划和视觉一致性建模，并引入了一个由规划器和视觉器组成的框架。规划器生成视觉内容的密集文本描述，而视觉器据此合成图像。在这一指导下，我们构建了大规模的文本代理交错数据（其中视觉内容以文本形式表示）来训练规划器，并整理了参考指导图像数据来训练视觉器。这些设计催生了Wan-Weaver，它展示了具有长距离文本连贯性和视觉一致性的新兴交错生成能力。同时，将多样化的理解和生成数据整合到规划器训练中，使Wan-Weaver能够实现稳健的任务推理和生成能力。为了评估模型在交错生成方面的能力，我们进一步构建了一个涵盖多个维度广泛应用场景的基准。大量实验表明，即使没有访问任何真实交错数据，Wan-Weaver也优于现有方法。

Summary / 总结

Wan-Weaver addresses the challenge of producing interleaved multi-modal content by decomposing the task into textual planning and visual consistency modeling. It uses large-scale textual-proxy interleaved data to train the planner and reference-guided image data to train the visualizer. The model demonstrates strong interleaved generation ability with long-range textual coherence and visual consistency, and outperforms existing methods even without real interleaved data.

Wan-Weaver通过将任务分解为文本规划和视觉一致性建模来解决生成交错内容的挑战。它使用大规模的文本代理交错数据来训练规划器，并使用参考指导的图像数据来训练视觉器。实验结果表明，Wan-Weaver在长距离文本连贯性和视觉一致性方面优于现有方法，即使没有实际交错数据。

S2D2: Fast Decoding for Diffusion LLMs via Training-Free Self-Speculation

Authors: Ligong Han, Hao Wang, Han Gao, Kai Xu, Akash Srivastava

First: 2026-03-26T17:48:50+00:00 · Latest: 2026-03-26T17:48:50+00:00

Comments: Code is available at https://github.com/phymhan/S2D2

Abs · PDF · Code1 · Code2 · Code3

Abstract

Block-diffusion language models offer a promising path toward faster-than-autoregressive generation by combining block-wise autoregressive decoding with within-block parallel denoising. However, in the few-step regime needed for practical acceleration, standard confidence-thresholded decoding is often brittle: aggressive thresholds hurt quality, while conservative thresholds require unnecessary denoising steps. Existing approaches that address this issue either require additional training or incur extra test-time compute. We present S2D2, a training-free self-speculative decoding framework for block-diffusion language models. Our key observation is that a block-diffusion model becomes autoregressive when the block size is reduced to one, allowing the same pretrained model to act as both drafter and verifier. S2D2 inserts a speculative verification step into standard block-diffusion decoding and uses lightweight routing policies to decide when verification is worth its cost. This yields a hybrid decoding trajectory in which diffusion proposes tokens in parallel, while the autoregressive mode acts as a local sequence-level critic. Across three mainstream block-diffusion families, S2D2 consistently improves the accuracy-speed tradeoff over strong confidence-thresholding baselines. On SDAR, we observe up to 4.7× speedup over autoregressive decoding, and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, S2D2 remains complementary to built-in self-correction, including a conservative setting where it is 4.4× faster than the static baseline with slightly higher accuracy.
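The draft-then-verify pattern described above can be sketched with the generic greedy verification rule from speculative decoding. This is an illustration under stated assumptions, not S2D2's algorithm: the routing policies are omitted, `toy_next_token` is a hypothetical stand-in for the model's block-size-one (autoregressive) mode, and the verifier is called once per position for clarity, whereas a real system would typically score all drafted positions in one forward pass.

```python
from typing import Callable, List

def verify_draft(prefix: List[str], draft: List[str],
                 next_token: Callable[[List[str]], str]) -> List[str]:
    """Accept drafted tokens left to right while the verifier agrees.

    At the first disagreement, keep the verifier's token and stop.
    """
    accepted: List[str] = []
    for tok in draft:
        expected = next_token(prefix + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            accepted.append(expected)
            break
    return accepted

# Hypothetical verifier that always continues a fixed target sentence.
TARGET = ["the", "cat", "sat", "on", "the", "mat"]

def toy_next_token(context: List[str]) -> str:
    return TARGET[len(context)]

# The drafter proposed four tokens in parallel; the third one is wrong.
out = verify_draft(["the"], ["cat", "sat", "in", "a"], toy_next_token)
```

Here two drafted tokens are accepted and the third is corrected, so one verification round commits three tokens.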

中文标题/摘要

标题：S2D2：通过训练-free 自我推测解码加速扩散大语言模型

块扩散语言模型通过结合块级自回归解码和块内并行去噪，为比自回归生成更快的生成提供了有希望的途径。然而，在需要实际加速的少量步骤范围内，标准置信度阈值解码往往很脆弱：激进的阈值会损害质量，而保守的阈值则需要不必要的去噪步骤。现有解决此问题的方法要么需要额外的训练，要么在测试时增加计算量。我们提出了S2D2，这是一种块扩散语言模型的训练-free 自我推测解码框架。我们的主要观察是，当块大小减少到一个时，块扩散模型会变成自回归的，允许同一个预训练模型同时担任起草者和验证者。S2D2 在标准块扩散解码中插入一个推测性验证步骤，并使用轻量级路由策略来决定何时验证值得其成本。这产生了一种混合解码轨迹，在这种轨迹中，扩散并行提出令牌，而自回归模式则作为局部序列级批评者。在三种主流块扩散家族中，S2D2 在准确性和速度之间的一致性优于强大的置信度阈值基线。在SDAR上，我们观察到与自回归解码相比高达4.7倍的加速，与调优的动态解码基线相比高达1.57倍的加速，同时准确率提高多达4.5个点。在LLaDA2.1-Mini上，S2D2 与内置自我修正保持互补，包括一个保守的设置，在这种设置下，它比静态基线快4.4倍，准确率略高。

Summary / 总结

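S2D2 is a training-free self-speculative decoding framework for block-diffusion language models that improves the accuracy-speed tradeoff over confidence-thresholding baselines. It inserts a speculative verification step into block-diffusion decoding, with the same pretrained model acting as drafter and as verifier at block size one, and uses lightweight routing policies to decide when verification is worth its cost. On SDAR, S2D2 reaches up to 4.7× speedup over autoregressive decoding and up to 1.57× over a tuned dynamic decoding baseline while improving accuracy by up to 4.5 points. On LLaDA2.1-Mini, a conservative setting is 4.4× faster than the static baseline with slightly higher accuracy.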

S2D2 是一种无需训练的自我推测解码框架，用于块扩散语言模型，它在准确性和速度之间提供了优于标准置信阈值方法的权衡。通过插入推测性验证步骤并使用轻量级路由策略，S2D2 在 SDAR 上实现了高达 4.7 倍的速度提升，相比自回归解码，准确率提高了最多 4.5 个点。在 LLaDA2.1-Mini 上，S2D2 比静态基线快 4.4 倍，且准确率略高。

The Kitchen Loop: User-Spec-Driven Development for a Self-Evolving Codebase

Authors: Yannick Roy

First: 2026-03-26T17:45:00+00:00 · Latest: 2026-03-26T17:45:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Code production is now a commodity; the bottleneck is knowing what to build and proving it works. We present the Kitchen Loop, a framework for autonomous, self-evolving software built on a unified trust model: (1) a specification surface enumerating what the product claims to support; (2) 'As a User x 1000', where an LLM agent exercises that surface as a synthetic power user at 1,000x human cadence; (3) Unbeatable Tests, ground-truth verification the code author cannot fake; and (4) Drift Control, continuous quality measurement with automated pause gates. We validate across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle (methodology in Section 6.1). We observe emergent properties at scale: multi-iteration self-correction chains, autonomous infrastructure healing, and monotonically improving quality gates. The primitives are not new; our contribution is their composition into a production-tested system with the operational discipline that makes long-running autonomous evolution safe.

中文标题/摘要

标题：厨房循环：用户需求驱动的自我演进代码库开发

代码生产现在已成为商品；瓶颈在于知道要构建什么以及如何证明其有效性。我们提出了厨房循环，这是一种基于统一信任模型的自主、自我演进软件框架：(1) 规范表面，列出产品声称支持的功能；(2) “作为用户 x 1000”，其中LLM代理以1000倍的人类速度作为合成超级用户来操作该表面；(3) 无敌测试，代码作者无法伪造的真实验证；(4) 偏离控制，持续的质量测量与自动暂停闸门。我们在两个生产系统上进行了验证，经过285多次迭代，产生了1094多个合并的拉取请求，且未检测到回归（方法论见第6.1节）。我们观察到大规模的新兴特性：多迭代自我纠正链、自主基础设施自我修复以及单调递增的质量门限。这些基本元素并非全新，我们的贡献在于将它们组合成一个经过生产测试的系统，并通过操作纪律使其长期自主演化变得安全。

Summary / 总结

The Kitchen Loop is a framework for autonomous, self-evolving software that targets the problem of knowing what to build and proving it works. Its unified trust model combines a specification surface, an LLM agent that exercises that surface as a synthetic power user, Unbeatable Tests that the code author cannot fake, and Drift Control with automated pause gates. The framework was validated across two production systems over 285+ iterations, producing 1,094+ merged pull requests with zero regressions detected by the regression oracle. Reported emergent properties include multi-iteration self-correction chains, autonomous infrastructure healing, and monotonically improving quality gates.

Kitchen Loop 是一个自主演化软件的框架，旨在解决知道要构建什么和证明其正确性的问题。它包括规格表、合成超级用户操作、不可伪造的测试和漂移控制。在两个生产系统中，它生成了1,094个合并的拉取请求，并且没有检测到回归，展示了如自我纠正和质量逐步提升等规模化后涌现的特性。

Do Language Models Follow Occam's Razor? An Evaluation of Parsimony in Inductive and Abductive Reasoning

Authors: Yunxin Sun, Abulhair Saparov

First: 2025-09-03T14:22:42+00:00 · Latest: 2026-03-26T17:43:07+00:00

Abs · PDF · Code1 · Code2

Abstract

Non-deductive reasoning, encompassing inductive and abductive reasoning, is essential in addressing complex real-world questions. One key feature of inductive and abductive reasoning is that there are many valid hypotheses; the simplest ones (those that adhere to Occam's Razor) are often most useful. However, this aspect is ignored in recent work that evaluates the non-deductive reasoning capabilities of large language models (LLMs). This work fills this gap, focusing on understanding whether the inductive and abductive reasoning capabilities of LLMs adhere to Occam's Razor, while also examining the correctness of their reasoning. To accomplish this goal, we introduce a framework that (a) synthetically generates reasoning questions requiring inductive and abductive reasoning simultaneously and (b) is readily extended to produce any abductive/inductive reasoning question expressible in first-order logic. The task for the intelligent agent is to produce hypotheses to explain observations under a given world model. We also propose a new automated metric to assess whether hypotheses quantitatively adhere to Occam's Razor; those hypotheses that are correct and simplest are considered high-quality. Our findings on state-of-the-art LLMs suggest that LLMs can perform inductive and abductive reasoning in simple scenarios, but struggle with complex world models and with producing high-quality hypotheses, even with popular reasoning-enhancing techniques such as in-context learning and RLVR.

中文标题/摘要

标题：语言模型遵循奥卡姆剃刀吗？归纳和 abduction 推理简洁性的评估

非演绎推理，包括归纳和 abduction 推理，在解决复杂现实问题时至关重要。归纳和 abduction 推理的一个关键特征是存在许多有效假设；最简单的假设（即遵循奥卡姆剃刀原则的假设）通常最有用。然而，这一方面在最近评估大型语言模型（LLMs）非演绎推理能力的工作中被忽视了。本研究填补了这一空白，专注于理解 LLMs 的归纳和 abduction 推理能力是否遵循奥卡姆剃刀原则，同时检查其推理的正确性。为了实现这一目标，我们引入了一个框架来合成生成需要同时进行归纳推理和 abduction 推理的问题；该框架可以轻松扩展以生成任何可表达在一阶逻辑中的归纳/ abduction 推理问题。智能代理的任务是在给定的世界模型下生成假设以解释观察结果。我们还提出了一种新的自动化评估指标，以评估假设是否定量遵循奥卡姆剃刀原则；那些正确且最简单的假设被认为是高质量的。我们的研究结果表明，LLMs 可以在简单场景中进行归纳和 abduction 推理，但在处理复杂世界模型和生成高质量假设方面存在困难，即使使用了诸如上下文学习和 RLVR 等流行的推理增强技术。

Summary / 总结

This study evaluates whether large language models (LLMs) follow Occam's Razor in inductive and abductive reasoning tasks. It introduces a framework to generate synthetic questions requiring both types of reasoning and proposes a metric to assess hypothesis simplicity. The research finds that LLMs can handle simple scenarios but struggle with complex world models and producing high-quality hypotheses, even with reasoning-enhancing techniques like in-context learning and RLVR.

该研究评估大型语言模型（LLMs）在归纳和溯因推理中是否遵循奥卡姆剃刀原则。它引入了一个生成合成推理问题的框架，这些问题需要同时进行这两种推理，并提出了一种评估假设的简洁性和正确性的指标。研究发现，LLMs 在处理简单场景时可以应对，但在复杂世界模型和生成高质量假设方面存在困难，即使使用了如上下文学习和RLVR等推理增强技术。

Instruction Following by Principled Boosting Attention of Large Language Models

Authors: Vitoria Guardieiro, Avishree Khare, Adam Stein, Eric Wong

First: 2025-06-16T17:42:35+00:00 · Latest: 2026-03-26T17:42:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models' behavior is often shaped by instructions such as system prompts, refusal boundaries, privacy constraints, and tool-use rules that must hold at inference time. Yet in practice these constraints can be violated under long contexts or when user-provided context conflicts with them, creating reliability and safety risks. This motivates inference-time interventions that strengthen instruction influence without retraining. One such intervention is attention steering, which biases attention toward instruction tokens. In this work, we present a unifying theory for attention steering methods by formalizing instruction following as rule-based competition between instruction rules and context-derived rules, with attention mediating which rules dominate. We prove that boosting attention to instruction tokens tilts this competition, making it harder for context to override instruction-following. However, excessive boosting can suppress task-relevant context that should be incorporated alongside the instruction. Guided by this theory, we propose Instruction Attention Boosting (InstABoost), a simple intervention that applies a constant additive bias to instruction-key attention logits across all layers and heads. We evaluate InstABoost against prompting, latent steering, and prior attention steering methods across 15 tasks. InstABoost matches or outperforms all baselines while avoiding the fluency collapse of latent methods and the instruction over-focus of prior attention methods, achieving a stronger steering-quality tradeoff.
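
The core operation described in the abstract, a constant additive bias on instruction-key attention logits, can be illustrated with a minimal sketch. This is not the authors' implementation: it covers a single query row in pure Python, and the function names, the toy logits, and the bias value of 2.0 are all assumptions chosen for illustration. The abstract states that the bias is applied across all layers and heads, which this sketch does not model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def boost_instruction_attention(logits, is_instruction, bias):
    """Add a constant bias to the logits of instruction keys, then normalize.

    `is_instruction` marks which key positions belong to the instruction
    (hypothetical mask; in practice it would come from the prompt layout).
    """
    boosted = [x + bias if flag else x for x, flag in zip(logits, is_instruction)]
    return softmax(boosted)

def instruction_mass(weights, is_instruction):
    """Total attention weight that lands on instruction keys."""
    return sum(w for w, flag in zip(weights, is_instruction) if flag)

# Toy pre-softmax logits for one query over four keys (assumed values).
logits = [0.5, 0.2, 1.0, 0.8]
is_instruction = [True, True, False, False]

plain = boost_instruction_attention(logits, is_instruction, bias=0.0)
boosted = boost_instruction_attention(logits, is_instruction, bias=2.0)

print("instruction mass without boost:", instruction_mass(plain, is_instruction))
print("instruction mass with boost:   ", instruction_mass(boosted, is_instruction))
```

Adding a bias b to a logit is equivalent to multiplying that key's unnormalized weight by e^b, so the relative ordering among context keys is unchanged while their total share shrinks. This matches the tradeoff the abstract describes: a larger bias strengthens the instruction but can suppress task-relevant context.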

中文标题/摘要

标题：基于原则性增强注意力的大语言模型指令遵循

大语言模型的行为通常由系统提示、拒绝边界、隐私约束和工具使用规则等指令塑造，这些约束必须在推理时保持。然而，在实践中，这些约束在长上下文或用户提供的上下文与它们冲突时可能会被违反，从而带来可靠性和安全性风险。这促使了在推理时加强指令影响的干预措施，而无需重新训练。一种这样的干预措施是注意力引导，它会偏向于指令标记的注意力。在本文中，我们通过将指令遵循形式化为基于规则的竞争，即指令规则与上下文衍生规则之间的竞争，并由注意力调解哪种规则占主导地位，来提出一种统一的注意力引导理论。我们证明，增强对指令标记的注意力会倾斜这种竞争，使上下文更难覆盖指令遵循。然而，过度增强可能会抑制应与指令一起纳入的任务相关上下文。根据这一理论，我们提出了指令注意力增强（InstABoost），这是一种简单的干预措施，它在整个层和头中对指令关键注意力对数应用恒定的附加偏置。我们在15个任务上将InstABoost与提示、潜在引导和先前的注意力引导方法进行了评估。InstABoost在所有基线方法中表现相当或更优，同时避免了潜在方法的流畅性崩溃和先前注意力方法的指令过度关注，实现了更强的引导质量权衡。

Summary / 总结

This work addresses the problem that large language models can violate instructions under long contexts or when user-provided context conflicts with them, and proposes Instruction Attention Boosting (InstABoost) to strengthen instruction influence at inference time without retraining. InstABoost applies a constant additive bias to instruction-key attention logits across all layers and heads. Experiments across 15 tasks show that InstABoost matches or outperforms prompting, latent steering, and prior attention steering baselines while avoiding the fluency collapse of latent methods and the instruction over-focus of prior attention methods, achieving a stronger steering-quality tradeoff.

该研究针对大型语言模型因长上下文或用户提供的信息冲突而可能违反指令的问题，提出了增强模型指令遵循性的方法——指令注意力增强（InstABoost）。通过提升对指令标记的关注度，InstABoost 确保指令被更可靠地遵循，同时避免抑制可能有益的任务相关信息。在15个任务上的评估表明，InstABoost 在保持良好的引导质量与流畅性平衡方面优于或匹配其他方法。

CodeRefine: A Pipeline for Enhancing LLM-Generated Code Implementations of Research Papers

Authors: Ekaterina Trofimova, Emil Sataev, Abhijit Singh Jowhari

First: 2024-08-23T20:51:04+00:00 · Latest: 2026-03-26T17:41:14+00:00

Comments: The results mentioned in the paper are non-reproducible. We have rechecked the metrics, and they do not match with the ones that have been provided in the paper. Therefore, we accept that this article is neither suitable nor up to the mark for the scientific community and must be withdrawn. We fully understand the consequences, and would like to wishfully retract this article

Abs · PDF · Code1 · Code2

Abstract

This paper presents CodeRefine, a novel framework for automatically transforming research paper methodologies into functional code using Large Language Models (LLMs). Our multi-step approach first extracts and summarizes key text chunks from papers, analyzes their code relevance, and creates a knowledge graph using a predefined ontology. Code is then generated from this structured representation and enhanced through a proposed retrospective retrieval-augmented generation approach. CodeRefine addresses the challenge of bridging theoretical research and practical implementation, offering a more accurate alternative to LLM zero-shot prompting. Evaluations on diverse scientific papers demonstrate CodeRefine's ability to improve code implementation from the paper, potentially accelerating the adoption of cutting-edge algorithms in real-world applications.

中文标题/摘要

标题：CodeRefine：一种增强LLM生成的研究论文代码实现的流水线

本文介绍了CodeRefine，一种使用大型语言模型（LLMs）自动将研究论文方法论转换为功能性代码的新型框架。我们的多步方法首先从论文中提取和总结关键文本片段，分析其代码相关性，并使用预定义的本体构建知识图谱。然后从这种结构化表示生成代码，并通过提出的回顾性检索增强生成方法进行增强。CodeRefine 解决了理论研究与实际实现之间的鸿沟，提供了比LLM零样本提示更准确的替代方案。对多种科学论文的评估表明，CodeRefine 能够改进从论文中生成的代码实现，有可能加速前沿算法在实际应用中的采用。

The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition

Authors: Yuwen Tan, Yuan Qing, Boqing Gong

Venue: CVPR 2026

First: 2025-05-30T17:40:46+00:00 · Latest: 2026-03-26T17:38:41+00:00

Comments: Accepted to CVPR 2026. Project page and code: https://yuanqing-ai.github.io/llm-hierarchy/

Abs · PDF · Code1 · Code2 · Project1

Abstract

This paper reveals that many open-source large language models (LLMs) lack hierarchical knowledge about our visual world, unaware of even well-established biology taxonomies. This shortcoming makes LLMs a bottleneck for vision LLMs' hierarchical visual recognition (e.g., recognizing Anemone Fish but not Vertebrate). We arrive at these findings using about one million four-choice visual question answering (VQA) tasks constructed from six taxonomies and four image datasets. Interestingly, finetuning a vision LLM using our VQA tasks reaffirms LLMs' bottleneck effect because the VQA tasks improve the LLMs' hierarchical consistency more than the vision LLMs'. We conjecture that one cannot make open-source vision LLMs understand visual concepts hierarchically until LLMs possess corresponding taxonomy knowledge.

中文标题/摘要

标题：LLM瓶颈：开源视觉LLM为何在层次视觉识别中挣扎

本文揭示了许多开源大型语言模型（LLM）缺乏关于视觉世界的层次知识，甚至不了解早已确立的生物学分类体系。这一缺陷使得LLM成为视觉LLM层次视觉识别的瓶颈（例如，能识别海葵鱼但不能识别脊椎动物）。我们基于六个分类体系和四个图像数据集构建了约一百万个四选一的视觉问答（VQA）任务，并由此得出这些结论。有趣的是，使用我们的VQA任务微调视觉LLM再次印证了LLM的瓶颈效应，因为这些VQA任务对LLM层次一致性的提升超过了对视觉LLM的提升。我们推测，在LLM具备相应的分类学知识之前，无法使开源视觉LLM以层次化的方式理解视觉概念。

Summary / 总结

This paper investigates why open-source vision LLMs struggle with hierarchical visual recognition, finding that many open-source large language models (LLMs) lack hierarchical knowledge of the visual world, including well-established biology taxonomies. The study uses about one million four-choice visual question answering (VQA) tasks built from six taxonomies and four image datasets, and shows that this gap limits hierarchical recognition, for example recognizing Anemone Fish but not Vertebrate. Finetuning a vision LLM on these tasks improves the LLM's hierarchical consistency more than the vision LLM's, which reaffirms the LLM as the bottleneck. The authors conjecture that open-source vision LLMs cannot understand visual concepts hierarchically until their LLMs possess the corresponding taxonomy knowledge.

该论文探讨了为什么许多开源视觉LLM在层次视觉识别方面存在困难，发现许多开源大型语言模型（LLMs）缺乏关于视觉世界层次结构的知识，如生物学分类。作者使用基于六个分类体系和四个图像数据集构建的约一百万个四选一视觉问答（VQA）任务来揭示这一问题。使用这些任务微调视觉LLM时，LLM层次一致性的提升超过视觉LLM，再次印证了LLM的瓶颈效应，表明LLM需要先具备相应的分类学知识，视觉LLM才能理解视觉概念的层次结构。

Enhancing Automatic Chord Recognition via Pseudo-Labeling and Knowledge Distillation

Authors: Nghia Phan, Rong Jin, Gang Liu, Xiao Dong

First: 2026-02-23T12:32:53+00:00 · Latest: 2026-03-26T17:38:09+00:00

Comments: 9 pages, 6 figures, 3 tables

Abs · PDF · Code1 · Code2

Abstract

Automatic Chord Recognition (ACR) is constrained by the scarcity of aligned chord labels, as well-aligned annotations are costly to acquire. At the same time, open-weight pre-trained models are currently more accessible than their proprietary training data. In this work, we present a two-stage training pipeline that leverages pre-trained models together with unlabeled audio. The proposed method decouples training into two stages. In the first stage, we use a pre-trained BTC model as a teacher to generate pseudo-labels for over 1,000 hours of diverse unlabeled audio and train a student model solely on these pseudo-labels. In the second stage, the student is continually trained on ground-truth labels as they become available. To prevent catastrophic forgetting of the representations learned in the first stage, we apply selective knowledge distillation (KD) from the teacher as a regularizer. In our experiments, two models (BTC, 2E1D) were used as students. In stage 1, using only pseudo-labels, the BTC student achieves over 98% of the teacher's performance, while the 2E1D model achieves about 96% across seven standard mir_eval metrics. After a single training run for both students in stage 2, the resulting BTC student model surpasses the traditional supervised learning baseline by 2.5% and the original pre-trained teacher model by 1.55% on average across all metrics. The resulting 2E1D student model improves over the traditional supervised learning baseline by 2.67% on average and achieves almost the same performance as the teacher. Both cases show large gains on rare chord qualities.
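
The stage-2 objective described above, supervised training on ground-truth labels with knowledge distillation from the teacher as a regularizer, can be sketched for a single audio frame as follows. The abstract does not say how the "selective" part of the distillation is decided, so the rule used here (distill only when the teacher's top probability reaches a threshold) is purely an assumption, as are the names `stage2_loss`, `kd_weight`, and `min_teacher_conf` and all numeric values.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the ground-truth chord class."""
    return -math.log(probs[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over chord classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stage2_loss(student_probs, teacher_probs, label,
                kd_weight=0.5, min_teacher_conf=0.6):
    """Supervised loss plus an optional distillation term.

    Assumed selection rule: the KD term is added only when the teacher is
    confident on this frame. The paper's actual criterion may differ.
    """
    loss = cross_entropy(student_probs, label)
    if max(teacher_probs) >= min_teacher_conf:
        loss += kd_weight * kl_divergence(teacher_probs, student_probs)
    return loss

# Toy per-frame distributions over three chord classes (assumed values).
student = [0.7, 0.2, 0.1]
teacher_confident = [0.8, 0.1, 0.1]
teacher_unsure = [0.4, 0.3, 0.3]

loss_confident = stage2_loss(student, teacher_confident, label=0)
loss_unsure = stage2_loss(student, teacher_unsure, label=0)
print(loss_confident, loss_unsure)
```

With a confident teacher the student is pulled toward the teacher's distribution in addition to the label. With an unsure teacher only the supervised term remains, so the teacher's low-confidence predictions do not act as a regularizer on that frame.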

中文标题/摘要

标题：通过伪标签和知识蒸馏增强自动和弦识别

自动和弦识别（ACR）受限于对齐和弦标签的稀缺性，因为准确的对齐注释成本高昂。同时，开源预训练模型目前比其专有训练数据更容易获得。在本文中，我们提出了一种两阶段训练管道，利用预训练模型和未标记的音频。所提出的方法将训练分为两个阶段。在第一阶段，我们使用预训练的BTC模型作为教师，为超过1000小时的多样化未标记音频生成伪标签，并仅使用这些伪标签训练学生模型。在第二阶段，随着真实标签的可用性，学生模型持续训练。为了防止第一阶段学习的表示灾难性遗忘，我们应用选择性知识蒸馏（KD）作为正则化器。在我们的实验中，使用了两种模型（BTC，2E1D）作为学生。在第一阶段，仅使用伪标签，BTC学生模型达到了教师性能的98%以上，而2E1D模型在七个标准mir_eval指标上达到了约96%。在第二阶段，对学生模型进行一次训练后，BTC学生模型在所有指标上平均超越传统监督学习基线2.5%，并超越原始预训练教师模型1.55%。2E1D学生模型在平均上超越传统监督学习基线2.67%，并几乎达到教师的性能。两种情况都显示了在稀有和弦品质上的巨大收益。

Summary / 总结

This study addresses the scarcity of aligned chord labels in automatic chord recognition (ACR) with a two-stage training pipeline. In the first stage, a pre-trained BTC teacher generates pseudo-labels for over 1,000 hours of unlabeled audio, and a student model is trained solely on these pseudo-labels. In the second stage, the student continues training on ground-truth labels, with selective knowledge distillation from the teacher as a regularizer against catastrophic forgetting. With BTC and 2E1D as students, stage 1 reaches over 98% and about 96% of the teacher's performance, respectively. After stage 2, the BTC student outperforms the supervised learning baseline by 2.5% and the teacher by 1.55% on average, while the 2E1D student improves on the baseline by 2.67% and nearly matches the teacher. Both students show large gains on rare chord qualities.

该研究针对自动和弦识别（ACR）中由于缺乏对齐的和弦标签而面临的挑战，提出了一种两阶段训练管道。方法使用预训练模型作为教师生成未标记音频的伪标签，然后用这些伪标签训练学生模型。在第二阶段，学生模型进一步使用真实标签进行训练。BTC和2E1D模型被用作学生，在第一阶段表现出高性能，并在第二阶段超越了传统的监督学习基线，特别是在稀有和弦品质方面。

Just Zoom In: Cross-View Geo-Localization via Autoregressive Zooming

Authors: Yunus Talha Erzurumlu, Jiyong Kwag, Alper Yilmaz

First: 2026-03-26T17:36:33+00:00 · Latest: 2026-03-26T17:36:33+00:00

Comments: 18 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Cross-view geo-localization (CVGL) estimates a camera's location by matching a street-view image to geo-referenced overhead imagery, enabling GPS-denied localization and navigation. Existing methods almost universally formulate CVGL as an image-retrieval problem in a contrastively trained embedding space. This ties performance to large batches and hard negative mining, and it ignores both the geometric structure of maps and the coverage mismatch between street-view and overhead imagery. In particular, salient landmarks visible from the street view can fall outside a fixed satellite crop, making retrieval targets ambiguous and limiting explicit spatial inference over the map. We propose Just Zoom In, an alternative formulation that performs CVGL via autoregressive zooming over a city-scale overhead map. Starting from a coarse satellite view, the model takes a short sequence of zoom-in decisions to select a terminal satellite cell at a target resolution, without contrastive losses or hard negative mining. We further introduce a realistic benchmark with crowd-sourced street views and high-resolution satellite imagery that reflects real capture conditions. On this benchmark, Just Zoom In achieves state-of-the-art performance, improving Recall@1 within 50 m by 5.5% and Recall@1 within 100 m by 9.6% over the strongest contrastive-retrieval baseline. These results demonstrate the effectiveness of sequential coarse-to-fine spatial reasoning for cross-view geo-localization.
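
The coarse-to-fine procedure described above can be sketched as a sequence of discrete choices over a recursively subdivided map. The abstract does not specify how each view is partitioned or how choices are scored, so the 2x2 split, the `score_fn` interface, and the oracle scorer below are assumptions for illustration only. In the actual method a learned model conditioned on the street-view image would take the place of `score_fn`.

```python
def split_cell(cell):
    """Split a cell (x0, y0, x1, y1) into four equal quadrants."""
    x0, y0, x1, y1 = cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    return [
        (x0, y0, xm, ym),  # choice 0
        (xm, y0, x1, ym),  # choice 1
        (x0, ym, xm, y1),  # choice 2
        (xm, ym, x1, y1),  # choice 3
    ]

def contains(cell, point):
    """True if the point lies inside the half-open cell."""
    x0, y0, x1, y1 = cell
    x, y = point
    return x0 <= x < x1 and y0 <= y < y1

def zoom_in(root, score_fn, steps):
    """Take `steps` zoom-in decisions, each conditioned on the path so far."""
    cell, path = root, []
    for _ in range(steps):
        children = split_cell(cell)
        scores = [score_fn(child, path) for child in children]
        choice = max(range(len(children)), key=scores.__getitem__)
        path.append(choice)
        cell = children[choice]
    return cell, path

# Oracle scorer standing in for a learned model (assumption): it knows the
# true camera location and scores the quadrant that contains it highest.
target = (0.7, 0.2)
oracle = lambda child, path: 1.0 if contains(child, target) else 0.0

final_cell, path = zoom_in((0.0, 0.0, 1.0, 1.0), oracle, steps=3)
print(path, final_cell)
```

Each decision narrows the search to one quadrant of the previous view, so three steps select one of 64 terminal cells. The output is a map cell rather than a retrieved image, which is the difference from the contrastive-retrieval formulation that the abstract contrasts against.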

中文标题/摘要

标题：只需放大：通过自回归放大进行跨视图地理定位

跨视图地理定位（CVGL）通过将街道视图图像与地理参考的航拍图像匹配来估计相机的位置，从而实现GPS受限的定位和导航。现有方法几乎都将CVGL普遍形式化为对比训练嵌入空间中的图像检索问题。这将性能与大批次和硬负样本挖掘联系起来，并忽略了地图的几何结构以及街道视图和航拍图像之间的覆盖不匹配。特别是，从街道视图可见的显著地标可能位于固定卫星裁剪之外，使检索目标变得模糊，并限制了对地图的显式空间推理。我们提出了一种替代方案Just Zoom In，通过在城市规模的航拍地图上进行自回归放大来进行CVGL。从粗略的卫星视图开始，模型通过一系列放大决策选择目标分辨率的终端卫星单元，而无需对比损失或硬负样本挖掘。我们还引入了一个基于众包街道视图和高分辨率卫星图像的现实基准，反映了实际拍摄条件。在该基准上，Just Zoom In 达到了最先进的性能，Recall@1 在50米内的提升为5.5%，在100米内的提升为9.6%，超过最强的对比检索基线。这些结果表明，顺序的粗细空间推理对于跨视图地理定位的有效性。

Summary / 总结

The paper addresses the challenge of cross-view geo-localization (CVGL) by proposing Just Zoom In, which formulates the task as autoregressive zooming over a city-scale overhead map. Unlike existing methods that rely on contrastive learning and hard negative mining, Just Zoom In starts from a coarse satellite view and makes a series of zoom-in decisions to select a terminal satellite cell at the target resolution. The authors introduce a realistic benchmark and show that Just Zoom In outperforms the strongest contrastive-retrieval baseline, improving Recall@1 within 50 meters by 5.5% and within 100 meters by 9.6%. This demonstrates the effectiveness of sequential coarse-to-fine spatial reasoning in CVGL.

论文提出Just Zoom In方法，通过在城市规模的航拍地图上进行自回归缩放来估计相机的位置。该方法避免使用对比损失和硬负样本挖掘，专注于逐步从粗到细的空间推理。在新的基准测试中，Just Zoom In优于现有方法，分别在50米和100米内的Recall@1上提高了5.5%和9.6%。

Analysing Environmental Efficiency in AI for X-Ray Diagnosis

Authors: Liam Kearns

Venue: Journal of AI 10 (2026) 37-55

First: 2025-10-31T14:19:57+00:00 · Latest: 2026-03-26T17:32:49+00:00

Comments: Accepted for publication in Journal of AI. The final published version is available at https://doi.org/10.61969/jai.1838517

Abs · PDF · Code1 · Code2

Abstract

The integration of AI tools into medical applications has aimed to improve the efficiency of diagnosis. The emergence of large language models (LLMs), such as ChatGPT and Claude, has expanded this integration even further despite a concern for their environmental impact. Because of LLM versatility and ease of use through APIs, these larger models are often utilised even though smaller, custom models can be used instead. In this paper, LLMs and small discriminative models are integrated into a Mendix application to detect Covid-19 in chest X-rays. These discriminative models are also used to provide knowledge bases for LLMs to improve accuracy. This provides a benchmark study of 14 different model configurations for comparison of diagnostic accuracy and environmental impact. The findings indicated that while smaller models reduced the carbon footprint of the application, the output was biased towards a positive diagnosis and the output probabilities were lacking confidence. Meanwhile, restricting LLMs to only give probabilistic output caused poor performance in both accuracy and carbon footprint, demonstrating the risk of using LLMs as a universal AI solution. While using the smaller LLM GPT-4.1-Nano reduced the carbon footprint by 94.2% compared to the larger models, this was still disproportionate to the discriminative models; the most efficient solution was the Covid-Net model. Although it had a larger carbon footprint than other small models, its carbon footprint was 99.9% less than when using GPT-4.5-Preview, whilst achieving an accuracy of 95.5%, the highest of all models examined. This paper contributes to knowledge by comparing generative and discriminative models in Covid-19 detection as well as highlighting the environmental risk of using generative tools for classification tasks.

中文标题/摘要

标题：分析AI在X射线诊断中的环境效率

将AI工具集成到医疗应用中旨在提高诊断效率。尽管大型语言模型（LLMs）如ChatGPT和Claude的出现进一步扩展了这种集成，但对其环境影响的担忧仍然存在。由于LLMs的多功能性和通过API使用的便捷性，这些较大的模型经常被使用，尽管较小的定制模型可以替代。在本文中，LLMs和小型判别模型被集成到一个Mendix应用程序中，用于检测胸部X射线中的新冠肺炎。这些判别模型还用于为LLMs提供知识库，以提高准确性。这为14种不同模型配置的诊断准确性和环境影响提供了基准研究。研究结果表明，虽然较小的模型减少了应用程序的碳足迹，但输出偏向于阳性诊断，输出概率缺乏信心。同时，限制LLMs仅提供概率输出导致在准确性和碳足迹方面表现不佳，表明使用LLMs作为通用AI解决方案的风险。使用较小的LLM GPT-4.1-Nano相比较大模型减少了94.2%的碳足迹，但与判别模型相比仍不成比例；最高效的解决方案是Covid-Net模型。尽管其碳足迹比其他小型模型大，但其碳足迹比使用GPT-4.5-Preview减少了99.9%，同时准确率达到95.5%，这是所有研究模型中最高的。本文通过比较生成性和判别性模型在新冠肺炎检测中的应用，以及强调使用生成工具进行分类任务的环境风险，为知识做出了贡献。

Summary / 总结

This paper investigates the environmental efficiency of AI tools for X-ray diagnosis, comparing large language models (LLMs) with smaller discriminative models for detecting Covid-19 in chest X-rays. The study benchmarks 14 model configurations on diagnostic accuracy and environmental impact. Smaller models reduced the carbon footprint, but their output was biased towards a positive diagnosis and their output probabilities lacked confidence. Restricting LLMs to probabilistic output led to poor performance in both accuracy and carbon footprint. The most efficient solution was the Covid-Net model, whose carbon footprint was 99.9% lower than that of GPT-4.5-Preview while achieving the highest accuracy of all models examined, 95.5%. The work highlights the environmental risk of using generative tools for classification tasks.

本文研究了将AI工具集成到医疗应用中进行X光诊断的环境效率，特别是大型语言模型（LLMs）和较小的判别模型的应用。研究比较了14种不同的模型配置，以评估诊断准确性和环境影响。主要发现包括，较小的模型可以显著减少碳足迹，但可能会出现偏差且缺乏信心。限制LLMs仅给出概率输出也会对准确性和碳足迹产生负面影响。最高效的解决方案是Covid-Net模型，其碳足迹比GPT-4.5-Preview低99.9%，同时准确率达到95.5%，这是所有模型中最高的。

Self-Improvement of Large Language Models: A Technical Overview and Future Outlook

Authors: Haoyan Yang, Mario Xerri, Solha Park, Huajian Zhang, Yiyang Feng, Sai Akhil Kogilathota, Jiawei Zhou

First: 2026-03-26T17:32:37+00:00 · Latest: 2026-03-26T17:32:37+00:00

Abs · PDF · Code1 · Code2

Abstract

As large language models (LLMs) continue to advance, improving them solely through human supervision is becoming increasingly costly and limited in scalability. As models approach human-level capabilities in certain domains, human feedback may no longer provide sufficiently informative signals for further improvement. At the same time, the growing ability of models to make autonomous decisions and execute complex actions naturally enables abstractions in which components of the model development process can be progressively automated. Together, these challenges and opportunities have driven increasing interest in self-improvement, where models autonomously generate data, evaluate outputs, and iteratively refine their own capabilities. In this paper, we present a system-level perspective on self-improving language models and introduce a unified framework that organizes existing techniques. We conceptualize the self-improvement system as a closed-loop lifecycle, consisting of four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, along with an autonomous evaluation layer. Within this framework, the model itself plays a central role in driving each stage: collecting or generating data, selecting informative signals, updating its parameters, and refining outputs, while the autonomous evaluation layer continuously monitors progress and guides the improvement cycle across stages. Following this lifecycle perspective, we systematically review and analyze representative methods for each component from a technical standpoint. We further discuss current limitations and outline our vision for future research toward fully self-improving LLMs.

中文标题/摘要

标题：大型语言模型的自我提升：技术综述与未来展望

随着大型语言模型（LLMs）的不断进步，仅通过人类监督来改进它们的成本越来越高，且在可扩展性方面也受到限制。当模型在某些领域接近人类水平的能力时，人类反馈可能不再能提供足够的信息来进一步改进。同时，模型越来越能够自主做出决策并执行复杂操作，自然地使模型开发过程中的组件能够逐步自动化。这些挑战和机遇共同推动了对自我提升的兴趣增加，在自我提升中，模型自主生成数据、评估输出并迭代地完善自身能力。在本文中，我们从系统层面介绍了自我提升的语言模型，并引入了一个统一框架来组织现有的技术。我们将自我提升系统视为一个闭环生命周期，由四个紧密耦合的过程组成：数据获取、数据选择、模型优化和推理精炼，以及一个自主评估层。在这个框架中，模型本身在每个阶段都发挥着核心作用：收集或生成数据、选择信息信号、更新其参数并精炼输出，而自主评估层则持续监控进展并引导跨阶段的改进循环。遵循这一生命周期视角，我们从技术角度系统地回顾和分析了每个组件的代表性方法。我们进一步讨论了当前的局限性，并概述了未来研究以实现完全自我提升的LLMs的愿景。

Summary / 总结

This paper presents a system-level overview of self-improving large language models (LLMs), motivated by the rising cost and limited scalability of improvement through human supervision alone. It organizes existing techniques into a unified framework that treats self-improvement as a closed-loop lifecycle with four tightly coupled processes: data acquisition, data selection, model optimization, and inference refinement, together with an autonomous evaluation layer that monitors progress and guides the cycle. Within this framework the model itself drives each stage. The paper reviews representative methods for each component, discusses current limitations, and outlines a vision for future research toward fully self-improving LLMs.

本文探讨了在超越人类监督的情况下改进大型语言模型（LLMs）的挑战，提出了一种自我改进框架。该方法涉及一个闭环生命周期，包括数据获取、数据选择、模型优化和推理精炼四个过程，均由模型自身驱动。关键发现包括模型在生成和评估数据中的核心作用，以及自主评估层持续监控进展并引导改进循环。研究还指出了当前的局限性，并概述了未来研究的方向，以实现完全自我改进的LLMs。
