arXiv 论文速递

Snapshot: 20260320_0351

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Authors: Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee

First: 2026-03-18T17:59:56+00:00 · Latest: 2026-03-18T17:59:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perception tasks such as action recognition and object segmentation, without adapting to downstream vision-language tasks; or (2) only within the LLM while leaving the ViT output intact, often requiring complex text-conditioned token selection mechanisms. In this paper, we introduce Spatio-Temporal Token Scoring (STTS), a simple and lightweight module that prunes vision tokens across both the ViT and the LLM without text conditioning or token merging, and is fully compatible with end-to-end training. By learning how to score temporally via an auxiliary loss and spatially via LLM downstream gradients, aided by our efficient packing algorithm, STTS prunes 50% of vision tokens throughout the entire architecture, resulting in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in average performance across 13 short and long video QA tasks. Efficiency gains increase with more sampled frames per video. Applying test-time scaling for long-video QA further yields performance gains of 0.5-1% compared to the baseline. Overall, STTS represents a novel, simple yet effective technique for unified, architecture-wide vision token pruning.

中文标题/摘要

标题：统一时空令牌评分以提高视频VLMs的效率

令牌剪枝对于提高视觉语言模型（VLMs）的计算效率至关重要，特别是在视频任务中，其中时间冗余普遍存在。先前的方法通常仅在视觉变换器（ViT）内剪枝令牌，适用于单模态感知任务，如动作识别和对象分割，而不适应下游视觉语言任务；或者仅在LLM内剪枝令牌，而保留ViT输出不变，通常需要复杂的文本条件令牌选择机制。在本文中，我们引入了时空令牌评分（STTS），这是一种简单且轻量级的模块，可以在ViT和LLM之间剪枝视觉令牌，无需文本条件或令牌合并，并且完全兼容端到端训练。通过学习如何通过辅助损失学习时间评分以及通过LLM下游梯度学习空间评分，借助我们高效的打包算法，STTS在整个架构中剪枝了50%的视觉令牌，从而在训练和推理过程中效率提高了62%，并且平均性能下降了0.7%。随着每段视频采样帧数的增加，效率提升更加明显。在长视频问答测试时应用缩放进一步提高了0.5-1%的性能，与基线相比。总体而言，STTS代表了一种新颖、简单而有效的统一架构视觉令牌剪枝技术。

Summary / 总结

This paper introduces Spatio-Temporal Token Scoring (STTS), a method for enhancing the efficiency of vision-language models (VLMs) by pruning 50% of vision tokens across both the vision transformer and the language model. STTS learns to score tokens spatially and temporally without text conditioning or token merging, and is compatible with end-to-end training. The approach results in a 62% improvement in efficiency during both training and inference with only a 0.7% drop in performance across 13 video QA tasks, with efficiency gains increasing with more sampled frames per video.

本文通过引入时空令牌评分（STTS）方法解决了视频任务中视觉语言模型（VLMs）的计算效率问题，该方法在视觉变压器和语言模型之间修剪视觉令牌，无需文本条件或令牌合并。STTS 使用辅助损失来按时间评分令牌，并使用语言模型下游梯度来按空间评分，实现了 62% 的效率提升，同时在 13 个视频 QA 任务上的性能下降仅为 0.7%。随着每视频采样帧数的增加，效率提升更加明显，测试时的缩放进一步提高了 0.5-1% 的性能。
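
As a rough illustration of the score-based pruning described for STTS, the sketch below keeps the highest-scoring half of a flat list of vision tokens. The scores are made-up inputs; in the paper they are learned (temporally via an auxiliary loss, spatially via LLM gradients). The helper name `prune_tokens` and the global top-k rule are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def prune_tokens(scores: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Return indices of the tokens to keep, in their original order.

    Hypothetical helper: keeps the top `keep_ratio` fraction of tokens by score.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(scores) * keep_ratio))
    # Rank indices by descending score; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore the original token order so positional structure is preserved.
    return sorted(ranked[:n_keep])


# Toy example: 8 vision tokens with made-up importance scores.
toy_scores = [0.9, 0.1, 0.4, 0.8, 0.05, 0.7, 0.2, 0.3]
kept = prune_tokens(toy_scores, keep_ratio=0.5)
print(kept)  # [0, 2, 3, 5]
```

Returning the kept indices in their original order means a downstream packing step could still rely on token positions; whether STTS packs tokens this way is not stated in the abstract.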

Universal Skeleton Understanding via Differentiable Rendering and MLLMs

Authors: Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu

First: 2026-03-18T17:59:12+00:00 · Latest: 2026-03-18T17:59:12+00:00

Comments: 32 pages, 15 figures

Abs · PDF · Code1 · Code2

Abstract

Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text alignment, or quantize motion into discrete tokens that generalize poorly across heterogeneous skeleton formats. We present SkeletonLLM, which achieves universal skeleton understanding by translating arbitrary skeleton sequences into the MLLM's native visual modality. At its core is DrAction, a differentiable, format-agnostic renderer that converts skeletal kinematics into compact image sequences. Because the pipeline is end-to-end differentiable, MLLM gradients can directly guide the rendering to produce task-informative visual tokens. To further enhance reasoning capabilities, we introduce a cooperative training strategy: Causal Reasoning Distillation transfers structured, step-by-step reasoning from a teacher model, while Discriminative Finetuning sharpens decision boundaries between confusable actions. SkeletonLLM demonstrates strong generalization on diverse tasks including recognition, captioning, reasoning, and cross-format transfer -- suggesting a viable path for applying MLLMs to non-native modalities. Code will be released upon acceptance.

中文标题/摘要

标题：通过可微渲染和MLLM实现通用骨架理解

多模态大型语言模型（MLLMs）在视觉-语言推理方面表现出色，但仍然局限于其原生模态，无法直接处理如人类骨架等结构化、非视觉数据。现有方法要么将骨架动态压缩为有损特征向量以进行文本对齐，要么将运动量化为难以在不同骨架格式之间泛化的离散标记。我们提出了SkeletonLLM，通过将任意骨架序列转换为MLLM的原生视觉模态来实现通用骨架理解。其核心是DrAction，一种格式无关的可微渲染器，将骨骼运动学转换为紧凑的图像序列。由于整个管道是端到端可微的，MLLM的梯度可以直接指导渲染以生成任务相关信息的视觉标记。为了进一步增强推理能力，我们引入了一种协作训练策略：因果推理蒸馏从教师模型中转移结构化的逐步推理，而判别性微调则细化混淆动作之间的决策边界。SkeletonLLM在包括识别、描述、推理和跨格式转移等多种任务上表现出强大的泛化能力——这表明MLLM可以应用于非原生模态的一种可行路径。代码将在接受后发布。

Summary / 总结

The research aims to enable multimodal large language models (MLLMs) to understand human skeletons by translating skeleton sequences into visual modality through a differentiable renderer called DrAction. The method involves cooperative training strategies, including causal reasoning distillation and discriminative finetuning, to enhance reasoning capabilities. Key experimental findings show strong generalization of SkeletonLLM on various tasks such as recognition, captioning, reasoning, and cross-format transfer, suggesting its potential for applying MLLMs to non-native modalities.

研究旨在通过将人体骨架转换为视觉数据，使多模态大型语言模型（MLLMs）能够理解骨架。方法包括使用DrAction，一种可微分渲染器，将骨架运动转换为图像序列。SkeletonLLM结合因果推理蒸馏和辨别性微调，在识别、描述和推理等多种任务上表现出强大的泛化能力，表明MLLMs可能处理非原生模态如结构化数据的一种途径。

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Authors: Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys

First: 2026-03-18T17:59:10+00:00 · Latest: 2026-03-18T17:59:10+00:00

Comments: Project Page: https://kevinqu7.github.io/loc3r-vlm

Abs · PDF · Code1 · Code2 · Project1

Abstract

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm

中文标题/摘要

标题：Loc3R-VLM：基于语言的空间定位与三维推理

多模态大型语言模型（MLLMs）在连接视觉和语言方面取得了显著进展，但仍然难以理解空间关系和视角相关的推理。最近的努力旨在通过几何提示增强输入表示，而不是明确地教会模型在三维空间中进行推理。我们提出了Loc3R-VLM框架，该框架使二维视觉语言模型具备从单目视频输入中获得的高级三维理解能力。受人类空间认知的启发，Loc3R-VLM依赖于两个联合目标：全局布局重建以构建场景结构的整体表示，以及明确的情景建模以锚定自我中心视角。这些目标提供了直接的空间监督，使感知和语言在三维上下文中得到约束。为了确保几何一致性并实现度量尺度对齐，我们利用从预训练的三维基础模型中提取的轻量级相机姿态先验。Loc3R-VLM在基于语言的空间定位方面达到了最先进的性能，并在情境化和通用三维问答基准测试中优于现有的基于二维和视频的方法，证明了我们的空间监督框架能够实现强大的三维理解。项目页面：https://kevinqu7.github.io/loc3r-vlm

Summary / 总结

Loc3R-VLM is a framework that enhances 2D Vision-Language Models with 3D understanding capabilities using monocular video input. It focuses on global layout reconstruction and explicit situation modeling to provide spatial supervision. This approach leads to state-of-the-art performance in language-based localization and outperforms existing methods on 3D question-answering benchmarks.

Loc3R-VLM 是一个框架，通过单目视频输入增强 2D 视觉-语言模型的 3D 理解能力，重点在于全局布局重建和显式情境建模以提升空间认知。它利用轻量级的相机姿态先验以确保几何一致性，并在语言基于的定位和 3D 问答基准测试中达到最先进的性能，超越现有方法。

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

Authors: Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, Zheng Liu

First: 2026-03-18T17:58:25+00:00 · Latest: 2026-03-18T17:58:25+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self-evolution paradigm that preserves successful task solutions as executable subagent code rather than textual experience. Crucially, these subagents are continuously refined based on execution feedback, becoming increasingly robust and efficient as more tasks are encountered. Saved subagents are pure Python code with standardized documentation, enabling portability across any Python-capable system. We demonstrate that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, progressively reducing the effort required for similar tasks without manual intervention. Our implementation is open-sourced at https://github.com/zzatpku/AgentFactory, and our demonstration video is available at https://youtu.be/iKSsuAXJHW0.

中文标题/摘要

标题：AgentFactory：通过可执行子代理积累与重用实现自我演化的框架

基于LLM的代理构建变得越来越重要。最近关于基于LLM的代理自我演化的研究主要记录成功经验为文本提示或反思，这不能可靠地保证在复杂场景中高效地重新执行任务。我们提出了AgentFactory，这是一种新的自我演进范式，将成功任务解决方案保存为可执行的子代理代码，而不是文本经验。关键的是，这些子代理会根据执行反馈不断优化，遇到更多任务时变得越来越稳健和高效。保存的子代理是纯Python代码，具有标准化文档，可以在任何支持Python的系统中实现移植。我们证明了AgentFactory能够实现持续的能力积累：其可执行子代理库随着时间的推移不断增长和改进，逐步减少完成类似任务所需的努力，而无需手动干预。我们的实现已开源在https://github.com/zzatpku/AgentFactory，演示视频可在https://youtu.be/iKSsuAXJHW0找到。

Summary / 总结

The research motivation is to improve the efficiency and reliability of LLM-based agents in complex scenarios. The main method preserves successful task solutions as executable subagent code that is continuously refined based on execution feedback. The authors report that AgentFactory enables continuous capability accumulation: its library of executable subagents grows and improves over time, reducing the effort required for similar tasks without manual intervention.

AgentFactory 是一个自我进化的框架，通过累积和优化可执行子代理来提升基于LLM的代理。它将成功的任务解决方案记录为代码而非文本提示，并根据执行反馈持续改进。这种方法能够实现持续的能力积累，减少处理类似任务所需的努力。该框架已开源，并展示了在无需手动干预的情况下逐步提高处理任务的能力。

Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search

Authors: Sainan Liu, Tz-Ying Wu, Hector A Valdez, Subarna Tripathi

First: 2026-03-17T16:02:38+00:00 · Latest: 2026-03-18T17:58:04+00:00

Comments: 14 pages, 9 figures

Abs · PDF · Code1 · Code2

Abstract

We present Search2Motion, a training-free framework for object-level motion editing in image-to-video generation. Unlike prior methods requiring trajectories, bounding boxes, masks, or motion fields, Search2Motion adopts target-frame-based control, leveraging first-last-frame motion priors to realize object relocation while preserving scene stability without fine-tuning. Reliable target-frame construction is achieved through semantic-guided object insertion and robust background inpainting. We further show that early-step self-attention maps predict object and camera dynamics, offering interpretable user feedback and motivating ACE-Seed (Attention Consensus for Early-step Seed selection), a lightweight search strategy that improves motion fidelity without look-ahead sampling or external evaluators. Noting that existing benchmarks conflate object and camera motion, we introduce S2M-DAVIS and S2M-OMB for stable-camera, object-only evaluation, alongside FLF2V-obj metrics that isolate object artifacts without requiring ground-truth trajectories. Search2Motion consistently outperforms baselines on FLF2V-obj and VBench.

中文标题/摘要

标题：Search2Motion：无需训练的对象级运动控制

我们提出了Search2Motion，一种无需训练的框架，用于图像到视频生成中的对象级运动编辑。与需要轨迹、边界框、掩码或运动场的先前方法不同，Search2Motion 采用目标帧基于的控制，利用首尾帧运动先验来实现对象重新定位，同时保持场景稳定性，无需微调。通过语义引导的对象插入和鲁棒的背景修复，实现了可靠的目标帧构建。我们进一步展示了早期步骤的自我注意力图预测对象和相机动力学，提供可解释的用户反馈，并激发了ACE-Seed（注意力共识用于早期步骤种子选择）这一轻量级搜索策略，该策略在无需前瞻采样或外部评估者的情况下提高了运动保真度。鉴于现有基准混淆了对象和相机运动，我们引入了S2M-DAVIS和S2M-OMB进行稳定相机、仅对象评估，以及FLF2V-obj指标，该指标隔离了对象伪影，无需真实轨迹。Search2Motion 在FLF2V-obj 和 VBench 上均优于基线。

The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

Authors: Yigit Ekin, Yossi Gandelsman

First: 2026-03-18T17:57:53+00:00 · Latest: 2026-03-18T17:57:53+00:00

Comments: Project Page: https://yigitekin.github.io/diffusion-sliders

Abs · PDF · Code1 · Code2 · Project1

Abstract

We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficient to produce smooth edit control. Given a target concept (e.g., enhancing photorealism or changing facial expression), we use a large language model to automatically construct a small set of debiased contrastive prompt pairs, from which we compute a steering vector in the generator's text-encoder space. We then add this vector directly to the input prompt representation to control generation along the desired semantic axis. To obtain a continuous control, we propose an elastic range search procedure that automatically identifies an effective interval of steering magnitudes, avoiding both under-steering (no-edit) and over-steering (changing other attributes). Adding the scaled versions of the same vector within this interval yields smooth and continuous edits. Since our method modifies only textual representations, it naturally generalizes across text-conditioned modalities, including image and video generation. To quantify the steering continuity, we introduce a new evaluation metric that measures the uniformity of semantic change across edit strengths. We compare the continuous editing behavior across methods and find that, despite its simplicity and lightweight design, our approach is comparable to training-based alternatives, outperforming other training-free methods.

中文标题/摘要

标题：基于文本嵌入插值的无训练连续图像操控框架

我们提出了一种无需训练的框架，在测试时对文本条件生成模型进行连续可控的图像编辑。与依赖额外训练或手动用户干预的先前方法不同，我们发现简单的文本嵌入空间中的方向调整足以产生平滑的编辑控制。给定一个目标概念（例如，增强照片真实感或改变面部表情），我们使用大型语言模型自动生成一组去偏见的对比提示对，从中计算生成器文本编码器空间中的一个方向向量。然后，我们将该向量直接添加到输入提示表示中，以沿所需的语义轴控制生成。为了获得连续控制，我们提出了一种弹性范围搜索程序，自动识别有效的方向调整幅度范围，避免过度调整（改变其他属性）和不足调整（无编辑）。在该范围内添加该向量的缩放版本可产生平滑且连续的编辑。由于我们的方法仅修改文本表示，因此自然适用于文本条件的各种模态，包括图像和视频生成。为了量化方向连续性，我们引入了一个新的评估指标，该指标衡量编辑强度下语义变化的均匀性。我们比较了不同方法的连续编辑行为，并发现尽管我们的方法简单且设计轻量，但在性能上与基于训练的替代方法相当，并优于其他无训练方法。

Summary / 总结

The paper presents a training-free framework for continuous and controllable image editing using text embeddings. By steering in the text-embedding space, the authors achieve smooth control over image generation without additional training or manual intervention. They use a large language model to generate prompt pairs and compute a steering vector, which is added to the input prompt to control the generation along desired semantic axes. An elastic range search procedure ensures continuous control by identifying an effective interval for steering magnitudes. The method is applicable across text-conditioned modalities and outperforms other training-free approaches in terms of continuous editing behavior.

论文提出了一种无需训练的框架，利用文本嵌入实现连续可控的图像编辑。通过在文本嵌入空间中进行调整，作者实现了无需额外训练或手动干预的平滑控制。他们使用大型语言模型生成提示对，并计算一个调整向量，将其添加到输入提示中以沿所需语义轴控制生成。提出了弹性范围搜索程序以确保连续且有效的调整。该方法在连续编辑行为方面与基于训练的方法相当，并且在无需训练的方法中表现出色。
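
To make the steering idea concrete, here is a minimal sketch that derives a direction from contrastive prompt-pair embeddings and adds scaled copies of it to a prompt embedding. The abstract does not say how the vector is computed from the pairs; averaging the positive-minus-negative differences is an assumption, the toy 3-d vectors stand in for real text-encoder outputs, and the elastic range search is not reproduced (the sweep below simply uses hand-picked strengths).

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def steering_vector(pairs: Sequence[Tuple[Vector, Vector]]) -> Vector:
    """Mean of (positive - negative) embedding differences over prompt pairs."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for pos, neg in pairs:
        for k in range(dim):
            total[k] += pos[k] - neg[k]
    return [t / len(pairs) for t in total]


def steer(prompt_emb: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the scaled steering direction to a prompt embedding."""
    return [p + alpha * d for p, d in zip(prompt_emb, direction)]


# Toy embeddings (made-up numbers), one (positive, negative) tuple per pair.
pairs = [
    ([1.0, 0.0, 2.0], [0.0, 0.0, 2.0]),
    ([3.0, 1.0, 0.0], [1.0, 1.0, 0.0]),
]
v = steering_vector(pairs)  # [1.5, 0.0, 0.0]
for alpha in (0.0, 0.5, 1.0):  # hand-picked strengths, not a searched interval
    print(alpha, steer([0.2, 0.2, 0.2], v, alpha))
```

Because only the first coordinate differs within each toy pair, the resulting direction leaves the other coordinates untouched, which mirrors the goal of changing one attribute without disturbing others.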

LoST: Level of Semantics Tokenization for 3D Shapes

Authors: Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen

Venue: CVPR 2026

First: 2026-03-18T17:56:06+00:00 · Latest: 2026-03-18T17:56:06+00:00

Comments: CVPR 2026; Project website: https://lost3d.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. State-of-the-art (SOTA) methods primarily rely on geometric level-of-detail (LoD) hierarchies, originally designed for rendering and compression. These spatial hierarchies are often token-inefficient and lack semantic coherence for AR modeling. We propose Level-of-Semantics Tokenization (LoST), which orders tokens by semantic salience, such that early prefixes decode into complete, plausible shapes that possess principal semantics, while subsequent tokens refine instance-specific geometric and semantic details. To train LoST, we introduce Relational Inter-Distance Alignment (RIDA), a novel 3D semantic alignment loss that aligns the relational structure of the 3D shape latent space with that of the semantic DINO feature space. Experiments show that LoST achieves SOTA reconstruction, surpassing previous LoD-based 3D shape tokenizers by large margins on both geometric and semantic reconstruction metrics. Moreover, LoST achieves efficient, high-quality AR 3D generation and enables downstream tasks like semantic retrieval, while using only 0.1%-10% of the tokens needed by prior AR models.

中文标题/摘要

标题：LoST：3D形状的语义分词级别

分词是生成建模各种模态的基本技术。特别是在自回归（AR）模型中，它起着关键作用，这些模型最近成为3D生成的有吸引力的选择。然而，3D形状的最佳分词仍然是一个开放的问题。最先进的（SOTA）方法主要依赖于几何层次细节（LoD）层次结构，这些层次结构最初是为渲染和压缩设计的。这些空间层次结构通常分词效率低下，缺乏AR建模所需的语义连贯性。我们提出了语义分词级别（LoST），它按照语义显著性对分词进行排序，使得早期前缀解码成完整的、合理的形状，具有主要的语义，而后续的分词则细化实例特定的几何和语义细节。为了训练LoST，我们引入了关系互距对齐（RIDA），这是一种新颖的3D语义对齐损失，它将3D形状潜在空间的关系结构与语义DINO特征空间的关系结构对齐。实验表明，LoST在重建方面达到了SOTA水平，与基于LoD的3D形状分词器相比，在几何和语义重建指标上取得了显著的性能提升。此外，LoST实现了高效的高质量AR 3D生成，并使下游任务如语义检索成为可能，同时仅使用了先前AR模型所需分词的0.1%-10%。

Summary / 总结

The paper introduces Level-of-Semantics Tokenization (LoST) for 3D shape generation, addressing the inefficiency of geometric level-of-detail (LoD) hierarchies in autoregressive (AR) models. LoST orders tokens based on semantic salience, allowing early prefixes to decode plausible shapes with principal semantics, and subsequent tokens to refine geometric and semantic details. The authors propose Relational Inter-Distance Alignment (RIDA) to align 3D shape latent space with semantic features. Experiments demonstrate that LoST outperforms previous LoD-based methods in geometric and semantic reconstruction, and enables efficient, high-quality AR 3D generation with fewer tokens.

论文提出了基于语义层次的3D形状标记（LoST），解决了几何层次结构在自回归模型中的效率问题。LoST根据语义显著性对标记进行排序，使得早期解码出合理的形状，后期进行细节精炼。作者提出了关系间距离对齐（RIDA），将3D形状的潜在空间与语义特征空间对齐。实验表明，LoST在几何和语义重建方面均优于之前的层次结构方法，能够高效且高质量地生成3D形状，并且仅需之前自回归模型所需标记的0.1%-10%。
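
One plausible reading of "aligning the relational structure" of two spaces is to compare their pairwise-distance matrices. The sketch below does that with scale-normalized Euclidean distances and a mean squared gap. This is an assumption for illustration only: the abstract does not give the RIDA formulation, and the function names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def pairwise_distances(points: Sequence[Sequence[float]]) -> Matrix:
    """Euclidean distance between every pair of vectors."""
    return [[math.dist(a, b) for b in points] for a in points]


def normalize(dist: Matrix) -> Matrix:
    """Scale a distance matrix so its mean off-diagonal entry is 1."""
    n = len(dist)
    if n < 2:
        raise ValueError("need at least two points")
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    if mean == 0.0:
        return [row[:] for row in dist]  # all points coincide; nothing to scale
    return [[d / mean for d in row] for row in dist]


def relational_alignment_loss(
    latents: Sequence[Sequence[float]], features: Sequence[Sequence[float]]
) -> float:
    """Mean squared gap between the two normalized distance matrices."""
    d_lat = normalize(pairwise_distances(latents))
    d_feat = normalize(pairwise_distances(features))
    n = len(d_lat)
    gaps = [(d_lat[i][j] - d_feat[i][j]) ** 2 for i in range(n) for j in range(n)]
    return sum(gaps) / (n * n)


# Same relational layout at a different scale and dimensionality: loss near 0.
latents = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
print(relational_alignment_loss(latents, features))
```

Comparing distance matrices rather than raw vectors lets the two spaces have different dimensionalities, which is why the toy latents are 2-d and the toy features are 3-d.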

Signal in the Noise: Polysemantic Interference Transfers and Predicts Cross-Model Influence

Authors: Bofan Gong, Shiyang Lai, James Evans, Dawn Song

Venue: ICLR 2026

First: 2025-05-16T18:20:42+00:00 · Latest: 2026-03-18T17:55:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Polysemanticity is pervasive in language models and remains a major challenge for interpretation and model behavioral control. Leveraging sparse autoencoders (SAEs), we map the polysemantic topology of two small models (Pythia-70M and GPT-2-Small) to identify SAE feature pairs that are semantically unrelated yet exhibit interference within models. We intervene at four foci (prompt, token, feature, neuron) and measure induced shifts in the next-token prediction distribution, uncovering polysemantic structures that expose a systematic vulnerability in these models. Critically, interventions distilled from counterintuitive interference patterns shared by two small models transfer reliably to larger instruction-tuned models (Llama-3.1-8B/70B-Instruct and Gemma-2-9B-Instruct), yielding predictable behavioral shifts without access to model internals. These findings challenge the view that polysemanticity is purely stochastic, demonstrating instead that interference structures generalize across scale and family. Such generalization suggests a convergent, higher-order organization of internal representations, which is only weakly aligned with intuition and structured by latent regularities, offering new possibilities for both black-box control and theoretical insight into human and artificial cognition.

中文标题/摘要

标题：信号与噪声：多义性干扰转移与跨模型影响预测

语言模型中的多义性普遍存在，仍然是解释和模型行为控制的主要挑战。利用稀疏自编码器（SAEs），我们将两个小型模型（Pythia-70M和GPT-2-Small）的多义性拓扑映射到识别出的SAE特征对，这些特征对在语义上不相关但在模型内部表现出干扰。我们在四个焦点（提示、标记、特征、神经元）上进行干预，并测量下一个标记预测分布的变化，揭示了多义性结构，暴露了这些模型中的系统性漏洞。关键的是，从两个小型模型共享的反直觉干扰模式中提炼出的干预措施可以可靠地转移到更大规模的指令调优模型（Llama-3.1-8B/70B-Instruct和Gemma-2-9B-Instruct），在不访问模型内部的情况下，产生可预测的行为变化。这些发现挑战了多义性纯粹是随机的观点，证明了干扰结构在规模和家族之间具有泛化性。这种泛化性表明内部表示存在一种收敛的、高级的组织形式，这种组织形式与直觉只有弱关联，并由潜在的规律性所塑造，为黑盒控制和对人类和人工认知的理论洞察提供了新的可能性。

Summary / 总结

The study addresses the challenge of polysemanticity in language models by using sparse autoencoders to map the polysemantic topology of two small models. Interventions at different levels (prompt, token, feature, neuron) reveal systematic vulnerabilities due to interference patterns. These findings show that interventions from smaller models can reliably transfer to larger models, indicating that interference structures generalize across different scales and model families, challenging the stochastic view of polysemanticity.

该研究使用稀疏自编码器映射两个小型模型的语义拓扑，通过在不同层次（提示、标记、特征、神经元）进行干预，识别出可预测地转移到更大模型中的干扰模式，挑战了语义多义性的随机性观点。研究结果表明，内部表示存在一种收敛的、更高层次的组织结构，这种结构在不同模型大小和家族之间具有泛化能力。
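
The abstract says interventions are evaluated by the shift they induce in the next-token prediction distribution, without naming a metric. As a hedged sketch, the code below measures such a shift with KL divergence between a baseline and an intervened distribution. The choice of KL and the toy logits are assumptions, not the paper's protocol.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Made-up logits over a 4-token vocabulary, before and after an intervention.
baseline_logits = [2.0, 1.0, 0.5, -1.0]
intervened_logits = [0.5, 1.0, 2.5, -1.0]
shift = kl_divergence(softmax(baseline_logits), softmax(intervened_logits))
print(f"KL(baseline || intervened) = {shift:.4f}")
```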

GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Authors: Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang

First: 2026-03-18T17:54:35+00:00 · Latest: 2026-03-18T17:54:35+00:00

Comments: Accepted by 3DV 2026. Project Page: https://huajian-zeng.github.io/projects/gmt/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches often rely on 2D or partial 3D representations, limiting their ability to capture full scene geometry and constraining trajectory precision. We present GMT, a multimodal transformer framework that generates realistic and goal-directed object trajectories by jointly leveraging 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. The model represents trajectories as continuous 6-DOF pose sequences and employs a tailored conditioning strategy that fuses geometric, semantic, contextual, and goal-oriented information. Extensive experiments on synthetic and real-world benchmarks demonstrate that GMT outperforms state-of-the-art human motion and human-object interaction baselines, such as CHOIS and GIMO, achieving substantial gains in spatial accuracy and orientation control. Our method establishes a new benchmark for learning-based manipulation planning and shows strong generalization to diverse objects and cluttered 3D environments. Project page: https://huajian-zeng.github.io/projects/gmt/

中文标题/摘要

标题：GMT：面向6自由度物体轨迹合成的多模态变换器

在3D环境中合成可控的6自由度物体操作轨迹对于使机器人能够与复杂场景交互至关重要，但由于需要准确的空间推理、物理可行性以及多模态场景理解，这仍然是一个挑战。现有方法通常依赖于2D或部分3D表示，限制了它们捕捉完整场景几何结构的能力，从而限制了轨迹的精度。我们提出了GMT，这是一种多模态变换器框架，通过联合利用3D边界框几何、点云上下文、语义物体类别和目标末端姿态来生成现实且目标导向的物体轨迹。该模型将轨迹表示为连续的6自由度姿态序列，并采用了一种定制的条件策略，将几何、语义、上下文和目标导向的信息融合在一起。在合成和真实世界基准上的广泛实验表明，GMT在空间精度和姿态控制方面优于最先进的基于人类运动和人机交互的基线方法，如CHOIS和GIMO，实现了显著的提升。我们的方法为基于学习的操纵规划设定了新的基准，并展示了对各种物体和复杂3D环境的强大泛化能力。

Summary / 总结

GMT is a multimodal transformer framework designed to synthesize realistic 6-DOF object manipulation trajectories in 3D scenes by integrating 3D bounding box geometry, point cloud context, semantic object categories, and target end poses. It outperforms existing methods like CHOIS and GIMO in terms of spatial accuracy and orientation control, demonstrating strong generalization to various objects and cluttered environments.

GMT 是一个多模态变压器框架，旨在通过整合 3D 边界框几何、点云上下文、语义对象类别和目标末端姿态来生成现实的 6-DOF 物体操作轨迹。该模型生成连续的 6-DOF 姿态序列，并使用定制的条件策略融合几何、语义、上下文和目标导向的信息。实验表明，GMT 在空间精度和姿态控制方面优于现有基线 CHOIS 和 GIMO，为基于学习的操纵规划设立了新基准，并在各种物体和复杂 3D 环境中展示了强大的泛化能力。

Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models

Authors: Neeraj Gangwar, Suma P Bhat, Nickvash Kani

First: 2025-02-18T13:43:06+00:00 · Latest: 2026-03-18T17:43:04+00:00

Comments: Accepted to LREC 2026

Abs · PDF · Code1 · Code2

Abstract

While large models pre-trained on high-quality data exhibit excellent performance on mathematical reasoning (e.g., GSM8k, MultiArith), it remains challenging to specialize smaller models for these tasks. Common approaches to address this challenge include knowledge distillation from large teacher models and data augmentation (e.g., rephrasing questions and generating synthetic solutions). Despite these efforts, smaller models struggle with arithmetic computations, leading to errors in mathematical reasoning. In this work, we leverage a synthetic arithmetic dataset generated programmatically to enhance the reasoning capabilities of smaller models. We investigate two key approaches to incorporate this dataset: (1) intermediate fine-tuning, in which a model is fine-tuned on the arithmetic dataset before training it on a reasoning dataset, and (2) integrating the arithmetic dataset into an instruction-tuning mixture, allowing the model to learn arithmetic skills alongside general instruction-following abilities. Our experiments on multiple reasoning benchmarks demonstrate that incorporating an arithmetic dataset, whether through targeted fine-tuning or within an instruction-tuning mixture, enhances models' arithmetic capabilities, thereby improving their mathematical reasoning performance.

中文标题/摘要

标题：整合算术学习提高小型模型的数学推理能力

虽然在高质量数据上预训练的大型模型在数学推理任务（例如GSM8k、MultiArith）上表现出色，但让较小的模型专门胜任这些任务仍然具有挑战性。为解决这一挑战，常用的方法包括从大型教师模型进行知识蒸馏和数据增强（例如重新表述问题和生成合成解决方案）。尽管做出了这些努力，但小型模型在算术计算方面仍然存在困难，导致数学推理中的错误。在本研究中，我们利用通过编程生成的合成算术数据集来增强小型模型的推理能力。我们研究了两种关键方法来整合此数据集：（1）中间微调，即在模型在推理数据集上训练之前，先在算术数据集上进行微调；（2）将算术数据集整合到指令微调混合中，使模型在学习算术技能的同时也能学习一般指令遵循能力。我们在多个推理基准上的实验表明，无论是通过目标微调还是在指令微调混合中整合算术数据集，都能增强模型的算术能力，从而提高其数学推理性能。

Summary / 总结

This work addresses the challenge of improving mathematical reasoning in smaller models by integrating an arithmetic dataset. Two approaches are explored: intermediate fine-tuning and integrating the arithmetic dataset into an instruction-tuning mixture. The experiments show that both methods enhance the models' arithmetic capabilities, leading to better performance on mathematical reasoning tasks.

本研究旨在通过整合算术数据集来提高较小模型的数学推理能力。探索了两种方法：在算术数据集上进行中间微调，然后进行推理训练，以及将算术数据集整合到指令微调混合中。实验表明，这两种方法都能增强模型的算术能力，从而提高其数学推理性能。
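
The abstract mentions a synthetic arithmetic dataset "generated programmatically" but not its format. The following is a minimal sketch of what such a generator could look like. The prompt/answer layout, the operand range, the set of operators, and the function name are all assumptions for illustration.

```python
import operator
import random
from typing import Dict, List

# Operators covered by this toy generator (an assumption, not the paper's set).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_arithmetic_examples(
    n: int, max_value: int = 999, seed: int = 0
) -> List[Dict[str, str]]:
    """Generate n prompt/answer pairs such as {'prompt': '12 + 7 =', 'answer': '19'}."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    examples = []
    for _ in range(n):
        a = rng.randint(0, max_value)
        b = rng.randint(0, max_value)
        symbol = rng.choice(sorted(OPS))
        answer = OPS[symbol](a, b)
        examples.append({"prompt": f"{a} {symbol} {b} =", "answer": str(answer)})
    return examples


for example in make_arithmetic_examples(3):
    print(example["prompt"], example["answer"])
```

Under the first approach in the abstract, examples like these would be used for an intermediate fine-tuning stage before the reasoning dataset; under the second, they would be mixed into the instruction-tuning data.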

Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

Authors: Shuyao Shi, Kang G. Shin

First: 2026-03-18T17:42:49+00:00 · Latest: 2026-03-18T17:42:49+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguities in scale and size. This paper significantly enhances MLLMs with egomotion modality data, captured by Inertial Measurement Units (IMUs) concurrently with the video. In particular, we propose a novel framework, called Motion-MLLM, introducing two key components: (1) a cascaded motion-visual keyframe filtering module that leverages both IMU data and visual features to efficiently select a sparse yet representative set of keyframes, and (2) an asymmetric cross-modal fusion module where motion tokens serve as intermediaries that channel egomotion cues and cross-frame visual context into the visual representation. By grounding visual content in physical egomotion trajectories, Motion-MLLM can reason about absolute scale and spatial relationships across the scene. Our extensive evaluation shows that Motion-MLLM makes significant improvements in various tasks related to 3D scene understanding and spatial reasoning. Compared to state-of-the-art (SOTA) methods based on video frames and explicit 3D data, Motion-MLLM exhibits similar or even higher accuracy with significantly less overhead (i.e., 1.40× and 1.63× higher cost-effectiveness, respectively).

中文标题/摘要

标题：感知空间：基于自我运动的视频表示以实现高效准确的三维场景理解

近期多模态大型语言模型（MLLMs）在三维场景的空间推理方面显示出高潜力。然而，它们通常依赖于计算成本高昂的三维表示，如点云或重建的鸟瞰图（BEV）地图，或者缺乏物理基础来解决尺度和大小的歧义。本文通过引入自我运动模态数据显著增强了MLLMs，这些数据由惯性测量单元（IMUs）与视频同步捕获。特别是，我们提出了一种新的框架，称为Motion-MLLM，引入了两个关键组件：（1）级联运动-视觉关键帧过滤模块，利用IMU数据和视觉特征高效地选择稀疏但具有代表性的关键帧集；（2）不对称跨模态融合模块，其中运动标记作为中介，将自我运动线索和跨帧视觉上下文引导到视觉表示中。通过将视觉内容与物理自我运动轨迹相结合，Motion-MLLM 可以在场景中推理绝对尺度和空间关系。我们的广泛评估表明，Motion-MLLM 在各种与三维场景理解和空间推理相关的任务中取得了显著改进。与基于视频帧和显式三维数据的最新方法（SOTA）相比，Motion-MLLM 以显著更低的开销达到了相近甚至更高的准确率（成本效益分别高出1.40倍和1.63倍）。

Summary / 总结

This paper addresses the limitations of Multimodal Large Language Models (MLLMs) in spatial reasoning by integrating egomotion data from Inertial Measurement Units (IMUs) with video data. The proposed Motion-MLLM framework includes a cascaded motion-visual keyframe filtering module and an asymmetric cross-modal fusion module. Key experimental results show that Motion-MLLM improves accuracy in 3D scene understanding tasks and is more cost-effective than state-of-the-art methods based on video frames and explicit 3D data.

本文通过将来自惯性测量单元（IMU）的自我运动数据与视频数据结合，解决了多模态大型语言模型（MLLMs）在空间推理中的局限性。提出的Motion-MLLM框架包括级联运动-视觉关键帧过滤模块和非对称跨模态融合模块。实验结果表明，Motion-MLLM在3D场景理解任务中的准确性更高，同时比依赖于视频帧或显式3D数据的最新方法更具成本效益。

Toward Scalable Automated Repository-Level Datasets for Software Vulnerability Detection

Authors: Amine Lbath

First: 2026-03-18T17:38:35+00:00 · Latest: 2026-03-18T17:38:35+00:00

Comments: Supervisor: Prof. Massih-Reza Amini

Abs · PDF · Code1 · Code2

Abstract

Software vulnerabilities continue to grow in volume and remain difficult to detect in practice. Although learning-based vulnerability detection has progressed, existing benchmarks are largely function-centric and fail to capture realistic, executable, interprocedural settings. Recent repo-level security benchmarks demonstrate the importance of realistic environments, but their manual curation limits scale. This doctoral research proposes an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits, enabling precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. We further investigate an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.

中文标题/摘要

标题：面向软件漏洞检测的可扩展自动化仓库级数据集研究

软件漏洞的数量不断增加，并且在实践中难以检测。尽管基于学习的漏洞检测已经取得进展，但现有的基准测试主要集中在函数层面，无法捕捉到真实的、可执行的、跨过程的环境。最近的仓库级安全基准测试展示了真实环境的重要性，但它们的手动整理限制了规模。本博士研究提出了一种自动化基准生成器，该生成器将真实的漏洞注入到实际的仓库中，并合成可重复的漏洞证明（PoV）利用，从而为训练和评估仓库级漏洞检测代理提供精确标记的数据集。我们进一步研究了注入代理和检测代理之间的对抗共进化循环，以在现实约束下提高鲁棒性。

Summary / 总结

This doctoral research addresses the challenge of detecting software vulnerabilities by proposing an automated benchmark generator that injects realistic vulnerabilities into real-world repositories and synthesizes reproducible proof-of-vulnerability (PoV) exploits. The generator is intended to yield precisely labeled datasets for training and evaluating repo-level vulnerability detection agents. The work also investigates an adversarial co-evolution loop between injection and detection agents to improve robustness under realistic constraints.

该研究旨在通过提出一种自动基准生成器，将现实中的漏洞注入真实世界的代码库并生成可复现的漏洞证明，来解决软件漏洞检测的挑战。方法包括创建精确标注的数据集以训练和评估漏洞检测代理。实验结果表明，在现实约束下，注入和检测代理之间的对抗协同进化循环可以提高鲁棒性。

TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Authors: Pepe Alonso

First: 2026-03-18T17:38:22+00:00 · Latest: 2026-03-18T17:38:22+00:00

Comments: Toolpaper, 7 pages, 3 tables, 1 figure, 1 algorithm. Submitted to ACM AIWare 2026 (Data and Benchmark Track)

Abs · PDF · Code1 · Code2 · Code3

Abstract

AI coding agents can resolve real-world software issues, yet they frequently introduce regressions, breaking tests that previously passed. Current benchmarks focus almost exclusively on resolution rate, leaving regression behavior under-studied. This paper presents TDAD (Test-Driven Agentic Development), an open-source tool and benchmark methodology that combines abstract-syntax-tree (AST) based code-test graph construction with weighted impact analysis to surface the tests most likely affected by a proposed change. Evaluated on SWE-bench Verified with two local models (Qwen3-Coder 30B on 100 instances and Qwen3.5-35B-A3B on 25 instances), TDAD's GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. A surprising finding is that TDD prompting alone increased regressions (9.94%), revealing that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows. All code, data, and logs are publicly available at https://github.com/pepealonso95/TDAD.
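
The idea of an AST-based code-test graph with weighted impact analysis can be illustrated with a small sketch. This is not TDAD's implementation: the toy `SOURCE` module, the helper names `build_call_graph` and `impacted_tests`, and the per-hop `decay` weighting are all assumptions made for illustration, using only Python's standard `ast` module.

```python
import ast

# Toy module standing in for a repository (hypothetical example code).
SOURCE = '''
def parse(x):
    return int(x)

def total(xs):
    return sum(parse(x) for x in xs)

def test_parse():
    assert parse("3") == 3

def test_total():
    assert total(["1", "2"]) == 3

def test_unrelated():
    assert 1 + 1 == 2
'''

def build_call_graph(source):
    """Map each top-level function to the set of simple names it calls."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, ast.FunctionDef):
            graph[node.name] = {
                sub.func.id
                for sub in ast.walk(node)
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name)
            }
    return graph

def impacted_tests(graph, changed, decay=0.5):
    """Score tests by call distance to `changed`.

    A direct call scores 1.0; each extra hop multiplies the score by `decay`
    (an assumed weighting, not the paper's).
    """
    scores = {}
    for test in (name for name in graph if name.startswith("test_")):
        frontier, seen, weight = {test}, {test}, 1.0
        while frontier:
            reached = set()
            for fn in frontier:
                reached |= graph.get(fn, set())
            reached -= seen
            if changed in reached:
                scores[test] = weight
                break
            seen |= reached
            frontier = reached
            weight *= decay
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranking = impacted_tests(build_call_graph(SOURCE), "parse")
print(ranking)
```

In this toy setting, a change to `parse` surfaces `test_parse` first and `test_total` second, while `test_unrelated` is not listed at all. That is the kind of "which tests to verify" context the abstract describes.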

中文标题/摘要

标题：TDAD：基于测试驱动的代理开发 - 通过基于图的影响分析减少AI编码代理的代码回退

AI编码代理可以解决现实世界中的软件问题，但它们经常引入回退，导致之前通过的测试失败。当前的基准测试几乎完全关注解决率，而对回退行为的研究则相对不足。本文介绍了TDAD（基于测试驱动的代理开发），这是一种开源工具和基准方法，结合了基于抽象语法树（AST）的代码-测试图构建与加权影响分析，以揭示最有可能受提议更改影响的测试。TDAD的GraphRAG工作流程在SWE-bench上验证了两个本地模型（Qwen3-Coder 30B在100个实例上和Qwen3.5-35B-A3B在25个实例上），减少了测试级别回退70%（从6.08%到1.82%），并提高了解决率从24%到32%。一个出乎意料的发现是，仅TDD提示增加了回退（9.94%），表明较小的模型从上下文信息（需要验证哪些测试）中受益更多，而不是从程序指令（如何进行TDD）中受益。自主自动改进循环在10个实例子集上将解决率从12%提高到60%，且无回退。这些发现表明，在设计AI代理工具时，呈现上下文信息优于规定程序化工作流程。所有代码、数据和日志均可在https://github.com/pepealonso95/TDAD/上公开获取。

Summary / 总结

TDAD (Test-Driven Agentic Development) is an open-source tool and benchmark methodology that aims to reduce code regressions in AI coding agents through graph-based impact analysis. Evaluated on SWE-bench Verified with two local models, its GraphRAG workflow reduced test-level regressions by 70% (6.08% to 1.82%) and improved resolution from 24% to 32% when deployed as an agent skill. TDD prompting alone increased regressions (9.94%), indicating that smaller models benefit more from contextual information (which tests to verify) than from procedural instructions (how to do TDD). An autonomous auto-improvement loop raised resolution from 12% to 60% on a 10-instance subset with 0% regression. These findings suggest that, for AI agent tool design, surfacing contextual information outperforms prescribing procedural workflows.

TDAD（Test-Driven Agentic Development）是一种工具和基准方法，利用基于图的影响分析来减少AI编码代理中的代码回退。在SWE-bench上使用两个本地模型进行评估时，TDAD的GraphRAG工作流显著减少了70%的测试级别回退，并将解决率从24%提高到32%。研究还发现，TDD提示本身会增加回退，表明较小的模型更受益于上下文信息而非程序性指令。一个自主的自动改进循环进一步将解决率提高到60%，并且在子集实例上没有回退。

Specification-Aware Distribution Shaping for Robotics Foundation Models

Authors: Sadık Bera Yüksel, Derya Aksaray

First: 2026-03-18T17:36:46+00:00 · Latest: 2026-03-18T17:36:46+00:00

Comments: 8 pages, 3 figures

Abs · PDF · Code1 · Code2

Abstract

Robotics foundation models have demonstrated strong capabilities in executing natural language instructions across diverse tasks and environments. However, they remain largely data-driven and lack formal guarantees on safety and satisfaction of time-dependent specifications during deployment. In practice, robots often need to comply with operational constraints involving rich spatio-temporal requirements such as time-bounded goal visits, sequential objectives, and persistent safety conditions. In this work, we propose a specification-aware action distribution optimization framework that enforces a broad class of Signal Temporal Logic (STL) constraints during execution of a pretrained robotics foundation model without modifying its parameters. At each decision step, the method computes a minimally modified action distribution that satisfies a hard STL feasibility constraint by reasoning over the remaining horizon using forward dynamics propagation. We validate the proposed framework in simulation using a state-of-the-art robotics foundation model across multiple environments and complex specifications.
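
For readers unfamiliar with STL, the sketch below shows the standard quantitative (robustness) semantics for two of the requirement types mentioned above: a time-bounded goal visit and a persistent safety condition. It only checks a given trajectory. It does not implement the paper's action-distribution optimization, and the trajectory, goal, obstacle, and thresholds are hypothetical values.

```python
import math

def eventually_reach(traj, goal, radius, t_max):
    """Robustness of 'within t_max steps, get within radius of goal' (max over time)."""
    return max(radius - math.dist(p, goal) for p in traj[: t_max + 1])

def always_avoid(traj, obstacle, margin):
    """Robustness of 'always stay at least margin away from obstacle' (min over time)."""
    return min(math.dist(p, obstacle) - margin for p in traj)

def robustness(traj, goal, obstacle, radius=0.5, margin=1.0, t_max=3):
    """Conjunction of the two requirements: the minimum of their robustness values."""
    return min(
        eventually_reach(traj, goal, radius, t_max),
        always_avoid(traj, obstacle, margin),
    )

# Hypothetical 2D trajectory, goal, and obstacle.
traj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
rho = robustness(traj, goal=(3.0, 0.0), obstacle=(1.0, 2.0))
print(rho)  # a positive value means the specification is satisfied
```

A positive robustness value means the trajectory satisfies the specification with some slack, and a negative value means a violation. A hard feasibility constraint of the kind the abstract mentions could be phrased as requiring this value to stay non-negative over the remaining horizon.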

中文标题/摘要

标题：面向规范的机器人基础模型动作分布优化

机器人基础模型在执行跨多种任务和环境的自然语言指令方面表现出强大的能力。然而，在部署过程中，它们仍然主要依赖数据驱动，缺乏关于安全性和时间依赖规范满足性的形式保证。实际上，机器人经常需要遵守涉及丰富时空要求的操作约束，如时间限制的目标访问、顺序目标和持续的安全条件。在本工作中，我们提出了一种面向规范的动作分布优化框架，在执行预训练的机器人基础模型时强制执行广泛的信号时序逻辑（STL）约束，而不修改其参数。在每个决策步骤中，该方法通过使用前向动力学传播来推理剩余的时域，计算一个最小修改的动作分布，以满足硬STL可行性约束。我们使用最先进的机器人基础模型在多个环境中和复杂规范下在仿真中验证了所提出的框架。

Summary / 总结

This work addresses the challenge of ensuring safety and compliance with time-dependent specifications for robotics foundation models. The authors propose a specification-aware action distribution optimization framework that enforces Signal Temporal Logic (STL) constraints during execution without altering the model parameters. At each decision step, the method computes a minimally modified action distribution to satisfy a hard STL feasibility constraint using forward dynamics propagation. The framework is validated in simulation with a state-of-the-art robotics foundation model across various environments and complex specifications.

研究旨在增强机器人基础模型的安全性和遵守操作约束的能力。提出了一种规格感知的动作分布优化框架，确保模型在不修改其参数的情况下遵循信号时序逻辑（STL）约束。该方法在每一步计算一个修改后的动作分布，以满足硬STL可行性约束，使用前向动力学传播进行推理。模拟实验表明，该框架在多种环境和复杂规格下均有效。

LaDe: Unified Multi-Layered Graphic Media Generation and Decomposition

Authors: Vlad-Constantin Lungu-Stan, Ionut Mironica, Mariana-Iuliana Georgescu

First: 2026-03-18T17:34:07+00:00 · Latest: 2026-03-18T17:34:07+00:00

Comments: 18 pages (main + supp)

Abs · PDF · Code1 · Code2

Abstract

Media design layer generation enables the creation of fully editable, layered design documents such as posters, flyers, and logos using only natural language prompts. Existing methods either restrict outputs to a fixed number of layers or require each layer to contain only spatially continuous regions, causing the layer count to scale linearly with design complexity. We propose LaDe (Layered Media Design), a latent diffusion framework that generates a flexible number of semantically meaningful layers. LaDe combines three components: an LLM-based prompt expander that transforms a short user intent into structured per-layer descriptions that guide the generation, a Latent Diffusion Transformer with a 4D RoPE positional encoding mechanism that jointly generates the full media design and its constituent RGBA layers, and an RGBA VAE that decodes each layer with full alpha-channel support. By conditioning on layer samples during training, our unified framework supports three tasks: text-to-image generation, text-to-layers media design generation, and media design decomposition. We compare LaDe to Qwen-Image-Layered on text-to-layers and image-to-layers tasks on the Crello test set. LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as validated by two VLM-as-a-judge evaluators (GPT-4o mini and Qwen3-VL).

中文标题/摘要

标题：LaDe：统一多层图形媒体生成与分解

媒体设计层生成能够仅通过自然语言提示创建完全可编辑的分层设计文档，如海报、传单和标志。现有方法要么限制输出层数为固定数量，要么要求每层仅包含连续的空间区域，导致层数随设计复杂度线性增加。我们提出LaDe（分层媒体设计），这是一种潜扩散框架，能够生成灵活数量的语义上有意义的分层。LaDe 结合了三个组件：基于LLM的提示扩展器，将简短的用户意图转换为分层结构化的描述，以指导生成；具有4D RoPE位置编码机制的潜扩散变换器，能够同时生成完整的媒体设计及其构成的RGBA分层；以及RGBA VAE，能够支持每个分层的完整alpha通道解码。通过在训练期间条件化分层样本，我们的统一框架支持三个任务：文本到图像生成、文本到分层媒体设计生成以及媒体设计分解。我们在Crello测试集上将LaDe与Qwen-Image-Layered在文本到分层和图像到分层任务上进行比较。LaDe在文本到分层生成方面优于Qwen-Image-Layered，通过两个VLM作为评判者（GPT-4o mini和Qwen3-VL）验证了文本到分层对齐的改进。

Summary / 总结

The research aims to generate and decompose graphic media with flexible layer counts using natural language prompts. LaDe, a latent diffusion framework, combines a prompt expander, a Latent Diffusion Transformer, and an RGBA VAE to generate and decompose media designs. Experiments show that LaDe outperforms Qwen-Image-Layered in text-to-layers generation by improving text-to-layer alignment, as evaluated by VLMs.

研究旨在使用自然语言提示创建可编辑的分层设计文档。LaDe 是一个潜扩散框架，通过结合提示扩展器、潜扩散变换器和RGBA VAE 生成具有语义意义的灵活层数。实验表明，LaDe 在文本到层生成任务中优于 Qwen-Image-Layered，改善了文本到层的对齐，得到了 VLM-as-a-judge 评估者的验证。

ConGA: Guidelines for Contextual Gender Annotation. A Framework for Annotating Gender in Machine Translation

Authors: Argentina Anna Rescigno, Eva Vanmassenhove, Johanna Monti

First: 2026-03-18T17:31:47+00:00 · Latest: 2026-03-18T17:31:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Handling gender across languages remains a persistent challenge for Machine Translation (MT) and Large Language Models (LLMs), especially when translating from gender-neutral languages into morphologically gendered ones, such as English to Italian. English largely omits grammatical gender, while Italian requires explicit agreement across multiple grammatical categories. This asymmetry often leads MT systems to default to masculine forms, reinforcing bias and reducing translation accuracy. To address this issue, we present the Contextual Gender Annotation (ConGA) framework, a linguistically grounded set of guidelines for word-level gender annotation. The scheme distinguishes between semantic gender in English through three tags, Masculine (M), Feminine (F), and Ambiguous (A), and grammatical gender realisation in Italian (Masculine (M), Feminine (F)), combined with entity-level identifiers for cross-sentence tracking. We apply ConGA to the gENder-IT dataset, creating a gold-standard resource for evaluating gender bias in translation. Our results reveal systematic masculine overuse and inconsistent feminine realisation, highlighting persistent limitations of current MT systems. By combining fine-grained linguistic annotation with quantitative evaluation, this work offers both a methodology and a benchmark for building more gender-aware and multilingual NLP systems.

中文标题/摘要

标题：ConGA：背景性别注释指南。一种机器翻译中性别标注的框架

跨语言处理性别问题仍然是机器翻译（MT）和大型语言模型（LLMs）的一个持续挑战，尤其是在从性别中立语言翻译到形态上具有性别的语言时，例如从英语到意大利语。英语基本上省略了语法性别，而意大利语则需要在多个语法类别中进行显式一致。这种不对称性通常导致MT系统默认使用阳性形式，从而强化偏见并降低翻译准确性。为了解决这一问题，我们提出了背景性别注释（ConGA）框架，这是一种基于语言的单词级别性别注释指南。该方案通过三个标签区分英语中的语义性别（阳性M、阴性F、模糊A），以及意大利语中的语法性别实现（阳性M、阴性F），并结合实体级别标识符进行跨句跟踪。我们使用ConGA对gENder-IT数据集进行了处理，创建了一个用于评估翻译中性别偏见的黄金标准资源。我们的结果揭示了系统性的阳性过度使用和不一致的阴性实现，突显了当前MT系统中存在的持续局限性。通过结合精细的语义注释和定量评估，这项工作提供了一种方法和基准，用于构建更具性别意识和多语言的NLP系统。

Summary / 总结

The research aims to address the gender bias in machine translation, particularly from gender-neutral languages like English to gendered languages like Italian. The Contextual Gender Annotation (ConGA) framework is introduced, which provides linguistically grounded guidelines for gender annotation at the word level. The framework distinguishes between semantic and grammatical gender and includes entity-level identifiers for cross-sentence tracking. Applying ConGA to the gENder-IT dataset, the study finds systematic overuse of masculine forms and inconsistent feminine realisation, indicating that current MT systems still struggle with gender bias.

研究针对机器翻译中性别处理的挑战，特别是从性别中立的语言如英语到性别化的语言如意大利语的翻译。研究引入了上下文性别注释（ConGA）框架，提供了基于语言的性别注释指南，区分英语中的语义性别和意大利语中的语法性别。研究将ConGA应用于gENder-IT数据集，揭示了系统性地过度使用男性性别和不一致的女性实现，表明当前MT系统中存在持续的性别偏见。

Den-TP: A Density-Balanced Data Curation and Evaluation Framework for Trajectory Prediction

Authors: Ruining Yang, Yi Xu, Yun Fu, Lili Su

First: 2024-09-25T22:00:11+00:00 · Latest: 2026-03-18T17:31:09+00:00

Comments: Accepted by CVPR2026

Abs · PDF · Code1 · Code2

Abstract

Trajectory prediction in autonomous driving has traditionally been studied from a model-centric perspective. However, existing datasets exhibit a strong long-tail distribution in scenario density, where common low-density cases dominate and safety-critical high-density cases are severely underrepresented. This imbalance limits model robustness and hides failure modes when standard evaluations average errors across all scenarios. We revisit trajectory prediction from a data-centric perspective and present Den-TP, a framework for density-aware dataset curation and evaluation. Den-TP first partitions data into density-conditioned regions using agent count as a dataset-agnostic proxy for interaction complexity. It then applies a gradient-based submodular selection objective to choose representative samples within each region while explicitly rebalancing across densities. The resulting subset reduces the dataset size by 50% yet preserves overall performance and significantly improves robustness in high-density scenarios. We further introduce density-conditioned evaluation protocols that reveal long-tail failure modes overlooked by conventional metrics. Experiments on Argoverse 1 and 2 with state-of-the-art models show that robust trajectory prediction depends not only on data scale, but also on balancing scenario density.
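
The partition-then-rebalance step can be sketched as follows. This is a simplified illustration under stated assumptions: the bin edges are hypothetical, and uniform random sampling stands in for the paper's gradient-based submodular selection, which is not reproduced here.

```python
import random
from collections import defaultdict

def density_bin(agent_count, edges=(10, 30)):
    """Return 0 (low), 1 (medium), or 2 (high) density; the edges are hypothetical."""
    return sum(agent_count >= e for e in edges)

def balanced_subset(scenarios, keep_fraction=0.5, seed=0):
    """Keep up to keep_fraction of the data, split evenly across density bins."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for s in scenarios:
        bins[density_bin(s["agents"])].append(s)
    budget = int(len(scenarios) * keep_fraction)
    per_bin = budget // len(bins)
    subset = []
    for members in bins.values():
        # Random sampling is a placeholder for submodular selection.
        subset.extend(rng.sample(members, min(per_bin, len(members))))
    return subset

# Synthetic long-tailed dataset: 80 low-, 15 medium-, 5 high-density scenarios.
scenarios = (
    [{"id": i, "agents": 5} for i in range(80)]
    + [{"id": 80 + i, "agents": 20} for i in range(15)]
    + [{"id": 95 + i, "agents": 40} for i in range(5)]
)
subset = balanced_subset(scenarios)
counts = {b: sum(density_bin(s["agents"]) == b for s in subset) for b in range(3)}
print(counts)
```

On this synthetic data the high-density share rises from 5% of the full set to 5 of 36 selected scenarios. This sketch does not redistribute budget left unused by small bins, so it selects fewer scenarios than the 50% budget allows.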

中文标题/摘要

标题：Den-TP：一种基于密度平衡的数据整理与评估框架用于轨迹预测

自动驾驶中的轨迹预测传统上是从模型为中心的角度进行研究的。然而，现有的数据集在场景密度上表现出强烈的长尾分布，其中常见的低密度情况占主导地位，而关键的安全高密度情况严重不足。这种不平衡限制了模型的鲁棒性，并且在标准评估中平均所有场景的误差时隐藏了失败模式。我们从数据为中心的角度重新审视轨迹预测，并提出了Den-TP，一种基于密度的数据整理与评估框架。Den-TP 首先使用代理交互复杂性的车辆计数将数据划分为密度条件区域，然后使用基于梯度的子模性选择目标来选择每个区域内的代表性样本，同时明确地重新平衡密度。由此产生的子集将数据集大小减少了50%，但保留了整体性能，并显著提高了在高密度场景中的鲁棒性。我们还引入了基于密度的评估协议，揭示了常规指标所忽视的长尾失败模式。在Argoverse 1和2上的实验表明，鲁棒的轨迹预测不仅依赖于数据规模，还依赖于场景密度的平衡。

Summary / 总结

The paper addresses the imbalance in trajectory prediction datasets, where common low-density scenarios dominate while high-density, safety-critical scenarios are underrepresented. It proposes Den-TP, a framework that partitions data into density-conditioned regions and selects representative samples using a gradient-based submodular selection objective. This approach reduces dataset size by 50% while maintaining overall performance and improving robustness in high-density scenarios. Additionally, it introduces density-conditioned evaluation protocols that highlight long-tail failure modes not captured by conventional metrics.

论文针对轨迹预测数据集中的不平衡问题，即常见低密度场景占主导，而高密度、安全性关键场景严重不足。提出了Den-TP框架，将数据按密度条件分区，并使用基于梯度的子模优化目标选择代表性样本，平衡各密度区间。这导致数据集规模减少50%，但仍保持整体性能，并显著提高高密度场景的鲁棒性。此外，还提出了基于密度的评估协议，以揭示常规指标忽视的长尾失败模式。

Provably Safe Model Updates

Authors: Leo Elmecker-Plakolm, Pierre Fasterling, Philip Sosnin, Calvin Tsay, Matthew Wicker

First: 2025-12-01T17:19:53+00:00 · Latest: 2026-03-18T17:29:55+00:00

Comments: 12 pages, 9 figures. This work has been accepted for publication at SaTML 2026. The final version will be available on IEEE Xplore

Abs · PDF · Code1 · Code2

Abstract

Safety-critical environments are inherently dynamic. Distribution shifts, emerging vulnerabilities, and evolving requirements demand continuous updates to machine learning models. Yet even benign parameter updates can have unintended consequences, such as catastrophic forgetting in classical models or alignment drift in foundation models. Existing heuristic approaches (e.g., regularization, parameter isolation) can mitigate these effects but cannot certify that updated models continue to satisfy required performance specifications. We address this problem by introducing a framework for provably safe model updates. Our approach first formalizes the problem as computing the largest locally invariant domain (LID): a connected region in parameter space where all points are certified to satisfy a given specification. While exact maximal LID computation is intractable, we show that relaxing the problem to parameterized abstract domains (orthotopes, zonotopes) yields a tractable primal-dual formulation. This enables efficient certification of updates - independent of the data or algorithm used - by projecting them onto the safe domain. Our formulation further allows computation of multiple approximately optimal LIDs, incorporation of regularization-inspired biases, and use of lookahead data buffers. Across continual learning and foundation model fine-tuning benchmarks, our method matches or exceeds heuristic baselines for avoiding forgetting while providing formal safety guarantees.
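
When the safe domain is an orthotope (an axis-aligned box in parameter space), projecting an update onto it reduces to clipping each coordinate. The sketch below shows only that projection step. The box bounds are made-up numbers, whereas in the paper they would come from the certified LID computation, which is not shown here.

```python
import numpy as np

def project_onto_box(theta, lower, upper):
    """Euclidean projection onto the box [lower, upper], applied coordinate-wise."""
    return np.clip(theta, lower, upper)

# Hypothetical current parameters and a (pretend) certified box around them.
theta_old = np.array([0.2, -0.1, 0.5])
half_width = np.array([0.05, 0.10, 0.02])
lower, upper = theta_old - half_width, theta_old + half_width

# An update proposed by any optimizer; only the first coordinate leaves the box.
proposed = theta_old + np.array([0.20, -0.03, -0.01])
safe = project_onto_box(proposed, lower, upper)
print(safe)
```

Because the projection only needs the box bounds, it is independent of the data or training algorithm that produced the update, which matches the property the abstract highlights.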

中文标题/摘要

标题：可验证安全的模型更新

关键安全环境本质上是动态的。分布变化、新兴漏洞和不断变化的要求需要持续更新机器学习模型。然而，即使是看似无害的参数更新也可能产生意想不到的后果，例如经典模型中的灾难性遗忘或基础模型中的对齐漂移。现有的启发式方法（例如正则化、参数隔离）可以减轻这些影响，但无法保证更新后的模型继续满足所需性能规范。我们通过引入一个可验证安全的模型更新框架来解决这个问题。我们的方法首先将问题形式化为计算最大的局部不变域（LID）：参数空间中的一个连通区域，其中所有点都得到验证，满足给定的规范。虽然精确的最大LID计算是不可行的，但我们证明将问题松弛到参数化抽象域（正交体、zonotope）可以得到一个可解的对偶公式。这使得独立于所用数据或算法，可以通过将更新投影到安全域来进行高效的验证。我们的形式化还允许计算多个近似最优的LID，纳入正则化启发式的偏差，并使用前瞻数据缓冲区。在持续学习和基础模型微调基准测试中，我们的方法在避免遗忘方面与启发式基线相当或更优，并提供了正式的安全保证。

Summary / 总结

The research addresses the challenge of safely updating machine learning models in dynamic environments, where distribution shifts and evolving requirements necessitate continuous model updates. It introduces a framework that computes the largest locally invariant domain (LID) to ensure that updated models meet required performance specifications. The method relaxes the problem to parameterized abstract domains, enabling efficient certification of updates and providing formal safety guarantees. Experiments show that the approach matches or exceeds heuristic baselines in avoiding forgetting across various benchmarks.

论文解决了动态环境中持续更新机器学习模型带来的安全挑战，这些环境中的分布变化和不断变化的要求需要持续更新模型。它提出了一种框架，通过计算最大的局部不变域（LID）来确保更新后的模型满足性能规范。该方法将问题松弛到参数化的抽象域，从而能够高效地认证更新，并提供正式的安全保证。实验结果显示，该方法在避免遗忘方面与启发式基线相当或更优，同时确保了安全性。

Learning Over Dirty Data with Minimal Repairs

Authors: Cheng Zhen, Prayoga, Nischal Aryal, Arash Termehchy, Garrett Biwer, Lubna Alzamil

First: 2025-03-18T05:36:59+00:00 · Latest: 2026-03-18T17:28:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Missing data often exists in real-world datasets, requiring significant time and effort for data repair to learn accurate models. In this paper, we show that imputing all missing values is not always necessary to achieve an accurate ML model. We introduce concepts of minimal and almost minimal repair, which are subsets of missing data items in training data whose imputation delivers accurate and reasonably accurate models, respectively. Imputing these subsets can significantly reduce the time, computational resources, and manual effort required for learning. We show that finding these subsets is NP-hard for some popular models and propose efficient approximation algorithms for a wide range of models. Our extensive experiments indicate that our proposed algorithms can substantially reduce the time and effort required to learn on incomplete datasets.

中文标题/摘要

标题：学习脏数据中的知识需要最少的修复

实际世界的数据集中经常存在缺失数据，这需要大量时间和努力进行数据修复以学习准确的模型。在本文中，我们展示了填充所有缺失值并非总是必要的，以获得准确的机器学习模型。我们引入了最小修复和几乎最小修复的概念，它们分别是训练数据中缺失数据项的子集，其填充可以分别提供准确和相对准确的模型。填充这些子集可以显著减少学习所需的时间、计算资源和人工努力。我们证明了对于一些流行的模型，找到这些子集是NP难问题，并提出了适用于广泛模型的高效近似算法。我们的大量实验表明，我们提出的算法可以显著减少在不完整数据集上学习所需的时间和努力。

Summary / 总结

This paper addresses the challenge of missing data in real-world datasets by introducing the concepts of minimal and almost minimal repair. These are subsets of missing data items whose imputation can lead to accurate and reasonably accurate models, respectively. The authors propose efficient approximation algorithms for various models, demonstrating that these methods can significantly reduce the time and effort needed for learning on incomplete datasets.

该论文通过引入最小修复和几乎最小修复的概念，解决了现实世界数据集中缺失数据的挑战。作者表明，只需填充缺失值的一部分而非全部，即可获得准确的机器学习模型。他们为各种模型提出了高效的近似算法，并通过实验表明，这些方法可以显著减少在不完整数据集上进行学习所需的时间和努力。

Gender Disambiguation in Machine Translation: Diagnostic Evaluation in Decoder-Only Architectures

Authors: Chiara Manna, Hosein Mohebbi, Afra Alishahi, Frédéric Blain, Eva Vanmassenhove

First: 2026-03-18T17:26:36+00:00 · Latest: 2026-03-18T17:26:36+00:00

Abs · PDF · Code1 · Code2

Abstract

While Large Language Models achieve state-of-the-art results across a wide range of NLP tasks, they remain prone to systematic biases. Among these, gender bias is particularly salient in MT, due to systematic differences across languages in whether and how gender is marked. As a result, translation often requires disambiguating implicit source signals into explicit gender-marked forms. In this context, standard benchmarks may capture broad disparities but fail to reflect the full complexity of gender bias in modern MT. In this paper, we extend recent frameworks on bias evaluation by: (i) introducing a novel measure coined "Prior Bias", capturing a model's default gender assumptions, and (ii) applying the framework to decoder-only MT models. Our results show that, despite their scale and state-of-the-art status, decoder-only models do not generally outperform encoder-decoder architectures on gender-specific metrics; however, post-training (e.g., instruction tuning) not only improves contextual awareness but also reduces the masculine Prior Bias.

中文标题/摘要

标题：机器翻译中的性别消歧：解码器-only架构中的诊断评估

尽管大型语言模型在广泛的语言处理任务中取得了最先进的成果，但它们仍然容易受到系统性偏见的影响。其中，性别偏见在机器翻译中尤为突出，因为不同语言在性别标记方面存在系统性差异。因此，翻译往往需要将隐含的源信号消歧为明确的性别标记形式。在这种背景下，标准基准可能捕捉到广泛的差异，但未能反映现代机器翻译中性别偏见的全部复杂性。在本文中，我们通过以下方式扩展了最近的偏见评估框架：(i) 引入了一种新的度量标准“先验偏见”，捕捉模型的默认性别假设，(ii) 将框架应用于解码器-only模型。我们的结果显示，尽管它们的规模和最先进的地位，解码器-only模型在性别特定指标上并不普遍优于编码器-解码器架构；然而，后训练（例如，指令调优）不仅提高了上下文意识，还减少了男性先验偏见。

Summary / 总结

This paper studies gender bias in machine translation, focusing on decoder-only architectures. It extends recent bias-evaluation frameworks with a new measure, "Prior Bias", which captures a model's default gender assumptions, and applies the framework to decoder-only MT models. The study finds that, despite their scale and state-of-the-art status, decoder-only models do not generally outperform encoder-decoder models on gender-specific metrics. However, post-training such as instruction tuning both improves contextual awareness and reduces the masculine Prior Bias.

本文探讨了机器翻译中的性别偏见问题，特别是在解码器-only架构中的表现。研究引入了一种新的衡量标准‘Prior Bias’，用于捕捉模型的默认性别假设，并结合现有基准进行了评估。研究发现，尽管这些模型规模庞大且性能先进，但在性别特定的指标上，它们通常不如编码器-解码器模型表现好。然而，通过指令调优等后训练技术可以提高上下文意识并减少男性Prior Bias。

VideoAtlas: Navigating Long-Form Video in Logarithmic Compute

Authors: Mohamed Eltahir, Ali Habibullah, Yazan Alshoibi, Lama Ayash, Tanveer Hussain, Naeemullah Khan

First: 2026-03-18T17:20:19+00:00 · Latest: 2026-03-18T17:20:19+00:00

Abs · PDF · Code1 · Code2

Abstract

Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance, and any region can be recursively zoomed into, with the same visual representation used uniformly for the video, intermediate investigations, and the agent's memory, eliminating lossy text conversion end-to-end. This hierarchical structure ensures access depth grows only logarithmically with video length. For long-context, Recursive Language Models (RLMs) recently offered a powerful solution for long text, but extending them to the visual domain requires a structured environment to recurse into, which VideoAtlas provides. VideoAtlas as a Markov Decision Process unlocks Video-RLM: a parallel Master-Worker architecture where a Master coordinates global exploration while Workers concurrently drill into assigned regions to accumulate lossless visual evidence. We demonstrate three key findings: (1) logarithmic compute growth with video duration, further amplified by a 30-60% multimodal cache hit rate arising from the grid's structural reuse; (2) environment budgeting, where bounding the maximum exploration depth provides a principled compute-accuracy hyperparameter; (3) emergent adaptive compute allocation that scales with question granularity. When scaling from 1-hour to 10-hour benchmarks, Video-RLM remains the most duration-robust method with minimal accuracy degradation, demonstrating that structured environment navigation is a viable and scalable paradigm for video understanding.
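
The logarithmic access depth follows from the grid hierarchy: if each view shows a fixed number of cells and each cell can be zoomed into, the number of zoom steps needed to isolate a single frame grows with the logarithm of the frame count. The sketch below is a back-of-the-envelope calculation only. The 64 cells per view (an 8x8 grid) and the sampling rate of one frame per second are assumptions, not values taken from the paper.

```python
def access_depth(num_frames, cells_per_view=64):
    """Number of zoom levels needed until one cell corresponds to a single frame."""
    depth, span = 0, 1
    while span < num_frames:
        span *= cells_per_view  # each extra level multiplies the reachable frames
        depth += 1
    return depth

# Assumed sampling rate: 1 frame per second.
one_hour = access_depth(1 * 3600)
ten_hours = access_depth(10 * 3600)
print(one_hour, ten_hours)
```

Under these assumptions, a tenfold increase in video length adds only one zoom level.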

中文标题/摘要

标题：VideoAtlas：在对数计算中导航长视频

将语言模型扩展到视频引入了两个挑战：表示，其中现有方法依赖于有损近似；以及长上下文，其中字幕或代理管道将视频压缩为文本并失去视觉保真度。为了解决这个问题，我们引入了**VideoAtlas**，这是一种任务无关的环境，将视频表示为同时无损、可导航、可扩展且无需字幕和预处理的分层网格。视频的概览一目了然，任何区域都可以递归放大，使用相同的视觉表示用于视频、中间调查和代理的记忆，从而在整个过程中消除有损文本转换。这种分层结构确保访问深度仅以视频长度的对数增长。对于长上下文，递归语言模型（RLMs）最近为长文本提供了一种强大的解决方案，但将其扩展到视觉领域需要一个结构化的环境来进行递归，这正是**VideoAtlas**提供的。**VideoAtlas**作为马尔可夫决策过程解锁了Video-RLM：一种并行的主从架构，其中主节点协调全局探索，而工人同时钻入分配的区域以积累无损视觉证据。我们展示了三个关键发现：(1) 视频长度与计算增长呈对数关系，进一步受到30-60%多模态缓存命中率的影响，这是由于网格结构的重用。(2) 环境预算，其中限制最大探索深度提供了一个有原则的计算-准确度超参数。(3) 适应性计算分配的出现，其与问题的粒度成比例。当从1小时扩展到10小时基准时，Video-RLM仍然是最具有长度鲁棒性的方法，准确度下降最小，这表明结构化环境导航是视频理解的一种可行且可扩展的范式。

Summary / 总结

VideoAtlas is designed to address the challenges of representing long-form video and maintaining visual fidelity by introducing a hierarchical grid representation that is lossless and scalable. The system uses a Markov Decision Process to enable a Master-Worker architecture for efficient exploration and evidence gathering. Key findings include logarithmic compute growth with video duration, a principled compute-accuracy hyperparameter through environment budgeting, and adaptive compute allocation based on question granularity. Video-RLM maintains robust accuracy even when scaling from 1-hour to 10-hour benchmarks.

VideoAtlas通过引入无损且可导航的层次网格表示解决了长视频的表示和保持视觉保真度的挑战。它使用马尔可夫决策过程来实现高效的探索和证据积累。关键发现包括随视频时长呈对数增长的计算增长，通过环境预算实现计算-准确性的原则性超参数，以及基于问题粒度的自适应计算分配。当从1小时扩展到10小时基准时，Video-RLM在保持最小准确度下降的情况下表现出最稳健的性能，证明了结构化环境导航是视频理解的一种可行且可扩展的范式。

Unified Policy Value Decomposition for Rapid Adaptation

Authors: Cristiano Capone, Luca Falorsi, Andrea Ciardiello, Luca Manneschi

First: 2026-03-18T17:19:56+00:00 · Latest: 2026-03-18T17:19:56+00:00

Abs · PDF · Code1 · Code2

Abstract

Rapid adaptation in complex control systems remains a central challenge in reinforcement learning. We introduce a framework in which policy and value functions share a low-dimensional coefficient vector - a goal embedding - that captures task identity and enables immediate adaptation to novel tasks without retraining representations. During pretraining, we jointly learn structured value bases and compatible policy bases through a bilinear actor-critic decomposition. The critic factorizes as Q = sum_k G_k(g) y_k(s,a), where G_k(g) is a goal-conditioned coefficient vector and y_k(s,a) are learned value basis functions. This multiplicative gating - where a context signal scales a set of state-dependent bases - is reminiscent of gain modulation observed in Layer 5 pyramidal neurons, where top-down inputs modulate the gain of sensory-driven responses without altering their tuning. Building on Successor Features, we extend the decomposition to the actor, which composes a set of primitive policies weighted by the same coefficients G_k(g). At test time the bases are frozen and G_k(g) is estimated zero-shot via a single forward pass, enabling immediate adaptation to novel tasks without any gradient update. We train a Soft Actor-Critic agent on the MuJoCo Ant environment under a multi-directional locomotion objective, requiring the agent to walk in eight directions specified as continuous goal vectors. The bilinear structure allows each policy head to specialize to a subset of directions, while the shared coefficient layer generalizes across them, accommodating novel directions by interpolating in goal embedding space. Our results suggest that shared low-dimensional goal embeddings offer a general mechanism for rapid, structured adaptation in high-dimensional control, and highlight a potentially biologically plausible principle for efficient transfer in complex reinforcement learning systems.
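
The critic factorization Q = sum_k G_k(g) y_k(s,a) can be written out directly. In the sketch below, the random linear maps `W_goal` and `W_basis` are stand-ins for the learned goal-embedding and value-basis networks, and the dimensions and `tanh` nonlinearity are arbitrary choices for illustration. Nothing here reproduces the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
K, goal_dim, sa_dim = 4, 2, 6             # number of bases and input sizes (assumed)
W_goal = rng.normal(size=(K, goal_dim))   # stand-in for the goal-embedding network
W_basis = rng.normal(size=(K, sa_dim))    # stand-in for the K value basis functions

def coefficients(goal):
    """G(g): one coefficient per basis, computed from the goal vector."""
    return np.tanh(W_goal @ goal)

def value_bases(state_action):
    """y(s, a): K basis values for a concatenated state-action vector."""
    return W_basis @ state_action

def q_value(goal, state_action):
    """Bilinear critic: dot product of goal coefficients and value bases."""
    return float(coefficients(goal) @ value_bases(state_action))

goal = np.array([1.0, 0.0])               # e.g. a locomotion direction
state_action = rng.normal(size=sa_dim)
q = q_value(goal, state_action)
print(q)
```

Adapting to a new goal changes only `coefficients(goal)`, while the bases stay fixed. This is the property that allows a single forward pass to stand in for retraining at test time.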

中文标题/摘要

标题：统一策略价值分解以实现快速适应

在复杂控制系统中实现快速适应仍然是强化学习中的核心挑战。我们提出了一种框架，在该框架中，策略和价值函数共享一个低维系数向量——目标嵌入，该向量捕捉任务身份并使模型能够无需重新训练表示即刻适应新任务。在预训练过程中，我们通过双线性演员-评论分解联合学习结构化价值基和兼容的策略基。评论因子化为Q = ∑_k G_k(g) y_k(s,a)，其中G_k(g)是目标条件系数向量，y_k(s,a)是学习的价值基函数。这种乘法门控——上下文信号缩放一组状态依赖基——类似于在层5锥形神经元中观察到的增益调制现象，其中上行输入调节感觉驱动响应的增益而不改变其调谐。基于后继特征，我们将分解扩展到演员，该演员由一组加权相同系数G_k(g)的原始策略组成。在测试时，基底冻结，G_k(g)通过单次前向传播零样本估计，从而无需任何梯度更新即可立即适应新任务。我们在MuJoCo蚂蚁环境中训练一个软演员-评论家代理，目标是在八个方向上实现多方向行走，这些方向以连续的目标向量指定。双线性结构允许每个策略头专门化于一组方向，而共享的系数层则在它们之间泛化，通过在目标嵌入空间内插来适应新方向。我们的结果表明，共享的低维目标嵌入提供了一种在高维控制中实现快速、结构化适应的一般机制，并突显了在复杂强化学习系统中高效迁移的一种潜在生物合理原则。

Summary / 总结

The paper addresses the challenge of rapid adaptation in complex control systems through a unified policy and value function framework. During pretraining, a bilinear actor-critic decomposition is used to learn structured value bases and compatible policy bases, with a shared goal embedding capturing task identity. At test time, the bases are frozen and the goal embedding is estimated zero-shot, allowing immediate adaptation to novel tasks. The method is validated on a MuJoCo Ant environment, demonstrating that shared low-dimensional goal embeddings enable rapid and structured adaptation in high-dimensional control tasks.

论文提出了一种统一框架，其中策略和价值函数共享一个低维系数向量，称为目标嵌入，以解决复杂控制系统的快速适应问题。该框架通过双线性演员-评论家分解在预训练阶段联合学习结构化的价值基和兼容的策略基。实验结果表明，这种方法能使Soft Actor-Critic代理快速适应MuJoCo Ant环境中的新运动方向，证明了共享目标嵌入在高维控制任务中实现快速适应的有效性。

CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Authors: Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song

Venue: ICLR 2026

First: 2026-03-18T17:18:35+00:00 · Latest: 2026-03-18T17:18:35+00:00

Comments: Accepted at ICLR 2026. Conference paper. 10 pages main text; 34 pages total including references and appendix. 11 figures and 20 tables in total

Abs · PDF · Code1 · Code2

Abstract

Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model's accuracy.
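
The difference between weight-only and activation-aware low-rank approximation can be shown on a toy matrix. The sketch below uses a generic covariance-whitened SVD, a well-known construction that minimizes the error on the projected activations. It is an assumed stand-in for step (i) and is not CARE's actual factorization. The rank-allocation and KV-parity steps are not covered.

```python
import numpy as np

def plain_svd_lowrank(W, r):
    """Best rank-r approximation of W in Frobenius norm (weight-only)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]

def activation_aware_lowrank(W, X, r, eps=1e-6):
    """Rank-r approximation chosen to minimize the error of X @ W, not of W itself."""
    C = X.T @ X / len(X) + eps * np.eye(X.shape[1])  # regularized input covariance
    L = np.linalg.cholesky(C)                        # C = L @ L.T
    U, S, Vt = np.linalg.svd(L.T @ W, full_matrices=False)
    M_r = (U[:, :r] * S[:r]) @ Vt[:r]                # truncate in whitened space
    return np.linalg.solve(L.T, M_r)                 # map back to weight space

rng = np.random.default_rng(0)
# Anisotropic synthetic activations: some input directions matter far more.
X = rng.normal(size=(512, 16)) * np.linspace(0.1, 3.0, 16)
W = rng.normal(size=(16, 8))
r = 4

err_plain = np.linalg.norm(X @ W - X @ plain_svd_lowrank(W, r))
err_aware = np.linalg.norm(X @ W - X @ activation_aware_lowrank(W, X, r))
print(err_plain, err_aware)
```

At the same rank, the activation-aware variant cannot do worse than the weight-only one on the output error `||X W - X W_hat||`. This is the intuition behind aligning the approximation with input activations rather than with the weights alone.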

中文标题/摘要

标题：CARE：协方差感知和秩增强分解以实现多头潜在注意力

将预训练的注意力模块（如分组查询注意力GQA）转换为多头潜在注意力MLA，可以在不增加KV缓存成本的情况下提高表达能力，使其在高效推理方面具有吸引力。然而，许多实际的转换基线依赖于权重的低秩近似（例如SVD风格的初始化）和均匀的秩分配。它们专注于最小化权重矩阵之间的差异，而不是这些权重如何影响输入激活，忽略了激活的协方差结构，并在各层之间强制执行均匀的秩，导致激活漂移和注意力保真度下降。为了解决这些问题，我们提出了CARE，一种在固定KV宽度下的协方差感知和秩增强MLA转换管道。CARE引入了三个关键步骤：（i）激活保持因子分解，使近似与实际输入激活对齐，而不仅仅是权重；（ii）调整后的秩分配，通过给需要最多容量的层分配更多的预算来在各层之间分配固定的KV预算；（iii）KV一致性映射，重新参数化转换后的K和V以适应MLA格式，同时保持KV缓存大小不变。我们的方法在Qwen3-4B/30B-A3B-Instruct-2507和Llama-3.1-8B/70B-Instruct上优于均匀秩SVD基线，将单次困惑度降低多达215倍，并在匹配的KV预算下将平均准确性提高多达1.70倍。通过简短的后SVD修复微调，我们完全恢复了原始模型的准确性。

Summary / 总结

The research converts pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA), which can improve expressivity without increasing KV-cache cost. CARE, a Covariance-Aware, Rank-Enhanced conversion pipeline, addresses the shortcomings of weight-only low-rank approximations and uniform rank allocation. It introduces activation-preserving factorization, adjusted-rank allocation, and KV-parity mapping. On Qwen3 and Llama-3.1 models, CARE outperforms a uniform-rank SVD baseline, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets.

研究旨在将分组查询注意力（GQA）等预训练注意力模块转换为多头潜在注意力（MLA），在不增加KV缓存成本的情况下增强表达能力。CARE是一种协方差感知、秩增强的转换方法，引入了激活保持因子分解、调整秩分配和KV一致性映射来解决现有方法的局限性。CARE在匹配的KV预算下，将单次困惑度降低最多215倍，平均准确性提高最多1.70倍，优于均匀秩SVD基线。
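The idea of spreading a fixed rank budget unevenly across layers can be illustrated with a small greedy allocator. This is a minimal sketch under stated assumptions, not the allocation rule from the paper: the function name `allocate_ranks`, the toy spectra, and the choice of squared singular value as the "need" signal are all hypothetical.

```python
import heapq

def allocate_ranks(spectra, total_rank, min_rank=1):
    """Greedily split total_rank across layers.

    spectra: per-layer singular values, each list sorted in descending order.
    Each extra unit of rank goes to the layer whose next singular value
    would capture the most energy (squared singular value).
    """
    ranks = [min(min_rank, len(s)) for s in spectra]
    remaining = total_rank - sum(ranks)
    if remaining < 0:
        raise ValueError("total_rank is smaller than the minimum allocation")
    # Max-heap (via negated keys) on the energy of the next unused direction.
    heap = [(-s[r] ** 2, i) for i, (s, r) in enumerate(zip(spectra, ranks)) if r < len(s)]
    heapq.heapify(heap)
    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        ranks[i] += 1
        remaining -= 1
        if ranks[i] < len(spectra[i]):
            heapq.heappush(heap, (-spectra[i][ranks[i]] ** 2, i))
    return ranks

# Toy spectra (illustrative values only).
spectra = [
    [9.0, 1.0, 0.5, 0.1],  # energy concentrated in one direction
    [4.0, 3.5, 3.0, 2.5],  # flat spectrum, benefits from more rank
    [6.0, 2.0, 0.2, 0.1],
]
ranks = allocate_ranks(spectra, total_rank=6)
print(ranks)
```

With the same total budget, a uniform split would give every layer rank 2, whereas the greedy split moves capacity toward the layer with the flattest spectrum.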

100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models

Authors: Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Kadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou

First: 2026-03-16T22:42:45+00:00 · Latest: 2026-03-18T17:17:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low cost analytics and database applications to benefit from AI queries. The approach delivers >100x cost and latency reduction for the semantic filter ($AI.IF$) operator and also important gains for semantic ranking ($AI.RANK$). The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve accuracy and occasionally improve accuracy across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture within Google BigQuery for this approach for purely online (ad hoc) queries, and a low-latency HTAP database-friendly architecture in AlloyDB that could further improve the latency by moving the proxy model training offline. We present techniques that accelerate the proxy model training.

中文标题/摘要

标题：100倍成本与延迟降低：轻量级代理模型在AI查询近似中的性能分析

若干数据仓库和数据库提供商最近引入了名为AI查询的SQL扩展，使用户能够在SQL中指定由LLM评估的函数和条件，从而极大地扩展了在结构化与非结构化数据组合之上可以表达的查询类型。LLM提供了出色的语义推理能力，使其成为混合结构化和非结构化数据的复杂、细腻查询的重要工具。虽然非常强大，但这些AI查询在被数千次调用时可能会变得极其昂贵。本文对一种最近的AI查询近似方法进行了全面评估，该方法使低成本分析和数据库应用程序能够受益于AI查询。该方法在语义过滤（$AI.IF$）操作符上实现了超过100倍的成本和延迟降低，并且在语义排名（$AI.RANK$）方面也取得了重要收益。成本和性能的提升来自于在嵌入向量之上使用廉价且准确的代理模型。我们展示了尽管在延迟和成本方面取得了巨大的提升，这些代理模型仍然保持了准确性，并且在各种基准数据集中偶尔提高了准确性，包括扩展的亚马逊评论基准数据集，该数据集包含1000万行数据。我们为这种方法在Google BigQuery中提供了一个适合OLAP的架构，用于纯在线（即席）查询，并在AlloyDB中提供了一个低延迟HTAP数据库友好的架构，通过将代理模型训练转为离线进行来进一步降低延迟。我们介绍了加速代理模型训练的技术。

Summary / 总结

This paper evaluates an AI query approximation approach that uses lightweight proxy models to reduce the cost and latency of AI queries in data warehouses and databases. The proxy models operate over embedding vectors and deliver more than 100x cost and latency reduction for the semantic filter operator, along with important gains for semantic ranking. The proxy models preserve accuracy, and occasionally improve it, across various benchmark datasets, including the extended Amazon reviews benchmark with 10 million rows. The paper presents an OLAP-friendly architecture within Google BigQuery for purely online (ad hoc) queries and a low-latency HTAP database-friendly architecture in AlloyDB, which can further reduce latency by moving proxy model training offline.

该论文评估了一种使用轻量级代理模型来降低数据仓库和数据库中AI查询成本和延迟的方法。该方法基于嵌入向量，在语义过滤操作上实现了超过100倍的成本和延迟降低，并在语义排名操作上取得了重要收益。代理模型在包括扩展后的亚马逊评论基准数据集（包含1000万行）在内的各种基准数据集上保持甚至提高了准确性。该方法在Google BigQuery中实现了OLAP友好的架构，并在AlloyDB中实现了低延迟的HTAP数据库友好架构，后者通过将代理模型的训练转为离线进行来进一步降低延迟。
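The proxy-model idea can be sketched in a few lines: label a small sample of rows with the expensive predicate, fit a cheap classifier on their embeddings, then score every row with the classifier. Everything below is a toy illustration under stated assumptions. `expensive_llm_filter` is a hypothetical stand-in for an LLM-evaluated condition, the 2-D "embeddings" are synthetic, and the logistic-regression proxy is a generic choice, not the model or architecture described in the paper.

```python
import math
import random

LLM_CALLS = 0

def _oracle(embedding):
    # Hidden "semantic" rule the LLM is assumed to implement (toy assumption).
    return 1 if embedding[0] + embedding[1] > 0 else 0

def expensive_llm_filter(embedding):
    """Hypothetical stand-in for an LLM-evaluated predicate; counts its calls."""
    global LLM_CALLS
    LLM_CALLS += 1
    return _oracle(embedding)

def train_proxy(samples, labels, epochs=300, lr=0.5):
    """Fit a logistic-regression proxy with full-batch gradient descent."""
    dim, n = len(samples[0]), len(samples)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def proxy_filter(w, b, embedding):
    return 1 if sum(wi * xi for wi, xi in zip(w, embedding)) + b > 0 else 0

rng = random.Random(0)
rows = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(2000)]

# Only a 10% sample is sent to the expensive predicate.
sample_idx = rng.sample(range(len(rows)), 200)
train_x = [rows[i] for i in sample_idx]
train_y = [expensive_llm_filter(x) for x in train_x]
w, b = train_proxy(train_x, train_y)
calls_used = LLM_CALLS

# Every row is then scored by the cheap proxy alone.
predictions = [proxy_filter(w, b, x) for x in rows]
accuracy = sum(p == _oracle(x) for p, x in zip(predictions, rows)) / len(rows)
print(calls_used, round(accuracy, 3))
```

In this toy setup the saving comes purely from the sampling ratio. The reported gains in the paper depend on the real workloads and systems it evaluates.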

A Comprehensive Benchmark of Histopathology Foundation Models for Kidney Digital Pathology Images

Authors: Harishwar Reddy Kasireddy, Patricio S. La Rosa, Akshita Gupta, Anindya S. Paul, Jamie L. Fermin, William L. Clapp, Meryl A. Waldman, Tarek M. El-Ashkar, Sanjay Jain, Luis Rodrigues, Kuang Yu Jen, Avi Z. Rosenberg, Michael T. Eadon, Jeffrey B. Hodgin, Pinaki Sarder

First: 2026-03-16T22:37:43+00:00 · Latest: 2026-03-18T17:17:27+00:00

Comments: 31 Pages, 14 Tables, 12 figures, Co-correspondence to jhodgin@med.umich.edu and pinaki.sarder@ufl.edu

Abs · PDF · Code1 · Code2 · Project1

Abstract

Histopathology foundation models (HFMs), pretrained on large-scale cancer datasets, have advanced computational pathology. However, their applicability to non-cancerous chronic kidney disease remains underexplored, despite coexistence of renal pathology with malignancies such as renal cell and urothelial carcinoma. We systematically evaluate 11 publicly available HFMs across 11 kidney-specific downstream tasks spanning multiple stains (PAS, H&E, PASM, and IHC), spatial scales (tile and slide-level), task types (classification, regression, and copy detection), and clinical objectives, including detection, diagnosis, and prognosis. Tile-level performance is assessed using repeated stratified group cross-validation, while slide-level tasks are evaluated using repeated nested stratified cross-validation. Statistical significance is examined using Friedman test followed by pairwise Wilcoxon signed-rank testing with Holm-Bonferroni correction and compact letter display visualization. To promote reproducibility, we release an open-source Python package, kidney-hfm-eval, available at https://pypi.org/project/kidney-hfm-eval/ , that reproduces the evaluation pipelines. Results show moderate to strong performance on tasks driven by coarse meso-scale renal morphology, including diagnostic classification and detection of prominent structural alterations. In contrast, performance consistently declines for tasks requiring fine-grained microstructural discrimination, complex biological phenotypes, or slide-level prognostic inference, largely independent of stain type. Overall, current HFMs appear to encode predominantly static meso-scale representations and may have limited capacity to capture subtle renal pathology or prognosis-related signals. Our results highlight the need for kidney-specific, multi-stain, and multimodal foundation models to support clinically reliable decision-making in nephrology.

中文标题/摘要

标题：肾数字病理图像综合基准测试：基于组织病理学的基础模型

组织病理学基础模型（HFMs）在大规模癌症数据集上预训练，已推动计算病理学的发展。然而，它们在非癌性慢性肾病中的适用性尚未得到充分探索，尽管肾脏病变常与肾细胞癌和尿路上皮癌等恶性肿瘤共存。我们系统地评估了11个公开可用的HFMs在11个肾特异性下游任务中的表现，这些任务涵盖多种染色（PAS、H&E、PASM和IHC）、空间尺度（图块级和全切片级）、任务类型（分类、回归和拷贝检测）以及临床目标（包括检测、诊断和预后）。图块级别的性能使用重复分层组交叉验证评估，而全切片级别的任务则使用重复嵌套分层交叉验证评估。统计显著性通过Friedman检验，随后进行带霍尔姆-邦费罗尼校正的成对Wilcoxon符号秩检验，并辅以紧凑字母显示可视化来检验。为了促进可重复性，我们发布了一个开源Python包kidney-hfm-eval，可在https://pypi.org/project/kidney-hfm-eval/ 获取，该包可重现评估管道。结果表明，HFMs在由粗粒度中尺度肾形态驱动的任务中表现出中等到较强的性能，包括诊断分类和显著结构改变的检测。相比之下，对于需要精细微结构区分、复杂生物表型或全切片级别预后推断的任务，其表现一致下降，这在很大程度上与染色类型无关。总体而言，当前的HFMs似乎主要编码静态的中尺度表示，可能难以捕捉到细微的肾病理或与预后相关的信号。我们的结果强调了需要针对肾脏的、多染色和多模态的基础模型，以支持肾病学中的临床可靠决策。
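The Holm-Bonferroni step applied after the pairwise Wilcoxon tests is simple enough to write out. The sketch below is a generic implementation of the standard step-down adjustment, not code from the kidney-hfm-eval package. The function name and the example p-values are made up for illustration.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order.

    The i-th smallest p-value (0-based rank i) is multiplied by (m - i),
    capped at 1, and forced to be non-decreasing along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from four pairwise model comparisons.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_bonferroni(raw)
significant = [p <= 0.05 for p in adjusted]
print(adjusted, significant)
```

Here two of the four comparisons stay significant at the 0.05 level after correction, although all four raw p-values were below 0.05.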

Efficient Training-Free Multi-Token Prediction via Embedding-Space Probing

Authors: Raghavv Goel, Mukul Gagrani, Mingu Lee, Chris Lott

First: 2026-03-18T17:14:01+00:00 · Latest: 2026-03-18T17:14:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) exhibit latent multi-token prediction (MTP) capabilities despite being trained solely for next-token generation. We propose a simple, training-free MTP approach that probes an LLM using on-the-fly mask tokens drawn from its embedding space, enabling parallel prediction of future tokens without modifying model weights or relying on auxiliary draft models. Our method constructs a speculative token tree by sampling top-K candidates from mask-token logits and applies a lightweight pruning strategy to retain high-probability continuations. During decoding, candidate predictions are verified in parallel, resulting in lossless generation while substantially reducing the number of model calls and improving token throughput. Across benchmarks, our probing-based MTP consistently outperforms existing training-free baselines, increasing acceptance length by approximately 12% on LLaMA3 and 8-12% on Qwen3, and achieving throughput gains of up to 15-19%. Finally, we provide theoretical insights and empirical evidence showing that decoder layers naturally align mask-token representations with next-token states, enabling accurate multi-step prediction without retraining or auxiliary models.

中文标题/摘要

标题：通过嵌入空间探测实现高效的无训练多令牌预测

大型语言模型（LLMs）在仅被训练用于生成下一个令牌的情况下，表现出潜在的多令牌预测（MTP）能力。我们提出了一种简单的、无训练的MTP方法，通过从其嵌入空间中随机生成掩码令牌来探测LLM，从而可以在不修改模型权重或依赖辅助草稿模型的情况下并行预测未来令牌。该方法通过从掩码令牌的logits中采样前K个候选者来构建推测令牌树，并应用轻量级的剪枝策略以保留高概率的延续。在解码过程中，候选预测并行验证，从而实现无损生成，同时显著减少模型调用次数并提高令牌吞吐量。在基准测试中，我们的基于探测的MTP方法始终优于现有的无训练基线，在LLaMA3上增加接受长度约12%，在Qwen3上增加8-12%，并实现高达15-19%的吞吐量增益。最后，我们提供了理论见解和实验证据，表明解码器层自然地将掩码令牌表示与下一个令牌状态对齐，从而在无需重新训练或辅助模型的情况下实现准确的多步预测。

Summary / 总结

The research aims to leverage the latent multi-token prediction (MTP) capabilities of large language models (LLMs) without training. It proposes a training-free method that uses on-the-fly mask tokens from the embedding space to predict future tokens in parallel. The method constructs a speculative token tree by sampling top-K candidates and applies pruning to retain high-probability continuations. This approach achieves lossless generation with reduced model calls and improved token throughput, outperforming existing training-free baselines by increasing acceptance length and improving throughput.

研究旨在利用大型语言模型（LLM）的潜在多令牌预测（MTP）能力，而不进行训练。提出了一种无训练方法，使用从嵌入空间中即时生成的掩码令牌并行预测未来令牌。该方法通过采样 top-K 候选者构建推测令牌树，并应用剪枝策略保留高概率延续。这种方法实现了无损生成，减少了模型调用次数并提高了令牌吞吐量，优于现有无训练基线，增加了接受长度并提高了吞吐量。
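The candidate-construction and verification loop can be sketched on toy data. This is a simplified illustration and not the authors' algorithm: the tree is flattened into explicit paths, positions are treated as independent when scoring, and the probabilities and the `verified` sequence (what the full model would actually emit) are hypothetical inputs.

```python
import heapq
from itertools import product

def top_k(dist, k):
    """Return the k most probable (token, prob) pairs from one position."""
    return heapq.nlargest(k, dist.items(), key=lambda kv: kv[1])

def build_pruned_paths(position_dists, k=2, min_joint_prob=0.05, max_paths=4):
    """Enumerate top-k continuations per position and keep the likeliest paths."""
    per_position = [top_k(d, k) for d in position_dists]
    paths = []
    for combo in product(*per_position):
        joint = 1.0
        for _, p in combo:
            joint *= p  # independence across positions is a simplifying assumption
        if joint >= min_joint_prob:
            paths.append((joint, tuple(tok for tok, _ in combo)))
    paths.sort(key=lambda jp: jp[0], reverse=True)
    return paths[:max_paths]

def accepted_prefix(paths, verified_tokens):
    """Longest candidate prefix that matches the verified tokens exactly."""
    best = ()
    for _, tokens in paths:
        n = 0
        while n < len(tokens) and n < len(verified_tokens) and tokens[n] == verified_tokens[n]:
            n += 1
        if n > len(best):
            best = tokens[:n]
    return best

# Hypothetical distributions read off three mask-token positions.
position_dists = [
    {"the": 0.6, "a": 0.3, "an": 0.1},
    {"cat": 0.5, "dog": 0.4, "car": 0.1},
    {"sat": 0.7, "ran": 0.2, "is": 0.1},
]
paths = build_pruned_paths(position_dists)
verified = ["the", "dog", "ran"]  # assumed output of the parallel verification pass
accepted = accepted_prefix(paths, verified)
print([tokens for _, tokens in paths], accepted)
```

Only tokens that agree with the verified sequence are accepted, which is what keeps this style of decoding lossless. A longer accepted prefix means fewer sequential model calls.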

Interpretable Traffic Responsibility from Dashcam Video via Legal Multi Agent Reasoning

Authors: Jingchun Yang, Jinchang Zhang

First: 2026-03-18T17:04:48+00:00 · Latest: 2026-03-18T17:04:48+00:00

Abs · PDF · Code1 · Code2

Abstract

The widespread adoption of dashcams has made video evidence in traffic accidents increasingly abundant, yet transforming "what happened in the video" into "who is responsible under which legal provisions" still relies heavily on human experts. Existing ego-view traffic accident studies mainly focus on perception and semantic understanding, while LLM-based legal methods are mostly built on textual case descriptions and rarely incorporate video evidence, leaving a clear gap between the two. We first propose C-TRAIL, a multimodal legal dataset that, under the Chinese traffic regulation system, explicitly aligns dashcam videos and textual descriptions with a closed set of responsibility modes and their corresponding Chinese traffic statutes. On this basis, we introduce a two-stage framework: (1) a traffic accident understanding module that generates textual video descriptions; and (2) a legal multi-agent framework that outputs responsibility modes, statute sets, and complete judgment reports. Experimental results on C-TRAIL and MM-AU show that our method outperforms general and legal LLMs, as well as existing agent-based approaches, while providing a transparent and interpretable legal reasoning process.

中文标题/摘要

标题：基于法律多智能体推理的可解释交通责任从行车记录仪视频解读

行车记录仪的广泛应用使得交通事故视频证据日益丰富，但将“视频中发生了什么”转化为“根据哪些法律规定谁应承担责任”仍主要依赖于人类专家。现有的以自我视角为主的交通事故研究主要集中在感知和语义理解上，而基于LLM的法律方法大多基于文本案例描述，很少包含视频证据，这在两者之间留下了一个明显的差距。我们首先提出了C-TRAIL，这是一个多模态法律数据集，在中国的交通法规体系下，明确地将行车记录仪视频和文本描述与一组封闭的责任模式及其对应的中国交通法规进行了对齐。在此基础上，我们引入了一个两阶段框架：(1) 交通事故理解模块，生成视频文本描述；(2) 法律多智能体框架，输出责任模式、法规集合和完整的判决报告。C-TRAIL和MM-AU上的实验结果显示，我们的方法优于通用和法律LLM以及现有的基于代理的方法，同时提供了一个透明和可解释的法律推理过程。

Summary / 总结

The paper addresses the challenge of converting dashcam video evidence into legal responsibility under traffic regulations by proposing C-TRAIL, a multimodal dataset, and a two-stage framework. The first stage generates textual descriptions of traffic accidents from videos, and the second stage uses a legal multi-agent framework to determine responsibility and legal statutes. The method outperforms general and legal language models and existing agent-based approaches, offering transparent and interpretable legal reasoning.

该研究通过提出C-TRAIL多模态法律数据集，解决了交通事故中视频证据与法律责任之间的差距。研究引入了两阶段框架：交通事故理解模块从行车记录仪视频生成文本描述，以及法律多代理框架输出责任模式和判决报告。该方法在一般和法律LLM以及现有基于代理的方法中表现出色，提供了透明和可解释的法律推理过程。

A practical artificial intelligence framework for legal age estimation using clavicle computed tomography scans

Authors: Javier Venema, Stefano De Luca, Pablo Mesejo, Óscar Ibáñez

First: 2026-03-18T17:02:01+00:00 · Latest: 2026-03-18T17:02:01+00:00

Comments: 15 pages, 8 figures, submitted to Engineering Applications of Artificial Intelligence

Abs · PDF · Code1 · Code2

Abstract

Legal age estimation plays a critical role in forensic and medico-legal contexts, where decisions must be supported by accurate, robust, and reproducible methods with explicit uncertainty quantification. While prior artificial intelligence (AI)-based approaches have primarily focused on hand radiographs or dental imaging, clavicle computed tomography (CT) scans remain underexplored despite their documented effectiveness for legal age estimation. In this work, we present an interpretable, multi-stage pipeline for legal age estimation from clavicle CT scans. The proposed framework combines (i) a feature-based connected-component method for automatic clavicle detection that requires minimal manual annotation, (ii) an Integrated Gradients-guided slice selection strategy used to construct the input data for a multi-slice convolutional neural network that estimates legal age, and (iii) conformal prediction intervals to support uncertainty-aware decisions in accordance with established international protocols. The pipeline is evaluated on 1,158 full-body post-mortem CT scans from a public forensic dataset (the New Mexico Decedent Image Database). The final model achieves state-of-the-art performance with a mean absolute error (MAE) of 1.55 $\pm$ 0.16 years on a held-out test set, outperforming both human experts (MAE of approximately 1.90 years) and previous methods (MAEs above 1.75 years on the same dataset). Furthermore, conformal prediction enables configurable coverage levels aligned with forensic requirements. Attribution maps indicate that the model focuses on anatomically relevant regions of the medial clavicular epiphysis. The proposed method, which is currently being integrated into the Skeleton-ID software (https://skeleton-id.com/skeleton-id/), is intended as a decision-support component within multi-factorial forensic workflows.

中文标题/摘要

标题：使用锁骨计算机断层扫描图像的人工智能框架在法律年龄估计中的实际应用

法律年龄估计在法医和医学法律领域中起着关键作用，需要准确、稳健且可重复的方法，并且需要明确的不确定性量化。尽管先前的人工智能（AI）方法主要集中在手部放射照相或牙科成像上，但锁骨计算机断层（CT）扫描尽管已被证明对法律年龄估计有效，仍未得到充分探索。本文提出了一种用于从锁骨CT扫描中进行法律年龄估计的可解释多阶段管道。所提出的框架结合了（i）一种基于特征的连通区域方法，用于自动锁骨检测，需要最少的手动注释；（ii）一种基于Integrated Gradients的切片选择策略，用于构建输入数据，以供多切片卷积神经网络估计法律年龄；（iii）共形预测区间，以按照既定的国际规程支持考虑不确定性的决策。该管道在公共法医数据集（新墨西哥遗体图像数据库）中的1,158例全身死后CT扫描上进行了评估。最终模型在保留的测试集上实现了最先进的性能，平均绝对误差（MAE）为1.55 ± 0.16岁，优于人类专家（MAE约1.90岁）和先前方法（在相同数据集中MAE超过1.75岁）。此外，共形预测可以配置覆盖水平以满足法医要求。归因图表明，模型专注于锁骨内侧骨骺的解剖相关区域。所提出的方法目前正被集成到Skeleton-ID软件（https://skeleton-id.com/skeleton-id/）中，旨在作为多因素法医工作流程中的决策支持组件。

Summary / 总结

This study presents an AI framework for legal age estimation using clavicle CT scans, addressing the need for accurate and robust methods with uncertainty quantification. The framework includes an automatic clavicle detection method, an Integrated Gradients-guided slice selection strategy, and conformal prediction intervals. Evaluated on 1,158 post-mortem CT scans, the model achieved a mean absolute error of 1.55 years, outperforming human experts and previous methods. Conformal prediction supports configurable coverage levels, and attribution maps show the model focuses on relevant anatomical regions.

该研究提出了一种基于锁骨CT扫描的法律年龄估计AI框架，旨在提供准确且稳健的方法，并量化不确定性。该框架包括自动锁骨检测方法、Integrated Gradients引导的切片选择策略以及置信预测区间。在1,158例尸检CT扫描上评估，模型达到了1.55年的平均绝对误差，优于人类专家和先前方法。置信预测支持符合法医要求的可配置覆盖水平，并将该方法整合到Skeleton-ID软件中作为决策支持工具。
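Conformal prediction intervals with a configurable coverage level can be illustrated with the textbook split-conformal recipe for regression. The sketch below is generic and makes no claim about the exact variant used in the paper. The calibration ages and predictions are invented numbers, and the function names are hypothetical.

```python
import math

def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of absolute calibration residuals.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest residual. A tiny
    epsilon guards against floating-point overshoot in the product.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha) - 1e-12)
    if rank > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[rank - 1]

def predict_interval(point_estimate, q):
    """Symmetric interval around a point estimate."""
    return (point_estimate - q, point_estimate + q)

# Hypothetical calibration set: true ages and model predictions (years).
true_age = [18.0, 21.5, 16.0, 25.0, 19.0, 23.0, 17.5, 20.0, 22.0]
pred_age = [17.5, 20.5, 17.5, 23.0, 21.5, 20.0, 21.0, 24.0, 17.5]
residuals = [abs(t - p) for t, p in zip(true_age, pred_age)]

q80 = conformal_quantile(residuals, alpha=0.2)  # target coverage of 80%
interval = predict_interval(19.0, q80)
print(q80, interval)
```

Lowering `alpha` widens the interval. With only nine calibration points a 95% target cannot be met by any finite residual, so the function returns an infinite half-width in that case.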

SegFly: A 2D-3D-2D Paradigm for Aerial RGB-Thermal Semantic Segmentation at Scale

Authors: Markus Gross, Sai Bharadhwaj Matha, Rui Song, Viswanathan Muthuveerappan, Conrad Christoph, Julius Huber, Daniel Cremers

First: 2026-03-18T16:57:22+00:00 · Latest: 2026-03-18T16:57:22+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Semantic segmentation for uncrewed aerial vehicles (UAVs) is fundamental for aerial scene understanding, yet existing RGB and RGB-T datasets remain limited in scale, diversity, and annotation efficiency due to the high cost of manual labeling and the difficulties of accurate RGB-T alignment on off-the-shelf UAVs. To address these challenges, we propose a scalable geometry-driven 2D-3D-2D paradigm that leverages multi-view redundancy in high-overlap aerial imagery to automatically propagate labels from a small subset of manually annotated RGB images to both RGB and thermal modalities within a unified framework. By lifting less than 3% of RGB images into a semantic 3D point cloud and reprojecting it into all views, our approach enables dense pseudo ground-truth generation across large image collections, automatically producing 97% of RGB labels and 100% of thermal labels while achieving 91% and 88% annotation accuracy without any 2D manual refinement. We further extend this 2D-3D-2D paradigm to cross-modal image registration, using 3D geometry as an intermediate alignment space to obtain fully automatic, strong pixel-level RGB-T alignment with 87% registration accuracy and no hardware-level synchronization. Applying our framework to existing geo-referenced aerial imagery, we construct SegFly, a large-scale benchmark with over 20,000 high-resolution RGB images and more than 15,000 geometrically aligned RGB-T pairs spanning diverse urban, industrial, and rural environments across multiple altitudes and seasons. On SegFly, we establish the Firefly baseline for RGB and thermal semantic segmentation and show that both conventional architectures and vision foundation models benefit substantially from SegFly supervision, highlighting the potential of geometry-driven 2D-3D-2D pipelines for scalable multi-modal scene understanding. Data and Code available at https://github.com/markus-42/SegFly.

中文标题/摘要

标题：SegFly：大规模空地RGB-热成像语义分割的二维-三维-二维范式

无人驾驶航空器（UAV）的语义分割是空中场景理解的基础，但现有的RGB和RGB-T数据集在规模、多样性和注释效率方面仍然受限，由于手动标注成本高以及现成UAV上RGB-T对齐的困难。为了解决这些挑战，我们提出了一种可扩展的几何驱动的二维-三维-二维范式，利用高重叠空中多视角冗余来自动从少量手动标注的RGB图像中传播标签到统一框架内的RGB和热成像模态。通过将不到3%的RGB图像提升到语义三维点云并重新投影到所有视图中，我们的方法能够在大规模图像集合中生成密集的伪地面真值，自动产生97%的RGB标签和100%的热成像标签，同时在无需任何二维手动细化的情况下达到91%和88%的注释准确率。我们进一步将二维-三维-二维范式扩展到跨模态图像配准，使用三维几何作为中间对齐空间，获得完全自动的强像素级RGB-T对齐，注册准确率为87%，无需硬件级同步。将我们的框架应用于现有地理参考的空中图像，我们构建了SegFly，一个包含超过20,000张高分辨率RGB图像和超过15,000个几何对齐的RGB-T配对的大规模基准，这些配对跨越了多种海拔和季节的多样城市、工业和农村环境。在SegFly上，我们建立了Firefly基线用于RGB和热成像语义分割，并展示了传统架构和视觉基础模型从SegFly监督中受益显著，突显了几何驱动的二维-三维-二维管道在多模态场景理解中的潜力。数据和代码可在https://github.com/markus-42/SegFly/获取。
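The reprojection half of a 2D-3D-2D pipeline amounts to projecting labeled 3D points into each camera and keeping the nearest point per pixel. The sketch below assumes a pinhole camera with identity rotation and a known position, which is a deliberate simplification. The function name, intrinsics, and points are hypothetical and not taken from the SegFly code.

```python
def project_labels(points, labels, fx, fy, cx, cy, width, height,
                   cam_pos=(0.0, 0.0, 0.0)):
    """Render a sparse label image from labeled 3D points.

    points are world coordinates; the camera looks along +Z with identity
    rotation (assumption). A z-buffer keeps the closest point per pixel.
    """
    label_img = [[None] * width for _ in range(height)]
    depth = [[float("inf")] * width for _ in range(height)]
    for (px, py, pz), label in zip(points, labels):
        x, y, z = px - cam_pos[0], py - cam_pos[1], pz - cam_pos[2]
        if z <= 0:
            continue  # behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            label_img[v][u] = label
    return label_img

points = [(0.0, 0.0, 4.0), (0.0, 0.0, 2.0), (2.0, 0.0, 4.0), (0.0, 0.0, -1.0)]
labels = ["road", "roof", "tree", "behind"]
intrinsics = dict(fx=2.0, fy=2.0, cx=2.0, cy=2.0, width=5, height=5)

view_a = project_labels(points, labels, **intrinsics)
view_b = project_labels(points, labels, cam_pos=(2.0, 0.0, 0.0), **intrinsics)
print(view_a[2], view_b[2])
```

The same labeled points yield labels in both views, and the occluded "road" point in the first view becomes visible once the camera moves. That multi-view redundancy is what lets a small annotated subset cover a large image collection.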

Only relative ranks matter in weight-clustered large language models

Authors: Borja Aizpurua, Sukhbinder Singh, Román Orús

First: 2026-03-18T16:55:13+00:00 · Latest: 2026-03-18T16:55:13+00:00

Comments: 10 pages, 3 figures, 9 tables

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) contain billions of parameters, yet many exact values are not essential. We show that what matters most is the relative rank of weights (whether one connection is stronger or weaker than another) rather than precise magnitudes. To reduce the number of unique weight values, we apply weight clustering to pretrained models, replacing every weight matrix with K shared values from K-means. For Llama 3.1-8B-Instruct and SmolLM2-135M, reducing each matrix to only 16-64 distinct values preserves strong accuracy without retraining, providing a simple, training-free method to compress LLMs on disk. Optionally fine-tuning only the cluster means (centroids) recovers 30-40 percent of the remaining accuracy gap at minimal cost. We then systematically randomize cluster means while keeping assignments fixed. Scrambling the relative ranks of the clusters degrades quality sharply (perplexity can increase by orders of magnitude), even when global statistics such as mean and variance are preserved. In contrast, rank-preserving randomizations cause almost no loss at mid and late layers. On the other hand, when many layers are perturbed simultaneously, progressive layer-by-layer replacement reveals that scale drift, not rank distortion, is the dominant collapse mechanism; however, an affine correction w' = aw + b with a > 0 (which preserves both rank order and overall weight distribution) can substantially delay this drift. This rank-based perspective offers a new lens on model compression and robustness.

中文标题/摘要

标题：在权重聚类的大语言模型中，只有相对排名才重要

大语言模型（LLMs）包含数十亿个参数，但许多精确值并不重要。我们表明，最重要的是权重的相对排名——一个连接是否比另一个更强或更弱，而不是精确的大小。为了减少唯一的权重值数量，我们对预训练模型应用权重聚类，将每个权重矩阵替换为K-means的K个共享值。对于Llama 3.1-8B-Instruct和SmolLM2-135M，将每个矩阵减少到仅16-64个不同的值，可以在不重新训练的情况下保持强大的准确性，提供了一种简单且无需训练的方法来压缩LLM在磁盘上的大小。可选地仅微调聚类均值（质心），可以以极小的成本恢复30-40％的剩余准确性差距。然后系统地随机化聚类均值，同时保持分配固定。打乱聚类的相对排名会急剧降低质量——困惑度可以增加几个数量级——即使全局统计量如均值和方差保持不变。相比之下，在中间和后期层中，保持排名不变的随机化几乎不会造成损失。另一方面，当同时扰动许多层时，逐层逐层的替换显示，尺度漂移而不是排名失真是主要的崩溃机制；然而，一个仿射修正w'=aw+b（其中a>0，既保持了排名顺序，又保持了整体权重分布）可以显著延迟这种漂移。基于排名的观点为模型压缩和鲁棒性提供了一个新的视角。

Summary / 总结

The research explores the importance of relative weight ranks in large language models (LLMs) and reduces the number of unique weight values through weight clustering. Using K-means, the study shows that reducing each weight matrix to 16-64 distinct values preserves strong accuracy without retraining, and that fine-tuning only the cluster means recovers 30-40 percent of the remaining accuracy gap. Randomizing cluster means while keeping assignments fixed degrades quality sharply when the relative ranks are scrambled, whereas rank-preserving randomizations cause almost no loss at mid and late layers. When many layers are perturbed simultaneously, scale drift rather than rank distortion becomes the dominant collapse mechanism, and an affine correction w' = aw + b with a > 0 can substantially delay this drift.

研究旨在探索大型语言模型（LLMs）中相对权重等级的重要性，并提出了一种通过权重聚类减少唯一权重值数量的方法。通过应用K-means聚类，研究显示将每个矩阵减少到16-64个不同的值可以保持较强的准确性且无需重新训练。仅微调聚类均值可以恢复部分丢失的准确性。在保持分配不变的情况下随机化聚类均值会导致模型质量显著下降，表明相对等级的重要性。然而，在许多层同时扰动时，逐层替换显示尺度漂移而非等级扭曲是主要问题，而保持等级顺序和整体权重分布的仿射修正（a > 0）可以显著延缓这种漂移。基于等级的观点为模型压缩和鲁棒性提供了新的视角。
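The clustering and rank-perturbation experiments can be mimicked on a toy weight vector. The sketch below is a plain 1-D Lloyd's k-means with quantile initialization, followed by one rank-preserving and one rank-scrambling edit of the centroids. The synthetic Gaussian weights, the constants `a = 1.5` and `b = 0.01`, and the use of reversal as the scrambling permutation are illustrative assumptions, not the paper's protocol.

```python
import random

def kmeans_1d(values, k, iters=25):
    """1-D Lloyd's k-means with quantile initialization."""
    ordered = sorted(values)
    centroids = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]

    def assign_all():
        return [min(range(k), key=lambda j: abs(v - centroids[j])) for v in values]

    for _ in range(iters):
        assign = assign_all()
        for j in range(k):
            members = [v for v, a in zip(values, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = sum(members) / len(members)
    return centroids, assign_all()

def rank_order(centroids):
    """Cluster indices sorted by centroid value."""
    return sorted(range(len(centroids)), key=centroids.__getitem__)

rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in for a weight matrix

centroids, assign = kmeans_1d(weights, k=16)
clustered = [centroids[a] for a in assign]  # every weight replaced by its cluster mean

# Rank-preserving edit: affine map with a > 0, assignments unchanged.
affine = [1.5 * c + 0.01 for c in centroids]
# Rank-scrambling edit: same centroid values, but attached to the wrong clusters.
scrambled = list(reversed(centroids))

print(len(set(clustered)),
      rank_order(affine) == rank_order(centroids),
      rank_order(scrambled) == rank_order(centroids))
```

Both edits leave the cluster assignments untouched. Only the second one changes which clusters hold the larger values, which is the property the abstract identifies as critical.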
