OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents
Authors: Akashah Shabbir, Muhammad Umer Sheikh, Muhammad Akhtar Munir, Hiyam Debary, Mustansar Fiaz, Muhammad Zaigham Zaheer, Paolo Fraccaro, Fahad Shahbaz Khan, Muhammad Haris Khan, Xiao Xiang Zhu, Salman Khan
First: 2026-02-19T18:59:54+00:00 · Latest: 2026-02-19T18:59:54+00:00
Abstract
Recent progress in multimodal reasoning has enabled agents that can interpret imagery, connect it with language, and perform structured analytical tasks. Extending such capabilities to the remote sensing domain remains challenging, as models must reason over spatial scale, geographic structures, and multispectral indices while maintaining coherent multi-step logic. To bridge this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multistep tool interactions across diverse analytical contexts. The accompanying corpus comprises 14,538 training and 1,169 evaluation instances, with more than 100K reasoning steps in the training split and over 7K reasoning steps in the evaluation split. It spans urban, environmental, disaster, and infrastructure domains, and incorporates GIS-based operations alongside index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent demonstrates structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions across diverse conditions. We report consistent improvements over a strong baseline and competitive performance relative to recent open and closed-source models.
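Title / abstract (translated from Chinese)
Title: OpenEarthAgent: A Unified Framework for Tool-Augmented Geospatial Agents
Recent advances in multimodal reasoning have enabled agents to interpret images, connect them with language, and carry out structured analytical tasks. Extending these capabilities to remote sensing remains challenging, because models must reason over spatial scale, geographic structure, and multispectral indices while keeping multi-step logic coherent. To close this gap, OpenEarthAgent introduces a unified framework for developing tool-augmented geospatial agents trained on satellite imagery, natural-language queries, and detailed reasoning traces. The training pipeline relies on supervised fine-tuning over structured reasoning trajectories, aligning the model with verified multi-step tool interactions across a variety of analytical contexts. The accompanying corpus contains 14,538 training instances and 1,169 evaluation instances, with more than 100,000 reasoning steps in the training split and more than 7,000 in the evaluation split. It covers urban, environmental, disaster, and infrastructure domains, and combines GIS operations with index analyses such as NDVI, NBR, and NDBI. Grounded in explicit reasoning traces, the learned agent shows structured reasoning, stable spatial understanding, and interpretable behaviour through tool-driven geospatial interactions under diverse conditions. We report consistent improvements over a strong baseline and competitive performance against recent open-source and closed-source models.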
Summary
The research aims to develop geospatial agents that can handle complex remote sensing tasks, such as interpreting satellite imagery and performing structured analysis. The method, a unified framework called OpenEarthAgent, trains agents by supervised fine-tuning on a corpus of satellite imagery, natural-language queries, and reasoning traces. Key findings show that the agent improves upon a strong baseline and performs competitively with recent models, demonstrating structured reasoning and stable spatial understanding in various geospatial contexts.
OpenEarthAgent is a unified framework for developing tool-augmented geospatial agents that interpret satellite imagery and perform structured analytical tasks. The framework is trained by supervised fine-tuning on a large dataset of reasoning traces and tool interactions. Key findings include consistent improvements over a strong baseline and competitive performance against recent models, demonstrating structured reasoning and stable spatial understanding across diverse geospatial contexts.
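The index analyses named in the OpenEarthAgent abstract (NDVI, NBR, NDBI) are all normalized band differences. The sketch below uses the commonly cited definitions of these indices; the function names, the per-pixel interface, and the sample reflectance values are illustrative assumptions and are not taken from the paper or its tools.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 when the denominator is zero."""
    denom = a + b
    return (a - b) / denom if denom != 0 else 0.0


def spectral_indices(red, nir, swir1, swir2):
    """Compute three common indices for one pixel of surface reflectance.

    Band names are assumptions for this sketch: red, near-infrared (nir),
    and two shortwave-infrared bands (swir1, swir2).
    """
    return {
        "NDVI": normalized_difference(nir, red),    # vegetation vigour
        "NBR": normalized_difference(nir, swir2),   # burn severity
        "NDBI": normalized_difference(swir1, nir),  # built-up areas
    }


# Hypothetical reflectance values for a vegetated pixel.
pixel = spectral_indices(red=0.1, nir=0.5, swir1=0.3, swir2=0.1)
```

A high NDVI together with a negative NDBI, as for this sample pixel, is what one would typically expect over vegetation rather than built-up land.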
Sink-Aware Pruning for Diffusion Language Models
Authors: Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen
First: 2026-02-19T18:59:50+00:00 · Latest: 2026-02-19T18:59:50+00:00
Comments: Code at: https://github.com/VILA-Lab/Sink-Aware-Pruning
Abstract
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.
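Title / abstract (translated from Chinese)
Title: Sink-Aware Pruning for Diffusion Language Models
Diffusion Language Models (DLMs) incur high inference cost because of iterative denoising, which calls for efficient pruning. Existing pruning heuristics mostly originate from autoregressive (AR) language models and usually keep attention "sink" tokens, because sinks in AR models act as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position varies substantially more over the full generation trajectory (measured by how the dominant sink locations shift between timesteps), which indicates that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior work usually keeps sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at https://github.com/VILA-Lab/Sink-Aware-Pruning.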
Summary
The research addresses the high inference cost of Diffusion Language Models (DLMs) by proposing Sink-Aware Pruning, which identifies and prunes unstable attention sink tokens. Unlike existing methods that preserve sink tokens due to their assumed stability in autoregressive models, this study shows that sink positions in DLMs are highly variable. The proposed method improves the quality-efficiency trade-off without retraining and outperforms previous pruning techniques under similar computational resources.
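The paper targets the high inference cost of Diffusion Language Models (DLMs) and proposes a sink-aware pruning method that automatically identifies and prunes unstable attention sinks. Whereas existing methods keep sink positions because they are stable in autoregressive models, the authors show that sink positions in DLMs vary considerably during generation. As a result, the method obtains a better quality-efficiency trade-off without retraining and outperforms prior pruning baselines under the same compute budget.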
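The Sink-Aware Pruning abstract describes sink stability in terms of how the dominant sink location shifts across timesteps. A minimal sketch of one way to quantify that is shown here; it assumes that each timestep is summarized as a list of attention mass received per token and that variance of the arg-max index is the statistic of interest. Both choices are illustrative assumptions, and this is only the measurement, not the paper's pruning procedure.

```python
from statistics import pvariance


def dominant_sink(attn_received):
    """Index of the token that receives the most attention at one timestep."""
    return max(range(len(attn_received)), key=lambda i: attn_received[i])


def sink_position_variance(attn_per_step):
    """Population variance of the dominant sink index across timesteps."""
    positions = [dominant_sink(step) for step in attn_per_step]
    return pvariance(positions)


# Toy attention summaries (hypothetical values): four timesteps, four tokens.
stable = [[0.7, 0.1, 0.1, 0.1]] * 4          # sink stays at token 0
drifting = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.7, 0.1, 0.1],
]                                            # sink moves every step
```

In this toy setup the stable trajectory has zero variance, while the drifting one does not, which mirrors the AR-versus-DLM contrast the abstract draws.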
CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Authors: Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello, Maud Ehrmann, Simon Clematide
First: 2026-02-19T18:59:44+00:00 · Latest: 2026-02-19T18:59:44+00:00
Comments: ECIR 2026. CLEF Evaluation Lab. Registration DL: 2026/04/23. Task Homepage at https://hipe-eval.github.io/HIPE-2026/
Abstract
HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series toward semantic relation extraction by targeting the task of identifying person--place associations in multiple languages and time periods. Systems are asked to classify relations of two types - $at$ ("Has the person ever been at this place?") and $isAt$ ("Is the person located at this place around publication time?") - requiring reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications in knowledge-graph construction, historical biography reconstruction, and spatial analysis in digital humanities.
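Title / abstract (translated from Chinese)
Title: CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
HIPE-2026 is a CLEF evaluation lab focused on extracting person-place relations from noisy, multilingual historical texts. Building on the HIPE-2020 and HIPE-2022 campaigns, it extends the series to semantic relation extraction, with the task of identifying associations between persons and places across multiple languages and time periods. Systems are asked to classify two types of relation, $at$ ("Has the person ever been at this place?") and $isAt$ ("Is the person located at this place around publication time?"), which requires reasoning over temporal and geographical cues. The lab introduces a three-fold evaluation profile that jointly assesses accuracy, computational efficiency, and domain generalization. By linking relation extraction to large-scale historical data processing, HIPE-2026 aims to support downstream applications such as knowledge-graph construction, historical biography reconstruction, and spatial analysis in the digital humanities.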
Summary
HIPE-2026 is an evaluation lab under CLEF that focuses on extracting person-place relations from multilingual historical texts. Systems are evaluated based on their accuracy in classifying two types of relations, $at$ and $isAt$, and their computational efficiency and domain generalization. The lab introduces a three-fold evaluation profile to assess these aspects jointly. By processing large-scale historical data, the lab aims to support applications in knowledge-graph construction and historical biography reconstruction.
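HIPE-2026 is an evaluation lab under CLEF that focuses on extracting person-place relations from multilingual historical texts. Systems are evaluated on classification accuracy for two relation types, $at$ and $isAt$, as well as on computational efficiency and domain generalization. The lab introduces a three-fold evaluation profile to assess these aspects together. By processing large-scale historical data, it aims to support applications such as knowledge-graph construction and historical biography reconstruction.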
MARS: Margin-Aware Reward-Modeling with Self-Refinement
Authors: Payel Bhattacharjee, Osvaldo Simeone, Ravi Tandon
First: 2026-02-19T18:59:03+00:00 · Latest: 2026-02-19T18:59:03+00:00
Abstract
Reward modeling is a core component of modern alignment pipelines including RLHF and RLAIF, underpinning policy optimization methods including PPO and TRPO. However, training reliable reward models relies heavily on human-labeled preference data, which is costly and limited, motivating the use of data augmentation. Existing augmentation approaches typically operate at the representation or semantic level and remain agnostic to the reward model's estimation difficulty. In this paper, we propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets the ambiguous cases and failure modes of the reward model. MARS concentrates augmentation on low-margin (ambiguous) preference pairs, where the reward model is most uncertain, and iteratively refines the training distribution via hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, thereby enhancing information and improving conditioning, along with empirical results demonstrating consistent gains over uniform augmentation for robust reward modeling.
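Title / abstract (translated from Chinese)
Title: MARS: Margin-Aware Reward-Modeling with Self-Refinement
Reward modeling is a core component of modern alignment pipelines, including RLHF and RLAIF, and supports policy optimization methods such as PPO and TRPO. However, training reliable reward models depends on human-labeled preference data that is costly and limited, which has encouraged the use of data augmentation. Existing augmentation methods usually operate at the representation or semantic level and take no account of how difficult a pair is for the reward model to estimate. In this paper we propose MARS, an adaptive, margin-aware augmentation and sampling strategy that explicitly targets the reward model's ambiguous cases and failure modes. The MARS framework concentrates augmentation on low-margin (ambiguous) preference pairs, where the reward model is most uncertain, and iteratively refines the training distribution through hard-sample augmentation. We provide theoretical guarantees showing that this strategy increases the average curvature of the loss function, which enhances information and improves conditioning, and we show experimentally that it yields consistent improvements over uniform augmentation for robust reward modeling.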
Summary
The paper introduces MARS, a margin-aware reward-modeling approach that focuses on augmenting ambiguous preference pairs to improve the reward model's reliability. MARS iteratively refines the training distribution by concentrating on low-margin cases where the model is most uncertain and using hard-sample augmentation. Theoretical guarantees show that this strategy enhances the loss function's curvature and improves conditioning, while empirical results indicate consistent gains over uniform augmentation for robust reward modeling.
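MARS is a margin-aware augmentation and sampling strategy that focuses on ambiguous preference pairs to make the reward model more robust. The framework iteratively refines the training distribution through hard-sample augmentation, targeting the low-margin cases where the reward model is most uncertain. Theoretical guarantees show that the approach increases the average curvature of the loss function, which enhances information and improves conditioning. Experiments show consistent improvements over uniform augmentation in robust reward modeling.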
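The MARS abstract centres on low-margin preference pairs, where the reward model is most uncertain. The sketch below shows one plausible way to pick such pairs. It assumes the margin is the difference between the reward scores of the chosen and rejected responses and that preferences follow a Bradley-Terry model; the abstract states neither, and the function names, dictionary keys, and threshold are hypothetical.

```python
import math


def preference_probability(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))


def low_margin_pairs(pairs, threshold):
    """Keep pairs whose absolute reward margin is below the threshold.

    These are the ambiguous pairs on which augmentation would be focused.
    """
    return [p for p in pairs if abs(p["r_chosen"] - p["r_rejected"]) < threshold]


# Hypothetical reward-model scores for three preference pairs.
pairs = [
    {"id": "a", "r_chosen": 2.0, "r_rejected": -1.0},   # confident
    {"id": "b", "r_chosen": 0.2, "r_rejected": 0.1},    # ambiguous
    {"id": "c", "r_chosen": -0.3, "r_rejected": 0.0},   # ambiguous and mis-ranked
]
hard = low_margin_pairs(pairs, threshold=0.5)
```

Pairs "b" and "c" would be selected here; an iterative scheme would re-score after each training round and repeat the selection.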
What Language is This? Ask Your Tokenizer
Authors: Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel
First: 2026-02-19T18:58:39+00:00 · Latest: 2026-02-19T18:58:39+00:00
Abstract
Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings - surpassing 70% accuracy with as few as five labeled samples per language - and delivers large gains on fine-grained dialect identification.
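Title / abstract (translated from Chinese)
Title: What Language is This? Ask Your Tokenizer
Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it supports corpus curation, training data analysis, and cross-lingual evaluation of large language models. Although performance on high-resource languages is near perfect, existing systems remain brittle for low-resource and closely related languages. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm that makes use of its probabilistic framing, parameter estimation technique, and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports adding new languages incrementally without retraining existing models, and integrates naturally into existing language model tokenization pipelines. Empirical comparisons against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID is competitive on standard benchmarks, substantially improves sample efficiency in low-resource settings (exceeding 70% accuracy with as few as five labeled samples per language), and delivers large gains on fine-grained dialect identification.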
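The UniLID abstract describes language-conditional unigram distributions over a shared vocabulary, with segmentation decided per language. A toy sketch of that idea follows. It scores a string under each language with a Viterbi (best-segmentation) search and picks the highest-scoring language; whether the paper uses the best segmentation or a sum over segmentations is not stated in the abstract, so that choice, the tiny vocabulary, and all probabilities here are assumptions.

```python
import math


def best_segmentation_logprob(text, logprobs, max_len=8):
    """Highest total log-probability over all segmentations of text."""
    best = [0.0] + [-math.inf] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logprobs and best[start] > -math.inf:
                best[end] = max(best[end], best[start] + logprobs[piece])
    return best[-1]


def identify_language(text, models):
    """Return the language whose unigram model scores the text highest."""
    return max(models, key=lambda lang: best_segmentation_logprob(text, models[lang]))


# Shared vocabulary, language-specific probabilities (hypothetical values).
probs = {
    "en": {"the": 0.40, "der": 0.02, "t": 0.12, "h": 0.12, "e": 0.12, "d": 0.11, "r": 0.11},
    "de": {"the": 0.02, "der": 0.40, "t": 0.11, "h": 0.11, "e": 0.12, "d": 0.12, "r": 0.12},
}
models = {
    lang: {tok: math.log(p) for tok, p in dist.items()}
    for lang, dist in probs.items()
}
```

Because each language is just one more table of log-probabilities over the same vocabulary, adding a language in this sketch means adding a dictionary entry, which is consistent with the incremental-addition property the abstract mentions.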
Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval
Authors: Jiaqi Xi, Raghav Saboo, Luming Chen, Martin Wang, Sudeep Das
First: 2026-02-19T18:56:36+00:00 · Latest: 2026-02-19T18:56:36+00:00
Abstract
We propose a two-stage "Mine and Refine" contrastive training framework for semantic text embeddings to enhance multi-category e-commerce search retrieval. Large-scale e-commerce search demands embeddings that generalize to long-tail, noisy queries while adhering to scalable supervision compatible with product and policy constraints. A practical challenge is that relevance is often graded: users accept substitutes or complements beyond exact matches, and production systems benefit from clear separation of similarity scores across these relevance strata for stable hybrid blending and thresholding. To obtain scalable, policy-consistent supervision, we fine-tune a lightweight LLM on human annotations under a three-level relevance guideline and further reduce residual noise via engagement-driven auditing. In Stage 1, we train a multilingual Siamese two-tower retriever with a label-aware supervised contrastive objective that shapes a robust global semantic space. In Stage 2, we mine hard samples via ANN, re-annotate them with the policy-aligned LLM, and introduce a multi-class extension of circle loss that explicitly sharpens similarity boundaries between relevance levels, further refining and enriching the embedding space. Robustness is additionally improved through additive spelling augmentation and synthetic query generation. Extensive offline evaluations and production A/B tests show that our framework improves retrieval relevance and delivers statistically significant gains in engagement and business impact.
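Title / abstract (translated from Chinese)
Title: Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval
We propose a two-stage "Mine and Refine" contrastive training framework that strengthens semantic text embeddings to improve multi-category e-commerce search retrieval. Large-scale e-commerce search needs embeddings that generalize to long-tail, noisy queries while relying on scalable supervision that is compatible with product and policy constraints. The practical challenge is that relevance is often graded: users accept substitutes or complements beyond exact matches, and production systems benefit when similarity scores are clearly separated across these relevance levels, which stabilizes hybrid blending and thresholding. To obtain scalable supervision that is consistent with policy, we fine-tune a lightweight LLM under a three-level relevance guideline and further reduce residual noise through engagement-driven auditing. In Stage 1, we train a multilingual two-tower retriever with a label-aware supervised contrastive objective to shape a robust global semantic space. In Stage 2, we mine hard samples via ANN, re-annotate them with the policy-aligned LLM, and introduce a multi-class extension of circle loss that explicitly sharpens the similarity boundaries between relevance levels, further refining and enriching the embedding space. Robustness is improved further through additive spelling augmentation and synthetic query generation. Extensive offline evaluations and production A/B tests show that our framework improves retrieval relevance and delivers statistically significant gains in engagement and business impact.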
Summary
The paper proposes a two-stage 'Mine and Refine' framework to optimize semantic text embeddings for e-commerce search retrieval. In Stage 1, a multilingual Siamese two-tower retriever is trained with a label-aware supervised contrastive objective to create a robust global semantic space. In Stage 2, hard samples are mined using approximate nearest neighbor (ANN) search and re-annotated with a policy-aligned lightweight language model, and a multi-class circle loss is introduced to sharpen similarity boundaries between relevance levels. Offline and production A/B tests demonstrate improvements in retrieval relevance and significant gains in user engagement and business impact.
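The paper proposes a two-stage framework called 'Mine and Refine' for optimizing semantic text embeddings in e-commerce search retrieval. The framework is designed to handle long-tail and noisy queries while remaining scalable and policy-consistent. In Stage 1, a multilingual two-tower retriever is trained with a label-aware supervised contrastive objective to build a robust global semantic space. In Stage 2, hard samples are mined, then re-annotated with a policy-aligned lightweight LLM, and a multi-class circle loss is introduced to separate the similarity boundaries between relevance levels explicitly. The framework also uses spelling augmentation and synthetic query generation to improve robustness. Experiments show that it improves retrieval relevance and yields significant gains in engagement and business impact.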
Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
Authors: Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld
First: 2026-02-19T18:56:34+00:00 · Latest: 2026-02-19T18:56:34+00:00
Comments: 15 pages, 7 figures, 7 tables. Under review
Abstract
Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
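Title / abstract (translated from Chinese)
Title: Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking
Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences resembling the cross-linguistic regularities of human languages, particularly for syntactic phenomena such as word order. This paper extends that paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora that implement distinct DAM systems and evaluate their generalization with minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably show human-like preferences for the natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. By contrast, models do not reproduce the strong object preference of human languages, in which overt marking in DAM targets objects more often than subjects. These findings suggest that different typological tendencies may stem from distinct underlying sources.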
Summary
This paper investigates how language models trained on synthetic corpora handle differential argument marking (DAM), a semantic licensing system. Using GPT-2 models trained on 18 corpora with distinct DAM systems, the study finds that models exhibit human-like preferences for natural markedness direction but do not replicate the strong object preference observed in human languages. This suggests that different typological tendencies may arise from distinct underlying sources.
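The study examines how language models handle differential argument marking (DAM), a semantic licensing system. By training GPT-2 models on 18 corpora implementing distinct DAM systems and evaluating them with minimal pairs, it finds that models prefer the natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. However, models do not reproduce the strong object preference of human languages, in which overt marking in DAM more often targets objects than subjects. This suggests that different typological tendencies may arise from distinct underlying sources.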
Human-level 3D shape perception emerges from multi-view learning
Authors: Tyler Bonnen, Jitendra Malik, Angjoo Kanazawa
First: 2026-02-19T18:56:05+00:00 · Latest: 2026-02-19T18:56:05+00:00
Abstract
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligence, yet decades of computational methods have fallen short of human performance. Here we develop a modeling framework that predicts human 3D shape inferences for arbitrary objects, directly from experimental stimuli. We achieve this with a novel class of neural networks trained using a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations within a natural scene, these models learn to predict spatial information related to these images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals are analogous to sensory cues readily available to humans. We design a zero-shot evaluation approach to determine the performance of these 'multi-view' models on a well-established 3D perception task, then compare model and human behavior. Our modeling framework is the first to match human accuracy on 3D shape inferences, even without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our findings indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our findings can be found on our project page.
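Title / abstract (translated from Chinese)
Title: Human-level 3D shape perception emerges from multi-view learning
Humans can infer the three-dimensional structure of objects from two-dimensional visual input. Modeling this ability has been a long-term goal in the science and engineering of visual intelligence, yet after decades computational methods have still not reached human performance. We develop a modeling framework that predicts human 3D shape inferences for arbitrary objects directly from experimental stimuli. We use a novel class of neural networks trained with a visual-spatial objective over naturalistic sensory data; given a set of images taken from different locations in a natural scene, these models learn to predict spatial information related to the images, such as camera location and visual depth, without relying on any object-related inductive biases. Notably, these visual-spatial signals resemble sensory cues that humans can readily obtain. We design a zero-shot evaluation approach to measure the performance of these "multi-view" models on a well-established 3D perception task, and then compare model behavior with human behavior. Our modeling framework is the first to reach human accuracy on 3D shape inferences without task-specific training or fine-tuning. Remarkably, independent readouts of model responses predict fine-grained measures of human behavior, including error patterns and reaction times, revealing a natural correspondence between model dynamics and human perception. Taken together, our results indicate that human-level 3D perception can emerge from a simple, scalable learning objective over naturalistic visual-spatial data. All code, human behavioral data, and experimental stimuli needed to reproduce our results are available on our project page.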
Summary
This study aims to model human ability to infer 3D object structure from 2D images, which has been a challenge for computational methods. The researchers developed a neural network trained on visual-spatial data from natural scenes, learning to predict spatial information like camera location and depth without object-specific biases. This model matched human accuracy on 3D shape inference tasks and showed a natural correspondence with human perception patterns, including error types and reaction times. The model's performance suggests that human-level 3D perception can emerge from learning over naturalistic visual-spatial data without specific task training.
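The study aims to model the human ability to infer 3D shape from 2D images, which has been a challenge for decades. It uses a novel class of neural networks, trained on multi-view images of natural scenes, to predict spatial information such as camera location and depth. These models reach human accuracy on a 3D shape inference task, and their responses correlate with human error patterns and reaction times, indicating a natural correspondence between model dynamics and human perception.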
Multi-Round Human-AI Collaboration with User-Specified Requirements
Authors: Sima Noorani, Shayan Kiyani, Hamed Hassani, George Pappas
First: 2026-02-19T18:54:34+00:00 · Latest: 2026-02-19T18:54:34+00:00
Abstract
As humans increasingly rely on multi-round conversational AI for high-stakes decisions, principled frameworks are needed to ensure such interactions reliably improve decision quality. We adopt a human-centric view governed by two principles: counterfactual harm, ensuring the AI does not undermine human strengths, and complementarity, ensuring it adds value where the human is prone to err. We formalize these concepts via user-defined rules, allowing users to specify exactly what harm and complementarity mean for their specific task. We then introduce an online, distribution-free algorithm with finite-sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate our framework across two interactive settings: LLM-simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that our online procedure maintains prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. Moreover, tightening or loosening these constraints produces predictable shifts in downstream human accuracy, confirming that the two principles serve as practical levers for steering multi-round collaboration toward better decision quality without the need to model or constrain human behavior.
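Title / abstract (translated from Chinese)
Title: Multi-Round Human-AI Collaboration with User-Specified Requirements
As people rely more and more on multi-round conversational AI for high-stakes decisions, principled frameworks are needed to ensure that such interactions reliably improve decision quality. We take a human-centric view that follows two principles: counterfactual harm, which ensures the AI does not weaken human strengths, and complementarity, which ensures the AI adds value where the human is prone to error. We formalize these concepts through user-defined rules, so that users can state precisely what harm and complementarity mean for their task. We then introduce an online, distribution-free algorithm with finite-sample guarantees that enforces the user-specified constraints over the collaboration dynamics. We evaluate the framework in two interactive settings: LLM-simulated collaboration on a medical diagnostic task and a human crowdsourcing study on a pictorial reasoning task. We show that the online procedure maintains the prescribed counterfactual harm and complementarity violation rates even under nonstationary interaction dynamics. In addition, tightening or loosening these constraints produces predictable changes in downstream human accuracy, confirming that the two principles act as practical levers for steering multi-round collaboration toward better decision quality without modeling or constraining human behavior.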
Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Authors: Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
First: 2026-02-19T18:54:32+00:00 · Latest: 2026-02-19T18:54:32+00:00
Comments: Code at: https://github.com/vila-lab/M-Attack-V2
Abstract
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
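Title / abstract (translated from Chinese)
Title: Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging because gradients are unavailable and multimodal boundaries are complex. Although earlier transfer-based methods such as M-Attack perform well with local crop-level matching between source and target images, we find that this leads to high-variance, nearly orthogonal gradients across iterations, which violates coherent local alignment and destabilizes optimization. We attribute this to (i) the sensitivity of ViTs to spatial translation, which produces spike-like gradients, and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade of M-Attack. On the source side, Multi-Crop Alignment (MCA) averages the gradients of several independently sampled local views in each iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set drawn from a semantically correlated distribution, giving a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, which replays historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement that substantially improves transfer-based black-box attacks on frontier LVLMs: it raises the success rate on Claude-4.0 from 8% to 30%, on Gemini-2.5-Pro from 83% to 97%, and on GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.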
FAMOSE: A ReAct Approach to Automated Feature Discovery
Authors: Keith Burghardt, Jienan Liu, Sadman Sakib, Yuning Hao, Bo Li
First: 2026-02-19T18:53:15+00:00 · Latest: 2026-02-19T18:53:15+00:00
Comments: 23 pages, 6 figures
Abstract
Feature engineering remains a critical yet challenging bottleneck in machine learning, particularly for tabular data, as identifying optimal features from an exponentially large feature space traditionally demands substantial domain expertise. To address this challenge, we introduce FAMOSE (Feature AugMentation and Optimal Selection agEnt), a novel framework that leverages the ReAct paradigm to autonomously explore, generate, and refine features while integrating feature selection and evaluation tools within an agent architecture. To our knowledge, FAMOSE represents the first application of an agentic ReAct framework to automated feature engineering, and in particular the first to cover both regression and classification tasks. Extensive experiments demonstrate that FAMOSE is at or near the state-of-the-art on classification tasks (especially tasks with more than 10K instances, where ROC-AUC increases 0.23% on average), and achieves the state-of-the-art for regression tasks by reducing RMSE by 2.0% on average, while remaining more robust to errors than other algorithms. We hypothesize that FAMOSE performs strongly because ReAct allows the LLM context window to record (via iterative feature discovery and evaluation steps) which features did or did not work. This is similar to a few-shot prompt and guides the LLM to invent better, more innovative features. Our work offers evidence that AI agents are remarkably effective in solving problems that require highly inventive solutions, such as feature engineering.
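Title / abstract (translated from Chinese)
Title: FAMOSE: A ReAct Approach to Automated Feature Discovery
Feature engineering remains a critical yet challenging bottleneck in machine learning, especially for tabular data, where identifying optimal features in an exponentially large feature space usually requires substantial domain expertise. To address this challenge, we introduce FAMOSE (Feature AugMentation and Optimal Selection agEnt), a novel framework that uses the ReAct paradigm to explore, generate, and refine features autonomously while integrating feature selection and evaluation tools within an agent architecture. To our knowledge, FAMOSE is the first attempt to apply an agentic ReAct framework to automated feature engineering, and it is suited to both regression and classification tasks. Extensive experiments show that FAMOSE is at or near the state of the art on classification tasks (especially tasks with more than 10,000 instances, where ROC-AUC improves by 0.23% on average) and reaches the state of the art on regression tasks by reducing RMSE by 2.0% on average, while being more robust than other algorithms. We hypothesize that FAMOSE performs well because ReAct lets the LLM context window record, through iterative feature discovery and evaluation steps, which features worked and which did not. This resembles a few-shot prompt and guides the LLM to invent better, more innovative features. Our work provides evidence that AI agents are highly effective at problems that require very inventive solutions, such as feature engineering.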
Summary
FAMOSE is a framework that uses the ReAct paradigm to automate feature engineering for tabular data. It explores, generates, and refines features while integrating feature selection and evaluation within an agent architecture. Experiments show that FAMOSE performs at or near the state-of-the-art on classification tasks, particularly for large datasets, and achieves state-of-the-art results for regression tasks by reducing RMSE by 2.0% on average, while being more robust to errors than other algorithms.
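FAMOSE is a new framework that uses the ReAct paradigm to automate feature discovery and selection, particularly for tabular data. It uses an agent architecture to explore, generate, and refine features, and integrates feature selection and evaluation tools. Experiments show that FAMOSE is at or near the state of the art on classification tasks (especially large datasets) and reaches the state of the art on regression tasks by reducing RMSE by 2.0%, while being more robust than other algorithms.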
IntRec: Intent-based Retrieval with Contrastive Refinement
Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
First: 2026-02-19T18:50:53+00:00 · Latest: 2026-02-19T18:50:53+00:00
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
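Title / abstract (translated from Chinese)
Title: IntRec: Intent-based Retrieval with Contrastive Refinement
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve several similar objects. Existing open-vocabulary detectors run in a one-shot manner and cannot refine their predictions from user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets of positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to the positive cues while penalizing rejected ones, which enables fine-grained disambiguation in cluttered scenes. Our interactive framework improves retrieval accuracy substantially without additional supervision. On LVIS, IntRec reaches 35.4 AP, exceeding OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 respectively. On the challenging LVIS-Ambiguous benchmark, it improves by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.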
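Summary
IntRec is an interactive object retrieval framework that refines predictions based on user feedback, addressing the challenge of ambiguous queries in complex scenes. It uses an Intent State that maintains positive anchors and negative constraints, and a contrastive alignment function to rank candidates. On LVIS, IntRec reaches 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 AP respectively, and on the LVIS-Ambiguous benchmark it improves by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
IntRec is an interactive object retrieval framework driven by user feedback; it refines predictions by maintaining positive anchors and negative constraints, addressing ambiguous queries in complex scenes. It uses a contrastive alignment function to rank candidate objects. IntRec reaches 35.4 AP on LVIS, outperforming existing methods, and gains 7.9 AP on the LVIS-Ambiguous benchmark after a single round of feedback.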
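The IntRec abstract describes ranking candidates by similarity to confirmed cues while penalizing rejected hypotheses. A minimal sketch of such a score is given below. The exact form of the paper's contrastive alignment function is not given in the abstract, so the "best positive match minus weighted best negative match" rule, the cosine similarity, the `penalty` weight, and the toy embeddings are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intent_score(candidate, positives, negatives, penalty=1.0):
    """Similarity to the closest positive anchor minus a penalty for the
    closest negative constraint (hypothetical scoring rule)."""
    pos = max((cosine(candidate, p) for p in positives), default=0.0)
    neg = max((cosine(candidate, n) for n in negatives), default=0.0)
    return pos - penalty * neg


def rank_candidates(candidates, positives, negatives):
    """Candidate names sorted from best to worst intent score."""
    return sorted(
        candidates,
        key=lambda name: intent_score(candidates[name], positives, negatives),
        reverse=True,
    )


# Two similar objects; the query embedding slightly favours the right one.
candidates = {"mug_left": [1.0, 0.0], "mug_right": [0.0, 1.0]}
positives = [[0.6, 0.8]]          # cue derived from the user's query
rejected = [[0.0, 1.0]]           # user rejects the right-hand mug
```

Before feedback the right-hand mug ranks first; once it is added to the negative set, the ranking flips, which is the single-correction behaviour the abstract highlights.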
CORAL: Correspondence Alignment for Improved Virtual Try-On
Authors: Jiyoung Kim, Youngjin Shin, Siyoon Jin, Dahyun Chung, Jisu Nam, Tongmin Kim, Jongjae Park, Hyeonwoo Kang, Seungryong Kim
First: 2026-02-19T18:50:12+00:00 · Latest: 2026-02-19T18:50:12+00:00
Comments: 32 pages, 25 figures
Abstract
Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
中文标题/摘要
标题:CORAL: 对应对齐以改善虚拟试穿
现有的虚拟试穿(VTON)方法往往难以保留细部服装细节,尤其是在需要准确的人-服装对应关系的非配对设置中。这些方法没有明确地强制执行人-服装对齐,并且无法解释对应关系如何在扩散变换器(DiTs)中出现。在本文中,我们首先分析了基于DiT架构的全3D注意力,并揭示出人-服装对应关系的关键依赖于全3D注意力中精确的人-服装查询-键匹配。基于这一洞察,我们随后引入了CORrespondence ALignment(CORAL),这是一种基于DiT的框架,明确地将查询-键匹配与稳健的外部对应关系对齐。CORAL结合了两个互补的组件:一个对应关系蒸馏损失,将可靠的匹配与人-服装注意力对齐,以及一个熵最小化损失,使注意力分布更加清晰。我们还提出了一种基于VLM的评估协议,以更好地反映人类偏好。CORAL在基准之上始终表现出改进,增强了全局形状转移和局部细节保留。广泛的消融实验验证了我们的设计选择。
Summary / 总结
This paper addresses the challenge of preserving fine garment details in Virtual Try-On (VTON) by introducing CORAL, a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. The method combines a correspondence distillation loss, which aligns reliable matches with person-garment attention, and an entropy minimization loss, which sharpens the attention distribution. Experimental results show that CORAL improves both global shape transfer and local detail preservation over the baseline.
本文通过引入CORAL框架,该框架基于DiT并明确对齐查询-键匹配与稳健的外部对应关系,来解决虚拟试穿(VTON)中精细服装细节保留的问题。CORAL包含一个对应关系蒸馏损失和一个熵最小化损失以增强注意力分布。该方法在全局形状转移和局部细节保留方面均优于基线,在大量实验中得到了验证。
When to Trust the Cheap Check: Weak and Strong Verification for Reasoning
Authors: Shayan Kiyani, Sima Noorani, George Pappas, Hamed Hassani
First: 2026-02-19T18:47:38+00:00 · Latest: 2026-02-19T18:47:38+00:00
Abstract
Reasoning with LLMs increasingly unfolds inside a broader verification loop. Internally, systems use cheap checks, such as self-consistency or proxy rewards, which we call weak verification. Externally, users inspect outputs and steer the model through feedback until results are trustworthy, which we call strong verification. These signals differ sharply in cost and reliability: strong verification can establish trust but is resource-intensive, while weak verification is fast and scalable but noisy and imperfect. We formalize this tension through weak--strong verification policies, which decide when to accept or reject based on weak verification and when to defer to strong verification. We introduce metrics capturing incorrect acceptance, incorrect rejection, and strong-verification frequency. Over population, we show that optimal policies admit a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. Building on this, we develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
中文标题/摘要
标题:何时信任廉价检查:推理中的弱验证与强验证
使用大语言模型(LLMs)的推理越来越多地嵌入到更广泛的验证循环中。内部,系统使用廉价检查,如自一致性或代理奖励,我们称之为弱验证。外部,用户检查输出并通过反馈引导模型直到结果可信,我们称之为强验证。这些信号在成本和可靠性上存在显著差异:强验证可以建立信任但资源密集,而弱验证快速且可扩展但噪声大且不完美。我们通过弱-强验证策略形式化这种张力,这些策略根据弱验证决定何时接受或拒绝,并在需要时将决策委托给强验证。我们引入了衡量误接受、误拒绝和强验证频率的指标。在总体上,我们展示了最优策略具有两阈值结构,并且校准和锐度决定了弱验证器的价值。在此基础上,我们开发了一种在线算法,该算法在无需假设查询流、语言模型或弱验证器的情况下,能够证明控制接受和拒绝错误。
Summary / 总结
The paper studies the trade-off between weak and strong verification when reasoning with large language models (LLMs). Weak verification, such as self-consistency checks or proxy rewards, is fast and scalable but noisy, while strong verification, in which users inspect outputs and give feedback, is costly but can establish trust. The authors formalize this through weak-strong verification policies that accept, reject, or defer to strong verification, and introduce metrics for incorrect acceptance, incorrect rejection, and strong-verification frequency. They show that optimal policies have a two-threshold structure and that calibration and sharpness govern the value of weak verifiers. They also develop an online algorithm that provably controls acceptance and rejection errors without assumptions on the query stream, the language model, or the weak verifier.
论文探讨了在使用大型语言模型(LLM)进行推理时,弱验证和强验证之间的权衡。弱验证,如自我一致性检查,速度快且可扩展,但不准确;而强验证,涉及用户反馈,虽然成本高但能建立信任。作者通过弱强验证策略形式化了这一权衡,并引入了评估这些策略的指标。他们表明,最优策略具有两阈值结构,并且校准和锐度对于弱验证器的有效性至关重要。开发了一个在线算法,无需假设查询流、语言模型或弱验证器即可控制接受和拒绝错误。
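The two-threshold structure described in the abstract above can be illustrated with a minimal sketch. It assumes the weak verifier emits a scalar score where higher means "more likely correct"; the names `two_threshold_policy`, `summarize`, `lower`, and `upper` are hypothetical and are not taken from the paper, and the thresholds here are fixed by hand rather than learned by the paper's online algorithm.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    DEFER = "defer"  # escalate to strong verification


def two_threshold_policy(weak_score, lower, upper):
    """Accept at or above `upper`, reject at or below `lower`, defer in between."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if weak_score >= upper:
        return Decision.ACCEPT
    if weak_score <= lower:
        return Decision.REJECT
    return Decision.DEFER


def summarize(scores, labels, lower, upper):
    """Empirical rates of the three quantities named in the abstract.

    `labels[i]` is True when response i is actually correct.
    """
    n = len(scores)
    decisions = [two_threshold_policy(s, lower, upper) for s in scores]
    wrong_accept = sum(d is Decision.ACCEPT and not y for d, y in zip(decisions, labels))
    wrong_reject = sum(d is Decision.REJECT and y for d, y in zip(decisions, labels))
    deferred = sum(d is Decision.DEFER for d in decisions)
    return {
        "incorrect_acceptance": wrong_accept / n,
        "incorrect_rejection": wrong_reject / n,
        "strong_verification_frequency": deferred / n,
    }
```

Widening the gap between `lower` and `upper` trades fewer weak-verifier mistakes for more calls to strong verification, which is the tension the abstract describes.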
SMAC: Score-Matched Actor-Critics for Robust Offline-to-Online Transfer
Authors: Nathan S. de Lara, Florian Shkurti
First: 2026-02-19T18:47:31+00:00 · Latest: 2026-02-19T18:47:31+00:00
Abstract
Modern offline Reinforcement Learning (RL) methods find performant actor-critics; however, fine-tuning these actor-critics online with value-based RL algorithms typically causes immediate drops in performance. We provide evidence consistent with the hypothesis that, in the loss landscape, offline maxima for prior algorithms and online maxima are separated by low-performance valleys that gradient-based fine-tuning traverses. Following this, we present Score Matched Actor-Critic (SMAC), an offline RL method designed to learn actor-critics that transition to online value-based RL algorithms with no drop in performance. SMAC avoids valleys between offline and online maxima by regularizing the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and action-gradient of the Q-function. We experimentally demonstrate that SMAC converges to offline maxima that are connected to better online maxima via paths with monotonically increasing reward found by first-order optimization. SMAC achieves smooth transfer to Soft Actor-Critic and TD3 in 6/6 D4RL tasks. In 4/6 environments, it reduces regret by 34-58% over the best baseline.
中文标题/摘要
标题:SMAC:分数匹配的演员-评论家算法以实现稳健的离线到在线转移
现代离线强化学习(RL)方法能够找到表现良好的演员-评论家,然而,使用基于值的RL算法在线微调这些演员-评论家通常会导致性能立即下降。我们提供了证据支持假设,在损失景观中,先前算法的离线最大值和在线最大值之间被低性能的山谷隔开,基于梯度的微调会穿越这些山谷。基于此,我们提出了分数匹配的演员-评论家(SMAC),这是一种离线RL方法,旨在学习在不降低性能的情况下过渡到在线基于值的RL算法的演员-评论家。SMAC通过在离线阶段正则化Q函数,使其遵守策略得分与Q函数动作梯度的一阶导数相等,从而避免了离线和在线最大值之间的低谷。我们通过一阶优化找到的奖励单调增加的路径实验性地证明了SMAC能够平稳地过渡到Soft Actor-Critic和TD3在6/6个D4RL任务中。在4/6个环境中,它将后悔减少34-58%超过最佳基线。
Summary / 总结
The research addresses the performance drop that occurs when offline-trained actor-critics are fine-tuned online with value-based RL algorithms. The method, Score Matched Actor-Critic (SMAC), regularizes the Q-function during the offline phase to respect a first-order derivative equality between the score of the policy and the action-gradient of the Q-function, so that the offline maxima it reaches are connected to better online maxima without low-performance valleys in between. Experiments show that SMAC transfers smoothly to Soft Actor-Critic and TD3 in 6 out of 6 D4RL tasks and, in 4 out of 6 environments, reduces regret by 34-58% relative to the best baseline.
研究旨在解决使用基于值的RL算法在线微调离线训练的actor-critic时性能下降的问题。方法Score Matched Actor-Critic (SMAC) 在离线阶段通过正则化Q函数来确保离线和在线最大值之间的连接,避免低性能的山谷。实验表明,SMAC 在6个D4RL任务中实现了平滑的转移,并且在4个环境中将后悔率降低了34-58%,优于最佳基线。
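One way to read the "first-order derivative equality" between the policy score and the action-gradient of the Q-function is through a maximum-entropy (Boltzmann) policy. The derivation below is an illustrative assumption: the abstract does not state the exact regularizer, and the temperature \alpha and the penalty \mathcal{R}(Q) are hypothetical.

```latex
% Assumption: a Boltzmann policy with temperature \alpha > 0.
\pi(a \mid s) = \frac{\exp\big(Q(s,a)/\alpha\big)}{Z(s)},
\qquad
Z(s) = \int \exp\big(Q(s,a')/\alpha\big)\, da'

% Z(s) does not depend on a, so taking the action-gradient of \log \pi gives
\nabla_a \log \pi(a \mid s) = \frac{1}{\alpha}\, \nabla_a Q(s,a)

% A hypothetical offline penalty that encourages this equality on dataset \mathcal{D}:
\mathcal{R}(Q) = \mathbb{E}_{(s,a) \sim \mathcal{D}}
\left\| \nabla_a Q(s,a) - \alpha\, \nabla_a \log \pi(a \mid s) \right\|_2^2
```

Under this reading, a critic whose action-gradient already matches the policy score is consistent with the soft policy that an online method such as Soft Actor-Critic would derive from it, which is one plausible intuition for the smooth transfer the abstract reports.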
Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
Authors: Obaidullah Zaland, Zulfiqar Ahmad Khan, Monowar Bhuyan
First: 2026-02-19T18:44:23+00:00 · Latest: 2026-02-19T18:44:23+00:00
Comments: Accepted for publication in the IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Modern big-data systems generate massive, heterogeneous, and geographically dispersed streams that are large-scale and privacy-sensitive, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning with incremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. OSI-FL communicates category-specific embeddings, devised by a frozen vision-language model (VLM) from each client in a single communication round, which a pre-trained diffusion model at the server uses to synthesize new data similar to the client's data distribution. The synthesized samples are used on the server for training. However, two challenges still persist: i) tasks arriving incrementally need to retrain the global model, and ii) as future tasks arrive, retraining the model introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category and task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in further iterations. The experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
中文标题/摘要
标题:具有灾难性遗忘鲁棒的一次性增量联邦学习
现代大数据系统生成大量、异构且地理上分散的流数据,规模庞大且涉及隐私,使得集中化变得困难。虽然联邦学习(FL)提供了一种增强隐私的训练机制,但它假设静态的数据流,并在多轮中学习协作模型,这使得在通信受限场景中处理增量数据的学习变得具有挑战性。本文提出了一次性增量联邦学习(OSI-FL),这是第一个解决通信开销和灾难性遗忘双重挑战的FL框架。OSI-FL通过在单个通信轮次中由每个客户端的冻结视觉-语言模型(VLM)生成类别特定的嵌入,然后由服务器端的预训练扩散模型合成与客户端数据分布相似的新数据样本,这些合成样本在服务器端用于训练。然而,仍存在两个挑战:i) 任务以增量方式到达需要重新训练全局模型,ii) 随着未来任务的到达,重新训练模型会导致灾难性遗忘。为此,我们通过选择性样本保留(SSR)增强训练,该方法基于样本损失识别并保留每个类别和任务对中最信息丰富的top-p个样本。SSR通过确保代表性保留样本在后续迭代中被纳入训练来限制遗忘。实验结果表明,OSI-FL在三个基准数据集上的类增量和领域增量场景中均优于基线方法,包括传统的和一次性FL方法。
Summary / 总结
This paper addresses the challenges of communication overhead and catastrophic forgetting in federated learning with incremental data. It introduces One-Shot Incremental Federated Learning (OSI-FL), in which each client sends category-specific embeddings from a frozen vision-language model in a single communication round, and a pre-trained diffusion model at the server uses them to synthesize training data resembling the client's distribution. To mitigate catastrophic forgetting, the authors propose Selective Sample Retention (SSR), which retains the top-p most informative samples per category and task pair based on sample loss and reuses them in later training iterations. Experiments show that OSI-FL outperforms traditional and one-shot FL approaches in both class-incremental and domain-incremental scenarios across three benchmark datasets.
本文解决了增量数据下联邦学习中的通信开销和灾难性遗忘问题。提出了增量联邦学习(OSI-FL),该方法从客户端向服务器发送类别特定的嵌入,服务器利用这些嵌入生成新的数据样本进行训练。为了缓解灾难性遗忘,作者提出了选择性样本保留(SSR)方法,该方法根据样本损失保留每个类别和任务对中最信息丰富的样本。实验结果表明,OSI-FL 在三个基准数据集的类增量和域增量场景中均优于传统和单次联邦学习方法。
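The Selective Sample Retention step described above can be sketched as a per-group ranking by loss. This is a minimal illustration under two assumptions that the abstract does not confirm: that "most informative" means highest loss, and that p is a retained fraction in (0, 1]. The function name and the dictionary keys are hypothetical.

```python
import math
from collections import defaultdict


def selective_sample_retention(samples, p):
    """Keep the highest-loss fraction `p` of samples in each (category, task) group.

    `samples` is an iterable of dicts with keys 'category', 'task', and 'loss'.
    At least one sample is kept per group.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")

    # Group samples by their (category, task) pair.
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["category"], sample["task"])].append(sample)

    # Rank each group by loss (descending) and keep the top slice.
    retained = {}
    for key, group in groups.items():
        k = max(1, math.ceil(p * len(group)))
        ranked = sorted(group, key=lambda s: s["loss"], reverse=True)
        retained[key] = ranked[:k]
    return retained
```

The retained samples would then be mixed into training when later tasks arrive, which is how the abstract describes forgetting being bounded.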
Unmasking the Factual-Conceptual Gap in Persian Language Models
Authors: Alireza Sakhaeirad, Ali Ma'manpoosh, Arshia Hemmat
First: 2026-02-19T18:42:46+00:00 · Latest: 2026-02-19T18:42:46+00:00
Abstract
While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
中文标题/摘要
标题:揭示波斯语言模型的事实-概念差距
虽然新兴的波斯自然语言处理基准已经扩展到语用学和礼貌性,但它们很少区分记忆的文化事实和推理关于隐含社会规范的能力。我们引入了DivanBench,这是一个专注于迷信和习俗的诊断基准,这些是任意的、依赖于上下文的规则,难以通过简单的逻辑推理来解决。通过涵盖315个问题的三种任务类型(事实检索、配对场景验证和情境推理),我们评估了七种波斯语言模型,并揭示了三个关键失败:大多数模型表现出严重的顺从偏差,能够识别适当的行为但无法拒绝明显违反的行为;连续的波斯预训练反而放大了这种偏差,而不是提高推理能力,经常削弱模型区分矛盾的能力;所有模型在检索事实知识和将其应用于场景之间表现出21%的性能差距。这些发现表明,文化能力不仅需要扩展单一语言的数据,当前的模型学会模仿文化模式而不内化其背后的架构模式。
Summary / 总结
The research examines the gap between factual and conceptual understanding in Persian language models by introducing DivanBench, a diagnostic benchmark focused on superstitions and customs. Seven Persian LLMs were evaluated on 315 questions across three task types, revealing that most models exhibit severe acquiescence bias, that continuous Persian pretraining amplifies this bias rather than improving reasoning, and that all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. This indicates that current models mimic cultural patterns without internalizing the underlying schemas needed to reason about implicit social norms.
研究旨在通过引入DivanBench基准,关注迷信和习俗,解决波斯语模型在事实与概念理解之间的差距。通过对315个问题的7个波斯语言模型进行评估,发现大多数模型存在顺从偏差,持续预训练并未提升推理能力,且在事实知识检索与场景应用之间存在21%的性能差距。这表明当前模型缺乏将文化模式内化并超越简单记忆进行隐含社会规范推理的能力。
Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
Authors: Luke Huang, Zhuoyang Zhang, Qinghao Hu, Shang Yang, Song Han
First: 2026-02-19T18:40:51+00:00 · Latest: 2026-02-19T18:40:51+00:00
Abstract
Reinforcement learning (RL) is widely used to improve large language models on reasoning tasks, and asynchronous RL training is attractive because it increases end-to-end throughput. However, for widely adopted critic-free policy-gradient methods such as REINFORCE and GRPO, high asynchrony makes the policy-gradient estimator markedly higher variance: training on stale rollouts creates heavy-tailed importance ratios, causing a small fraction of samples to dominate updates. This amplification makes gradients noisy and learning unstable relative to matched on-policy training. Across math and general reasoning benchmarks, we find collapse is reliably predicted by effective sample size (ESS) and unstable gradient norms. Motivated by this diagnosis, we propose Variance Controlled Policy Optimization (VCPO), a general stabilization method for REINFORCE/GRPO-style algorithms that (i) scales learning rate based on effective sample size to dampen unreliable updates, and (ii) applies a closed-form minimum-variance baseline for the off-policy setting, avoiding an auxiliary value model and adding minimal overhead. Empirically, VCPO substantially improves robustness for asynchronous training across math, general reasoning, and tool-use tasks, outperforming a broad suite of baselines spanning masking/clipping stabilizers and algorithmic variants. This reduces long-context, multi-turn training time by 2.5x while matching synchronous performance, demonstrating that explicit control of policy-gradient variance is key for reliable asynchronous RL at scale.
Summary / 总结
The paper addresses the issue of high variance in asynchronous reinforcement learning (RL) training for large language models (LLMs), particularly for critic-free methods like REINFORCE and GRPO. It proposes VCPO, a variance-controlled policy optimization method that scales learning rates based on effective sample size and uses a closed-form minimum-variance baseline to stabilize updates. Experiments show that VCPO significantly improves robustness in asynchronous training across various reasoning tasks, reducing training time by 2.5 times while matching synchronous performance.
论文解决了异步强化学习(RL)训练大型语言模型(LLM)时,政策梯度估计器高方差导致学习不稳定的问题。提出了一种名为VCPO的方法,该方法基于有效样本大小调整学习率,并应用闭式最小方差基线以减少方差。实验表明,VCPO在各种推理任务中的异步训练中提高了鲁棒性,优于其他稳定器,并且与同步性能相当,从而将长上下文、多轮训练时间减少了2.5倍。
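The role of effective sample size in the abstract above can be made concrete with a small sketch. The ESS formula below is the standard Kish estimate for importance weights; the linear learning-rate rule `base_lr * ESS / n` is only an assumed stand-in, since the abstract says the learning rate is scaled "based on effective sample size" without giving the exact rule. All function names are hypothetical.

```python
import math


def importance_ratios(logp_current, logp_behavior):
    """Per-sample ratios pi_current / pi_behavior from log-probabilities."""
    return [math.exp(c - b) for c, b in zip(logp_current, logp_behavior)]


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2).

    Equals n for uniform weights and approaches 1 when one weight dominates.
    """
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    if total_sq == 0.0:
        raise ValueError("weights must not all be zero")
    return total * total / total_sq


def ess_scaled_learning_rate(base_lr, weights):
    """Hypothetical rule: shrink the step in proportion to ESS / n."""
    return base_lr * effective_sample_size(weights) / len(weights)
```

With heavy-tailed ratios from stale rollouts, a few large weights collapse the ESS, so a rule of this kind takes a smaller step exactly when the batch is least trustworthy.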
Contrastive Diffusion Alignment: Learning Structured Latents for Controllable Generation
Authors: Ruchi Sandilya, Sumaira Perez, Charles Lynch, Lindsay Victoria, Benjamin Zebley, Derrick Matthew Buchanan, Mahendra T. Bhati, Nolan Williams, Timothy J. Spellman, Faith M. Gunning, Conor Liston, Logan Grosenick
First: 2025-10-16T00:48:05+00:00 · Latest: 2026-02-19T18:33:22+00:00
Abstract
Diffusion models excel at generation, but their latent spaces are high dimensional and not explicitly organized for interpretation or control. We introduce ConDA (Contrastive Diffusion Alignment), a plug-and-play geometry layer that applies contrastive learning to pretrained diffusion latents using auxiliary variables (e.g., time, stimulation parameters, facial action units). ConDA learns a low-dimensional embedding whose directions align with underlying dynamical factors, consistent with recent contrastive learning results on structured and disentangled representations. In this embedding, simple nonlinear trajectories support smooth interpolation, extrapolation, and counterfactual editing while rendering remains in the original diffusion space. ConDA separates editing and rendering by lifting embedding trajectories back to diffusion latents with a neighborhood-preserving kNN decoder and is robust across inversion solvers. Across fluid dynamics, neural calcium imaging, therapeutic neurostimulation, facial expression dynamics, and monkey motor cortex activity, ConDA yields more interpretable and controllable latent structure than linear traversals and conditioning-based baselines, indicating that diffusion latents encode dynamics-relevant structure that can be exploited by an explicit contrastive geometry layer.
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Venue: NeurIPS 2025
First: 2025-05-05T17:47:42+00:00 · Latest: 2026-02-19T18:32:53+00:00
Comments: This work was accepted and presented at NeurIPS 2025. Code is available at https://github.com/mts-ai/replaceme Reviews at OpenReview: https://openreview.net/forum?id=zEj1FSYCRn NeurIPS 2025 Proceedings: https://openreview.net/pdf?id=zEj1FSYCRn
Abstract
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
中文标题/摘要
标题:ReplaceMe:通过深度剪枝和Transformer块线性化简化网络
我们引入了ReplaceMe,这是一种通用的无需训练的深度剪枝方法,能够有效用线性操作替换Transformer块,同时在低压缩比下保持高性能。与需要额外训练或微调的传统剪枝方法不同,我们的方法仅需一个小规模的校准数据集来估计线性变换,该变换近似于剪枝后的块。估计出的线性映射可以无缝地与剩余的Transformer块合并,无需任何额外的网络参数。我们的实验表明,ReplaceMe在所有无需训练的方法中表现最佳,并且在涉及大量重新训练/微调和架构修改的最新剪枝方法中保持了高度竞争力。应用于多个大型语言模型(LLMs),ReplaceMe在开放基准测试中实现了高达25%的剪枝,同时保留了原始模型约90%的性能,无需任何训练或修复步骤,从而减少了计算开销。我们提供了一个开源库,实现了ReplaceMe以及几种最先进的深度剪枝技术,可在https://github.com/mts-ai/ReplaceMe 获取。
Summary / 总结
ReplaceMe is a training-free depth pruning method that replaces transformer blocks with a linear operation while maintaining high performance at low compression ratios. Unlike conventional pruning methods that require additional training or fine-tuning, ReplaceMe uses a small calibration dataset to estimate a linear transformation that approximates the pruned blocks and can be merged into the remaining blocks without adding parameters. Experiments show ReplaceMe outperforms other training-free approaches and remains competitive with state-of-the-art pruning methods that involve extensive retraining. Applied to large language models, ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, with minimal computational overhead and no training or healing steps.
ReplaceMe 是一种无需训练的深度剪枝方法,通过将变压器块替换为线性操作来保持高性能和低压缩比。与需要额外训练的常规剪枝方法不同,ReplaceMe 使用一个小的校准数据集来估计一个线性变换,该变换近似于被剪枝的块。实验表明,ReplaceMe 在无需训练或修复步骤的情况下,优于其他无需训练的剪枝方法,并且与涉及大量重新训练/微调和架构修改的最新剪枝方法保持竞争力。应用于大型语言模型时,ReplaceMe 可以实现高达 25% 的剪枝,同时保留原始模型约 90% 的性能,并且具有最小的计算开销。
Towards Anytime-Valid Statistical Watermarking
Authors: Baihe Huang, Eric Xu, Kannan Ramchandran, Jiantao Jiao, Michael I. Jordan
First: 2026-02-19T18:32:26+00:00 · Latest: 2026-02-19T18:32:26+00:00
Abstract
The proliferation of Large Language Models (LLMs) necessitates efficient mechanisms to distinguish machine-generated content from human text. While statistical watermarking has emerged as a promising solution, existing methods suffer from two critical limitations: the lack of a principled approach for selecting sampling distributions and the reliance on fixed-horizon hypothesis testing, which precludes valid early stopping. In this paper, we bridge this gap by developing the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Unlike traditional approaches where optional stopping invalidates Type-I error guarantees, our framework enables valid, anytime-inference by constructing a test supermartingale for the detection process. By leveraging an anchor distribution to approximate the target model, we characterize the optimal e-value with respect to the worst-case log-growth rate and derive the optimal expected stopping time. Our theoretical claims are substantiated by simulations and evaluations on established benchmarks, showing that our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
中文标题/摘要
标题:迈向任意时点有效的统计水印
大型语言模型(LLMs)的普及需要有效的机制来区分机器生成的内容和人类文本。虽然统计水印已经作为一种有前景的解决方案出现,但现有方法存在两个关键限制:缺乏选择抽样分布的原理性方法以及依赖固定时间窗假设检验,这限制了早期停止的有效性。在本文中,我们通过开发第一个基于e值的水印框架——锚定e水印,填补了这一空白,该框架将最优采样与任意时点有效推理统一起来。与传统方法不同,传统方法中的可选停止会破坏第一类错误保证,而我们的框架通过为检测过程构建测试超鞅,实现了有效的任意时点推理。通过利用锚定分布来近似目标模型,我们以最坏情况下的对数增长率为基准来表征最优e值,并推导出最优的预期停止时间。我们的理论主张通过模拟和在现有基准上的评估得到了证实,表明我们的框架可以显著提高样本效率,与最先进的基线相比,检测所需的平均令牌预算减少了13-15%。
Summary / 总结
This paper addresses the challenge of distinguishing machine-generated content from human text using statistical watermarking. It introduces Anchored E-Watermarking, a novel framework that combines optimal sampling with anytime-valid inference, overcoming the limitations of existing methods. The framework uses an anchor distribution to approximate the target model and constructs a test supermartingale to enable valid early stopping. Experimental results demonstrate that this approach can reduce the average token budget required for detection by 13-15% compared to state-of-the-art methods.
本文旨在通过统计水印技术区分机器生成的内容和人类文本。它引入了锚定E水印框架,该框架结合了最优采样和随时有效的推断,克服了现有方法的局限性。该框架利用锚定分布来近似目标模型,并构建了一个测试超鞅以实现有效的早期停止。模拟和基准评估表明,这种方法可以将检测所需的平均令牌预算减少13-15%,相比最先进的方法。
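The anytime-valid stopping idea in the abstract above rests on a standard construction: multiply per-step e-values into a running "wealth" and reject as soon as it reaches 1/alpha. Under the null hypothesis (unwatermarked text) the wealth is a test supermartingale, so Ville's inequality bounds the false-alarm probability by alpha at any stopping time. The sketch below shows only this generic stopping rule; how e-values are computed from tokens, and the paper's anchored construction, are not reproduced here, and the function name is hypothetical.

```python
import math


def anytime_valid_detect(e_values, alpha=0.01):
    """Return the 1-based step at which the watermark is declared, or None.

    `e_values` is an iterable of nonnegative per-token e-values, each with
    expectation at most 1 under the null hypothesis.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")

    threshold = math.log(1.0 / alpha)
    log_wealth = 0.0  # log of the running product of e-values
    for step, e in enumerate(e_values, start=1):
        if e < 0.0:
            raise ValueError("e-values must be nonnegative")
        if e == 0.0:
            return None  # wealth hits zero and can never recover
        log_wealth += math.log(e)
        if log_wealth >= threshold:
            return step  # stopping early keeps the Type-I error guarantee
    return None
```

Because the guarantee holds at every step, the detector can stop as soon as the evidence is sufficient, which is the source of the token-budget savings the abstract reports.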
AutoNumerics: An Autonomous, PDE-Agnostic Multi-Agent Pipeline for Scientific Computing
Authors: Jianda Du, Youran Sun, Haizhao Yang
First: 2026-02-19T18:31:52+00:00 · Latest: 2026-02-19T18:31:52+00:00
Abstract
PDEs are central to scientific and engineering modeling, yet designing accurate numerical solvers typically requires substantial mathematical expertise and manual tuning. Recent neural network-based approaches improve flexibility but often demand high computational cost and suffer from limited interpretability. We introduce AutoNumerics, a multi-agent framework that autonomously designs, implements, debugs, and verifies numerical solvers for general PDEs directly from natural language descriptions. Unlike black-box neural solvers, our framework generates transparent solvers grounded in classical numerical analysis. We introduce a coarse-to-fine execution strategy and a residual-based self-verification mechanism. Experiments on 24 canonical and real-world PDE problems demonstrate that AutoNumerics achieves competitive or superior accuracy compared to existing neural and LLM-based baselines, and correctly selects numerical schemes based on PDE structural properties, suggesting its viability as an accessible paradigm for automated PDE solving.
中文标题/摘要
标题:AutoNumerics:一种自主的、与偏微分方程无关的多智能体科学计算管道
偏微分方程(PDEs)在科学和工程建模中至关重要,但设计精确的数值求解器通常需要大量的数学专业知识和手动调整。最近基于神经网络的方法提高了灵活性,但往往需要高计算成本,并且缺乏可解释性。我们介绍了AutoNumerics,这是一种多智能体框架,可以从自然语言描述中自主设计、实现、调试和验证适用于通用PDEs的数值求解器。与黑盒神经求解器不同,我们的框架生成基于经典数值分析的透明求解器。我们引入了一种粗到细的执行策略和基于残差的自我验证机制。在24个经典和实际PDE问题上的实验表明,AutoNumerics在与现有神经网络和基于LLM的基线相比时,实现了竞争力或更优的准确性,并根据PDE结构特性正确选择了数值方案,这表明它作为一种易于访问的自动化PDE求解范式的可行性。
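To give a flavor of what a coarse-to-fine run with residual-based checking might look like, here is a minimal sketch on a 1D Poisson problem, -u'' = f on (0, 1) with zero boundary values. It is an assumption-laden illustration, not the AutoNumerics pipeline: the solver, the grid sizes, the tolerance, and every function name are hypothetical, and the residual checked is that of the discretized equation.

```python
import math


def solve_poisson_1d(f, n):
    """Second-order finite differences with n interior points, solved by the Thomas algorithm."""
    h = 1.0 / (n + 1)
    xs = [(i + 1) * h for i in range(n)]
    diag = [2.0] * n                      # main diagonal; off-diagonals are -1
    rhs = [h * h * f(x) for x in xs]
    for i in range(1, n):                 # forward elimination
        m = -1.0 / diag[i - 1]
        diag[i] -= m * (-1.0)
        rhs[i] -= m * rhs[i - 1]
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        u[i] = (rhs[i] + u[i + 1]) / diag[i]
    return xs, u


def max_residual(f, xs, u):
    """Largest violation of the discrete equation (-u[i-1] + 2u[i] - u[i+1]) / h^2 = f(x_i)."""
    n = len(u)
    h = 1.0 / (n + 1)
    padded = [0.0] + list(u) + [0.0]      # include the zero boundary values
    return max(
        abs((-padded[i] + 2.0 * padded[i + 1] - padded[i + 2]) / (h * h) - f(xs[i]))
        for i in range(n)
    )


def coarse_to_fine(f, grids=(15, 31, 63), tol=1e-8):
    """Run cheap coarse grids first; refine only while the residual check passes."""
    for n in grids:
        xs, u = solve_poisson_1d(f, n)
        residual = max_residual(f, xs, u)
        if residual > tol:
            raise RuntimeError(f"residual {residual:.2e} exceeds tolerance on n={n}")
    return xs, u
```

Running on the coarse grid first means an implementation bug is caught cheaply before any expensive fine-grid solve, which is one plausible motivation for pairing the two mechanisms.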
Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery
Authors: Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly
First: 2026-02-19T18:30:18+00:00 · Latest: 2026-02-19T18:30:18+00:00
Abstract
In many real-world settings, such as environmental monitoring, disaster response, or public health, with costly and difficult data collection and dynamic environments, strategically sampling from unobserved regions is essential for efficiently uncovering hidden targets under tight resource constraints. Yet, sparse and biased geospatial ground truth limits the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, which captures how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance based on readily-available domain-specific concepts (e.g., land cover, source proximity); and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. Our experiments include testing on a real-world dataset of cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, showcasing our method's reliability at uncovering targets with limited data and a varying environment.
中文标题/摘要
标题:动态适应:基于相关性的在线元学习与潜在概念引导的空间发现
在环境监测、灾害响应或公共卫生等许多现实场景中,由于数据收集成本高且环境动态变化,从未观察区域有选择地采样对于在资源受限的情况下高效发现隐藏目标至关重要。然而,稀疏且有偏的空间真实情况限制了现有基于学习的方法(如强化学习)的应用。为解决这一问题,我们提出了一种统一的空间发现框架,该框架结合了主动学习、在线元学习和概念引导推理。我们的方法引入了两个关键创新,基于*概念相关性*这一共同概念:一种*概念加权不确定性采样策略*,其中不确定性根据基于现成领域特定概念(如土地覆盖、源距离)学习到的相关性进行调整;以及一种*相关性感知的元批次形成策略*,该策略在在线元更新过程中促进语义多样性,从而在动态环境中提高泛化能力。我们的实验包括在真实世界数据集(含致癌的PFAS(全氟和多氟烷基物质)污染)上的测试,展示了在有限数据和变化环境中,该方法在发现目标方面的可靠性。
Summary / 总结
The paper proposes a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning to efficiently uncover hidden targets in dynamic environments with limited data. It introduces a concept-weighted uncertainty sampling strategy and a relevance-aware meta-batch formation strategy, both leveraging domain-specific concepts to improve target discovery. The method was tested on a real-world dataset of PFAS contamination, demonstrating its effectiveness in uncovering targets under resource constraints.
论文提出了一种结合主动学习、在线元学习和概念引导推理的统一地理空间发现框架,以在动态环境中高效地发现隐藏目标,同时数据有限。该方法引入了基于领域特定概念的概念加权不确定性采样策略和相关性感知元批次形成策略,以提高目标发现效果。该方法在PFAS污染的真实世界数据集上进行了测试,展示了其在资源受限条件下发现目标的有效性。
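The concept-weighted uncertainty sampling strategy described above can be sketched as an acquisition score that multiplies predictive uncertainty by a learned relevance term. The multiplicative form, the sigmoid relevance model, and all names below are assumptions made for illustration; the abstract only states that uncertainty is "modulated by learned relevance" based on domain-specific concepts.

```python
import math


def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli prediction with target probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def concept_weighted_scores(probs, concept_features, relevance_weights):
    """Score each candidate region as entropy(p_i) * sigmoid(w . c_i).

    `concept_features[i]` holds concept values for region i (for example,
    land cover or source proximity); `relevance_weights` is assumed learned.
    """
    scores = []
    for p, feats in zip(probs, concept_features):
        z = sum(w * c for w, c in zip(relevance_weights, feats))
        relevance = 1.0 / (1.0 + math.exp(-z))
        scores.append(binary_entropy(p) * relevance)
    return scores


def select_queries(scores, budget):
    """Indices of the `budget` highest-scoring regions to sample next."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]
```

Two regions with identical uncertainty are ranked differently when their concept features differ, so the sampling budget is steered toward areas where the domain factors suggest a target is plausible.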
Boosting Medical Visual Understanding From Multi-Granular Language Learning
Authors: Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan
Venue: ICLR 2026
First: 2025-11-20T00:24:26+00:00 · Latest: 2026-02-19T18:27:29+00:00
Comments: Accepted by ICLR 2026. 40 pages
Abstract
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
中文标题/摘要
标题:从多粒度语言学习增强医学视觉理解
近期在图像-文本预训练方面的进展显著提升了视觉理解能力,通过视觉和文本表示的对齐。对比语言-图像预训练(CLIP)在多模态学习中发挥了关键作用。然而,其对单标签、单粒度对齐的侧重限制了其在医学成像等复杂领域中的有效性,因为医学图像往往对应多个高级标签(例如,疾病类别),且不同注释粒度(例如,诊断描述、临床解释)不同。为解决这一问题,我们提出了多粒度语言学习(MGLL),这是一种对比学习框架,旨在提高多标签和跨粒度对齐。MGLL 利用结构化的多标签监督,整合不同粒度的文本描述,并引入软标签监督和点对点约束以增强对齐。MGLL 使用平滑的Kullback-Leibler(KL)散度确保跨粒度一致性,同时保持计算效率作为视觉-语言模型的即插即用模块。在我们构建的大规模多粒度数据集上预训练,并在多个数据集上进行评估,MGLL 在下游任务中优于其他最先进的方法。代码可在 https://github.com/HUANGLIZI/MGLL/ 获取。
Summary / 总结
The research aims to improve medical visual understanding by addressing the limitations of existing single-granularity alignment methods. It introduces Multi-Granular Language Learning (MGLL), a contrastive learning framework that enhances multi-label and cross-granularity alignment through structured multi-label supervision, integrated textual descriptions, and soft-label supervision. MGLL outperforms other state-of-the-art methods in downstream tasks when pretrained on large-scale multi-granular datasets.
研究旨在通过解决现有单一粒度对齐方法的局限性,提高医学视觉理解。提出的多粒度语言学习(MGLL)框架通过结构化的多标签监督、跨粒度的文本描述整合以及带有点对点约束的软标签监督,增强了多标签和跨粒度对齐。MGLL在大规模多粒度数据集上预训练后,在下游任务中优于其他最先进的方法。
Art2Mus: Artwork-to-Music Generation via Visual Conditioning and Large-Scale Cross-Modal Alignment
Authors: Ivan Rinaldi, Matteo Mendula, Nicola Fanelli, Florence Levé, Matteo Testi, Giovanna Castellano, Gennaro Vessio
First: 2026-02-19T18:23:58+00:00 · Latest: 2026-02-19T18:23:58+00:00
Abstract
Music generation has advanced markedly through multimodal deep learning, enabling models to synthesize audio from text and, more recently, from images. However, existing image-conditioned systems suffer from two fundamental limitations: (i) they are typically trained on natural photographs, limiting their ability to capture the richer semantic, stylistic, and cultural content of artworks; and (ii) most rely on an image-to-text conversion stage, using language as a semantic shortcut that simplifies conditioning but prevents direct visual-to-audio learning. Motivated by these gaps, we introduce ArtSound, a large-scale multimodal dataset of 105,884 artwork-music pairs enriched with dual-modality captions, obtained by extending ArtGraph and the Free Music Archive. We further propose ArtToMus, the first framework explicitly designed for direct artwork-to-music generation, which maps digitized artworks to music without image-to-text translation or language-based semantic supervision. The framework projects visual embeddings into the conditioning space of a latent diffusion model, enabling music synthesis guided solely by visual information. Experimental results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect salient visual cues of the source artworks. While absolute alignment scores remain lower than those of text-conditioned systems (as expected, given the substantially increased difficulty of removing linguistic supervision), ArtToMus achieves competitive perceptual quality and meaningful cross-modal correspondence. This work establishes direct visual-to-music generation as a distinct and challenging research direction, and provides resources that support applications in multimedia art, cultural heritage, and AI-assisted creative practice. Code and dataset will be publicly released upon acceptance.
中文标题/摘要
标题:Art2Mus:通过视觉条件和大规模跨模态对齐进行艺术作品到音乐的生成
通过多模态深度学习,音乐生成取得了显著进展,使模型能够从文本和最近的图像中合成音频。然而,现有的基于图像的系统存在两个根本性限制:(i)它们通常在自然照片上进行训练,限制了它们捕捉艺术作品中更丰富的语义、风格和文化内容的能力;(ii)大多数系统依赖于图像到文本的转换阶段,使用语言作为语义捷径,简化了条件设定,但阻止了直接的视觉到音频学习。受这些差距的启发,我们引入了ArtSound,这是一个包含105,884个艺术作品-音乐配对的大规模多模态数据集,这些配对通过扩展ArtGraph和免费音乐档案馆而丰富了双模态描述。我们进一步提出了ArtToMus,这是第一个明确设计用于直接艺术作品到音乐生成的框架,该框架将数字化的艺术作品映射到音乐中,而无需进行图像到文本的转换或基于语言的语义监督。该框架将视觉嵌入投影到潜在扩散模型的条件空间中,使音乐合成仅由视觉信息引导。实验结果表明,ArtToMus生成了音乐上连贯且风格上一致的输出,反映了源艺术作品的显著视觉线索。虽然绝对对齐分数低于基于文本条件的系统(如预期的那样,由于去除了语言监督的难度显著增加),但ArtToMus在感知质量上达到了竞争力,并实现了有意义的跨模态对应。这项工作确立了直接视觉到音乐生成作为一种独特且具有挑战性的研究方向,并提供了支持多媒体艺术、文化遗产和AI辅助创意实践的应用资源。代码和数据集将在接受后公开发布。
Summary
The research addresses the limitations of existing image-conditioned music generation systems by introducing ArtSound, a large-scale multimodal dataset, and ArtToMus, a framework for direct artwork-to-music generation. ArtToMus maps digitized artworks to music without an image-to-text conversion stage, projecting visual embeddings into the conditioning space of a latent diffusion model. The results show that ArtToMus generates musically coherent and stylistically consistent outputs that reflect visual cues from the source artworks. It achieves competitive perceptual quality and meaningful cross-modal correspondence, although its absolute alignment scores remain lower than those of text-conditioned systems. The work highlights the potential of direct visual-to-music generation for applications in multimedia art and cultural heritage.
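The conditioning step described in the abstract (projecting visual embeddings into the conditioning space of a latent diffusion model) can be illustrated with a minimal sketch. The abstract does not specify the projection architecture, so the single linear layer, the dimensions, and the names `make_projection` and `project` below are all hypothetical choices for illustration, not the authors' implementation.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Create a randomly initialised linear layer (hypothetical stand-in
    for a learned projection; the real architecture is not given)."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5  # keeps output magnitudes comparable to the input
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def project(embedding, weights, bias):
    """Map a visual embedding to the (assumed) conditioning dimension."""
    if len(embedding) != len(weights[0]):
        raise ValueError("embedding size does not match projection input size")
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]


# Illustrative sizes only: an 8-dimensional image embedding mapped to a
# 4-dimensional conditioning vector.
weights, bias = make_projection(in_dim=8, out_dim=4)
conditioning = project([1.0] * 8, weights, bias)
```

In a trained system the projection weights would be learned jointly with, or on top of, the diffusion model; here they are random and serve only to show the shape of the mapping.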
The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?
Authors: Jayadev Billa
First: 2026-02-19T18:22:39+00:00 · Latest: 2026-02-19T18:22:39+00:00
Comments: 10 pages, 6 figures, 7 tables
Abstract
Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($κ{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
Summary
The study investigates when speech LLMs behave like ASR→LLM pipelines, using matched-backbone testing of four speech LLMs across six tasks. Key findings: Ultravox is statistically indistinguishable from its matched cascade, logit-lens analysis shows literal text emerging in hidden states, and LEACE concept erasure confirms that text representations are causally necessary in both architectures tested. Qwen2-Audio diverges, indicating that cascade equivalence is architecture-dependent rather than universal. For most deployed use cases, current speech LLMs amount to expensive cascades, and they perform worse than cascades under noise, with clean-condition advantages reversing by up to 7.6% at 0 dB.
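The agreement statistic reported in the abstract (κ = 0.93 between Ultravox and its matched cascade) can be illustrated with a short sketch. This assumes that κ denotes Cohen's kappa computed over per-example predictions of the two systems, which the abstract does not state explicitly; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(preds_a, preds_b):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(preds_a)
    # Observed agreement: fraction of examples where both systems agree.
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    # Expected agreement if the two systems labelled independently,
    # each following its own marginal label distribution.
    counts_a, counts_b = Counter(preds_a), Counter(preds_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both systems always emit the same single label
    return (observed - expected) / (1.0 - expected)


# Toy predictions (hypothetical, not data from the paper).
speech_llm = ["yes", "yes", "no", "no"]
cascade = ["yes", "no", "no", "no"]
kappa = cohens_kappa(speech_llm, cascade)
```

Values near 1 mean the two systems make nearly the same decision on each example, which is a stronger statement than having similar aggregate accuracy.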
CT-Bench: A Benchmark for Multimodal Lesion Understanding in Computed Tomography
Authors: Qingqing Zhu, Qiao Jin, Tejas S. Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen T. S. Balamuralikrishna, Kenneth C. Wang, Ronald M. Summers, Zhiyong Lu
First: 2026-02-16T16:10:19+00:00 · Latest: 2026-02-19T18:19:25+00:00
Abstract
Artificial intelligence (AI) can automatically delineate lesions on computed tomography (CT) and generate radiology report content, yet progress is limited by the scarcity of publicly available CT datasets with lesion-level annotations. To bridge this gap, we introduce CT-Bench, a first-of-its-kind benchmark dataset comprising two components: a Lesion Image and Metadata Set containing 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and a multitask visual question answering benchmark with 2,850 QA pairs covering lesion localization, description, size estimation, and attribute categorization. Hard negative examples are included to reflect real-world diagnostic challenges. We evaluate multiple state-of-the-art multimodal models, including vision-language and medical CLIP variants, by comparing their performance to radiologist assessments, demonstrating the value of CT-Bench as a comprehensive benchmark for lesion analysis. Moreover, fine-tuning models on the Lesion Image and Metadata Set yields significant performance gains across both components, underscoring the clinical utility of CT-Bench.
Summary
CT-Bench is a new benchmark dataset for lesion understanding in computed tomography. Its Lesion Image and Metadata Set contains 20,335 lesions from 7,795 CT studies with bounding boxes, descriptions, and size information, and it also includes a multitask visual question answering benchmark with 2,850 QA pairs. The authors evaluate state-of-the-art multimodal models on the benchmark and show that fine-tuning on the Lesion Image and Metadata Set yields significant performance gains across both components, highlighting the clinical utility of CT-Bench for lesion analysis.
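Because the lesions are annotated with bounding boxes, one common way to score a localization prediction is intersection over union (IoU). The abstract does not say which metric CT-Bench uses, so the sketch below is an assumption for illustration; the `(x_min, y_min, x_max, y_max)` box format and the function names are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    union = box_area(box_a) + box_area(box_b) - intersection
    return intersection / union if union > 0 else 0.0


# Toy boxes in pixel coordinates (hypothetical, not taken from the dataset).
predicted = (0.0, 0.0, 2.0, 2.0)
annotated = (1.0, 1.0, 3.0, 3.0)
overlap = iou(predicted, annotated)
```

A prediction is often counted as correct when its IoU with the annotated box exceeds a chosen threshold; that threshold is a design choice and is not taken from the paper.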
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
First: 2026-02-19T18:17:25+00:00 · Latest: 2026-02-19T18:17:25+00:00
Comments: 29 pages, 14 figures
Abstract
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy -- the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
Summary
The research aims to evaluate machine intelligence rigorously against human general intelligence through a wide range of games designed by humans for humans. It introduces the AI GameStore, a scalable and open-ended platform that uses large language models with human oversight to synthesize new representative games from popular digital gaming platforms. Seven frontier vision-language models were evaluated on short episodes of play. The best models achieved less than 10% of the human average score on most games and struggled in particular with games that challenge world-model learning, memory, and planning.
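The headline comparison (model score as a fraction of the human average score, per game) can be illustrated with a minimal sketch. The exact normalization used by the authors is not given in the abstract, so the simple ratio below, the function names, and the scores are all assumptions for illustration.

```python
def human_normalized(model_score, human_mean):
    """Model score expressed as a fraction of the human average score."""
    if human_mean <= 0:
        raise ValueError("human average score must be positive")
    return model_score / human_mean


def share_below(scores, threshold=0.10):
    """Fraction of games where the model stays under `threshold` of the
    human average. `scores` is a list of (model_score, human_mean) pairs."""
    ratios = [human_normalized(model, human) for model, human in scores]
    return sum(ratio < threshold for ratio in ratios) / len(ratios)


# Hypothetical per-game scores, not results from the paper.
games = [(5.0, 100.0), (20.0, 100.0), (1.0, 50.0)]
fraction_under_10_percent = share_below(games)
```

Normalizing per game before aggregating keeps games with large raw score ranges from dominating the comparison.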
Supervised Graph Contrastive Learning for Gene Regulatory Networks
Authors: Sho Oshima, Yuji Okamoto, Taisei Tosaki, Ryosuke Kojima
First: 2025-05-23T11:59:35+00:00 · Latest: 2026-02-19T18:13:50+00:00
Comments: Preprint
Abstract
Graph Contrastive Learning (GCL) is a powerful self-supervised learning framework that performs data augmentation through graph perturbations, with growing applications in the analysis of biological networks such as Gene Regulatory Networks (GRNs). The artificial perturbations commonly used in GCL, such as node dropping, induce structural changes that can diverge from biological reality. This concern has contributed to a broader trend in graph representation learning toward augmentation-free methods, which view such structural changes as problematic and best avoided. However, this trend overlooks the fundamental insight that structural changes from biologically meaningful perturbations are not a problem to be avoided, but rather a rich source of information, thereby ignoring the valuable opportunity to leverage data from real biological experiments. Motivated by this insight, we propose SupGCL (Supervised Graph Contrastive Learning), a new GCL method for GRNs that directly incorporates biological perturbations from gene knockdown experiments as supervision. SupGCL is a probabilistic formulation that continuously generalizes conventional GCL, linking artificial augmentations with real perturbations measured in knockdown experiments, and using the latter as explicit supervision. On patient-derived GRNs from three cancer types, we train GRN representations with SupGCL and evaluate it in two regimes: (i) embedding space analysis, where it yields clearer disease-subtype structure and improves clustering, and (ii) task-specific fine-tuning, where it consistently outperforms strong graph representation learning baselines on 13 downstream tasks spanning gene-level functional annotation and patient-level prediction.
Summary
Motivated by the need to incorporate biologically meaningful perturbations into graph contrastive learning for Gene Regulatory Networks (GRNs), the paper proposes SupGCL, a supervised graph contrastive learning method. SupGCL links artificial augmentations with real perturbations measured in gene knockdown experiments and uses the latter as explicit supervision. On patient-derived GRNs from three cancer types, SupGCL improves clustering and outperforms strong graph representation learning baselines on 13 downstream tasks, spanning gene-level functional annotation and patient-level prediction.
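For context, the conventional GCL objective that SupGCL is said to generalize is typically an InfoNCE-style contrastive loss: the embedding of an augmented view is pulled toward its anchor and pushed away from other samples. The sketch below shows that conventional loss only. It is not SupGCL's probabilistic formulation, which the abstract does not spell out; the function names, temperature, and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor: low when the positive view is more
    similar to the anchor than the negatives are."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D embeddings (hypothetical): the positive is aligned with the anchor,
# the single negative is orthogonal to it.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In conventional GCL the positive view comes from an artificial perturbation such as node dropping; according to the abstract, SupGCL instead ties the views to perturbations measured in knockdown experiments.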
Modeling Distinct Human Interaction in Web Agents
Authors: Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham
First: 2026-02-19T18:11:28+00:00 · Latest: 2026-02-19T18:11:28+00:00
Comments: Preprint
Abstract
Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.
Summary
This work addresses the need to better understand and model human intervention in web agents, which is essential for collaborative web task execution. It introduces CowCorpus, a dataset of 400 real-user web navigation trajectories, and identifies four distinct interaction patterns. Language models trained to predict when users are likely to intervene achieve a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Deployed in live web navigation agents and evaluated in a user study, these models yield a 26.5% increase in user-rated agent usefulness.