arXiv 论文速递

Scalably Enhancing the Clinical Validity of a Task Benchmark with Physician Oversight

Authors: Junze Ye, Daniel Tawfik, Alex J. Goodell, Nikhil V. Kotha, Mark K. Buyyounouski, Mohsen Bayati

First: 2025-12-22T18:59:34+00:00 · Latest: 2025-12-22T18:59:34+00:00

Abstract

Automating the calculation of clinical risk scores offers a significant opportunity to reduce physician administrative burden and enhance patient care. The current standard for evaluating this capability is MedCalc-Bench, a large-scale dataset constructed using LLM-based feature extraction and rule-based aggregation. However, treating such model-generated benchmarks as static oracles risks enshrining historical model errors as evaluation gold standards, a problem dangerously amplified when these datasets serve as reward signals for Reinforcement Learning (RL). In this work, we propose viewing benchmarks for complex tasks such as clinical score computation as "in-progress living documents" that should be periodically re-evaluated as the processes for creating them improve. We introduce a systematic, physician-in-the-loop pipeline that leverages advanced agentic verifiers to audit and relabel MedCalc-Bench, utilizing automated triage to reserve scarce clinician attention for the most contentious instances. Our audit reveals that a notable fraction of original labels diverge from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity. To study whether this label noise meaningfully impacts downstream RL training, we fine-tune a Qwen3-8B model via Group Relative Policy Optimization (GRPO) and demonstrate that training on corrected labels yields an 8.7% absolute improvement in accuracy over the original baseline -- validating that label noise materially affects model evaluation. These findings underscore that in safety-critical domains, rigorous benchmark maintenance is a prerequisite for genuine model alignment.
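To make the reward-signal concern concrete, here is a minimal sketch of the group-relative advantage computation commonly associated with GRPO, showing how a single wrong label flips the sign of the learning signal. The tolerance-based reward, the example numbers, and the function names are illustrative assumptions, not the authors' training setup.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled completion's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]

def exact_match_reward(prediction, label, tol=0.05):
    """Hypothetical reward: 1.0 if within a relative tolerance of the label."""
    return 1.0 if abs(prediction - label) <= tol * abs(label) else 0.0

# Four sampled answers to one hypothetical calculator question.
predictions = [7.0, 7.0, 9.0, 9.0]
noisy_label, corrected_label = 9.0, 7.0  # assumed values for illustration

adv_noisy = group_relative_advantages(
    [exact_match_reward(p, noisy_label) for p in predictions])
adv_fixed = group_relative_advantages(
    [exact_match_reward(p, corrected_label) for p in predictions])

# With the noisy label, the clinically correct answer (7.0) is pushed down;
# with the corrected label, the same answer is reinforced.
print(adv_noisy)
print(adv_fixed)
```

Because advantages are computed relative to the group, a mislabeled instance does not merely add noise: under these assumptions it assigns a negative advantage to exactly the completions that were right.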

中文标题/摘要

标题：通过医师监督可扩展地增强临床效度任务基准

自动化临床风险评分的计算提供了一个显著的机会，以减轻医师的行政负担并提升患者护理质量。当前评估这种能力的标准是MedCalc-Bench，这是一个大规模数据集，使用基于LLM的特征提取和基于规则的聚合构建而成。然而，将此类模型生成的基准视为静态的神谕（oracle），存在将历史模型错误固化为评估金标准的风险，当这些数据集作为强化学习（RL）的奖励信号时，这一问题会被危险地放大。在本文中，我们提出将复杂任务（如临床评分计算）的基准视为“进行中的活文档”，应随着创建它们的过程改进而定期重新评估。我们引入了一种系统性的、医师参与的流程，利用先进的智能体验证器对MedCalc-Bench进行审计和重新标注，并利用自动化分诊将稀缺的临床医生注意力留给最具争议的实例。我们的审计揭示，原始标签中有相当一部分与医学事实真相存在偏差，原因包括提取错误、计算器逻辑不匹配和临床模糊性。为了研究这种标签噪声是否对下游RL训练产生实质性影响，我们通过组相对策略优化（GRPO）微调了一个Qwen3-8B模型，并证明使用修正后的标签进行训练比原始基线的准确率绝对提高了8.7%，验证了标签噪声对模型评估有实质性影响。这些发现强调，在安全关键领域，严格的基准维护是实现真正模型对齐的前提。

Summary / 总结

This study addresses the risk of treating model-generated benchmarks as static evaluation standards for clinical risk score calculation, which can enshrine historical model errors as ground truth. It proposes a physician-in-the-loop pipeline to periodically re-evaluate benchmarks, using automated triage to focus clinician attention on the most contentious cases. The audit found that a notable fraction of original MedCalc-Bench labels diverge from medical ground truth due to extraction errors, calculator logic mismatches, and clinical ambiguity. Fine-tuning a Qwen3-8B model with GRPO on corrected labels yielded an 8.7% absolute improvement in accuracy over training on the original labels, highlighting the impact of label noise on model training and evaluation.

该研究针对使用模型生成的基准作为临床风险评分计算评估标准可能导致历史错误的问题，提出了一种医生在环的周期性重新评估管道，利用自动化分诊将临床医生的关注点集中在争议案例上。审计发现，由于提取错误、计算器逻辑不匹配和临床歧义，存在大量标签噪声。使用修正后的标签微调Qwen3-8B模型，准确率提高了8.7%，验证了标签噪声对模型评估的影响。

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning

Authors: Apoorv Vyas, Heng-Jui Chang, Cheng-Fu Yang, Po-Yao Huang, Luya Gao, Julius Richter, Sanyuan Chen, Matt Le, Piotr Dollár, Christoph Feichtenhofer, Ann Lee, Wei-Ning Hsu

First: 2025-12-22T18:59:07+00:00 · Latest: 2025-12-22T18:59:07+00:00

Abstract

We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
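As a rough illustration of what a single pairwise contrastive objective looks like, the sketch below implements a symmetric InfoNCE loss between two batches of embeddings in plain Python. The temperature value, the toy vectors, and the function names are assumptions for illustration; the abstract does not specify PE-AV's exact loss or how its ten pairs are formed.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)

def symmetric_contrastive_loss(a, b, temperature=0.07):
    """Average of the a->b and b->a InfoNCE terms (CLIP-style)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))

# Toy embeddings: row i of `audio` is paired with row i of `video`.
audio = [[1.0, 0.1], [0.0, 1.0]]
video = [[0.9, 0.0], [0.1, 1.0]]

aligned = symmetric_contrastive_loss(audio, video)
shuffled = symmetric_contrastive_loss(audio, video[::-1])
print(aligned, shuffled)  # the correctly paired batch has the lower loss
```

A multi-objective setup would sum one such term per modality or caption-type pair; how those terms are weighted is not stated in the abstract.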

中文标题/摘要

标题：利用大规模多模态对应学习推动视听感知前沿

我们介绍了感知编码器视听感知（PE-AV），这是一种用于音频和视频理解的新一代编码器，通过扩展对比学习进行训练。基于PE，PE-AV 在扩展表示到音频方面做出了多项关键贡献，并且能够原生支持跨音频-视频、音频-文本和视频-文本模态的联合嵌入。PE-AV 统一的跨模态嵌入使我们能够实现新的任务，如语音检索，并在标准的音频和视频基准测试中达到新的最佳性能。我们通过构建强大的视听数据引擎，为约一亿量级（O(100M)）的音频-视频对生成高质量的字幕，从而实现跨模态的一致大规模监督。我们的音频数据包括语音、音乐和一般声效，避免了先前工作中常见的单域限制。我们利用十个成对对比目标，表明跨模态和字幕类型对的扩展增强了对齐并提高了零样本性能。我们进一步通过使用帧级对比目标对PE-AV进行微调，开发了PE-A-Frame，使其能够实现细粒度的音频帧到文本对齐，用于声音事件检测等任务。

Summary / 总结

The research introduces Perception Encoder Audiovisual (PE-AV), a new encoder for audio and video understanding trained with scaled contrastive learning. PE-AV extends representations to audio and supports joint embeddings across multiple modalities, achieving state-of-the-art results on standard benchmarks. The study synthesizes high-quality captions for a large dataset of audio-video pairs, enabling consistent multimodal supervision. PE-AV uses ten pairwise contrastive objectives, demonstrating improved alignment and zero-shot performance. Fine-tuning with frame-level contrastive objectives further enhances audio-frame-to-text alignment for tasks like sound event detection.

研究旨在通过大规模多模态对应学习提升音频视觉感知。PE-AV 是一种新的编码器家族，通过扩展音频和视频理解，并支持跨音频视频、音频文本和视频文本模态的联合嵌入。关键实验发现包括在标准基准测试中达到新的最佳性能，并在语音检索任务中表现出色。该方法涉及为大量音频视频对生成高质量的字幕，以实现跨模态的一致监督，并使用十个成对对比目标来提高对齐和零样本性能。

Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models

Authors: Zixuan Ye, Quande Liu, Cong Wei, Yuanxing Zhang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhan Luo

First: 2025-12-22T18:59:03+00:00 · Latest: 2025-12-22T18:59:03+00:00

Comments: Project Page: https://zixuan-ye.github.io/VACoT/

Abstract

Recently, the introduction of Chain-of-Thought (CoT) has largely improved the generation ability of unified models. However, the current thinking process during generation mainly focuses on text consistency with the text prompt, ignoring visual context consistency with the visual reference images during multi-modal generation, e.g., multi-reference generation. The lack of such consistency results in failure to maintain key visual features (such as human ID, object attributes, and style). To this end, we integrate visual context consistency into the reasoning of unified models, explicitly motivating the model to sustain such consistency by 1) Adaptive Visual Planning: generating a structured visual checklist to identify the visual elements whose consistency must be kept, and 2) Iterative Visual Correction: performing self-reflection with the guidance of the checklist and refining the generated result in an iterative manner. To achieve this, we use supervised finetuning to teach the model how to plan the visual checking and conduct self-reflection and self-refinement, and use flow-GRPO to further enhance visual consistency through a customized visual checking reward. The experiments show that our method outperforms both zero-shot unified models and those with text CoTs in multi-modal generation, demonstrating higher visual context consistency.

中文标题/摘要

标题：视觉感知CoT：在统一模型中实现高保真视觉一致性

最近，Chain-of-Thought (CoT) 的引入大大提高了统一模型的生成能力。然而，观察到当前生成过程中的思考过程主要集中在文本与文本提示的一致性上，忽视了在多模态生成（例如多参考生成）过程中与视觉参考图像的视觉上下文一致性。缺乏这种一致性导致关键视觉特征（如人类ID、对象属性、风格）的保持失败。为此，我们将视觉上下文一致性整合到统一模型的推理中，明确地激励模型保持这种一致性，通过1）自适应视觉规划：生成结构化的视觉检查清单，以确定需要保持的视觉元素，和2）迭代视觉校正：在检查清单的指导下进行自我反思，并以迭代方式改进生成结果。为了实现这一点，我们使用监督微调来教导模型如何规划视觉检查、进行自我反思和自我改进，并使用定制的视觉检查奖励进一步通过flow-GRPO增强视觉一致性。实验表明，我们的方法在多模态生成中优于零样本统一模型和具有文本CoT的模型，显示出更高的视觉上下文一致性。

Summary / 总结

The paper addresses the issue of visual context consistency in unified models during multi-modal generation, where the current CoT mainly focuses on text consistency. To improve this, the authors introduce Visual-Aware CoT, which includes Adaptive Visual Planning and Iterative Visual Correction. These methods help the model maintain key visual features. Experiments show that this approach outperforms both zero-shot unified models and those with text CoTs in terms of visual context consistency.

该论文通过引入Visual-Aware CoT解决了多模态生成中的视觉上下文一致性问题。方法包括自适应视觉规划，生成结构化的视觉检查清单以识别必要的视觉一致性，以及迭代视觉校正，通过自我反思逐步改进生成结果。实验表明，该方法在保持视觉上下文一致性方面优于零样本统一模型和仅具有文本CoT的模型。

From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs

Authors: Mingrui Wu, Zhaozhi Wang, Fangjinhua Wang, Jiaolong Yang, Marc Pollefeys, Tong Zhang

First: 2025-12-22T18:58:12+00:00 · Latest: 2025-12-22T18:58:12+00:00

Comments: Project page: https://harmlesssr.github.io/openbench/

Abstract

While Multimodal Large Language Models (MLLMs) have achieved impressive performance on semantic tasks, their spatial intelligence--crucial for robust and grounded AI systems--remains underdeveloped. Existing benchmarks fall short of diagnosing this limitation: they either focus on overly simplified qualitative reasoning or rely on domain-specific indoor data, constrained by the lack of outdoor datasets with verifiable metric ground truth. To bridge this gap, we introduce a large-scale benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions that span a hierarchical spectrum--from qualitative relational reasoning to quantitative metric and kinematic understanding. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings. Further analysis using synthetic abnormal scenes and blinding tests confirms that current MLLMs depend heavily on linguistic priors instead of grounded visual reasoning. Our benchmark thus provides a principled platform for diagnosing these limitations and advancing physically grounded spatial intelligence.

中文标题/摘要

标题：从室内到开放世界：揭示MLLM的空间推理差距

虽然多模态大型语言模型（MLLMs）在语义任务上取得了令人印象深刻的性能，但它们的空间智能——对于稳健和基于现实的AI系统至关重要——仍然发展不足。现有的基准测试在诊断这一局限性方面存在不足：它们要么专注于过于简化的定性推理，要么依赖于室内数据，受限于缺乏具有可验证度量真实性的室外数据集。为了弥合这一差距，我们引入了一个基于行人视角视频的大规模基准，这些视频由同步立体相机、LiDAR和IMU/GPS传感器捕获。该数据集提供了度量精确的3D信息，使我们能够自动生成跨越从定性关系推理到定量度量和运动理解的层次谱系的空间推理问题。评估表明，在开放世界设置中，结构化室内基准测试中观察到的性能提升消失。进一步使用合成异常场景和盲测分析证实，当前的MLLMs严重依赖于语言先验而非基于视觉的推理。因此，我们的基准测试为诊断这些局限性并推进物理上基于现实的空间智能提供了一个原则性的平台。

Summary / 总结

This study addresses the spatial reasoning gap in Multimodal Large Language Models (MLLMs) by introducing a new benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. The benchmark provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions ranging from qualitative to quantitative levels. Evaluations show that the performance gains observed on structured indoor benchmarks vanish in open-world settings, and further analysis with synthetic abnormal scenes and blinding tests indicates a heavy reliance on linguistic priors rather than grounded visual reasoning. This work offers a principled platform for diagnosing and advancing physically grounded spatial intelligence in AI systems.

研究旨在通过使用带有同步传感器的行人视角视频引入新基准来解决MLLMs的空间推理差距。方法是生成来自精确3D数据的空间推理问题，涵盖从定性到定量的理解范围。关键发现表明，MLLMs在开放世界环境中表现不佳，更多依赖于语言先验而非视觉推理，表明需要在物理上接地的空间智能方面取得进展。

GenEnv: Difficulty-Aligned Co-Evolution Between LLM Agents and Environment Simulators

Authors: Jiacheng Guo, Ling Yang, Peter Chen, Qixin Xiao, Yinjie Wang, Xinzhe Juan, Jiahao Qiu, Ke Shen, Mengdi Wang

First: 2025-12-22T18:57:13+00:00 · Latest: 2025-12-22T18:57:13+00:00

Comments: Our codes are available at https://github.com/Gen-Verse/GenEnv

Abstract

Training capable Large Language Model (LLM) agents is critically bottlenecked by the high cost and static nature of real-world interaction data. We address this by introducing GenEnv, a framework that establishes a difficulty-aligned co-evolutionary game between an agent and a scalable, generative environment simulator. Unlike traditional methods that evolve models on static datasets, GenEnv instantiates a data-evolving process: the simulator acts as a dynamic curriculum policy, continuously generating tasks specifically tailored to the agent's "zone of proximal development". This process is guided by a simple but effective $α$-Curriculum Reward, which aligns task difficulty with the agent's current capabilities. We evaluate GenEnv on five benchmarks: API-Bank, ALFWorld, BFCL, Bamboogle, and TravelPlanner. Across these tasks, GenEnv improves agent performance by up to +40.3% over 7B baselines and matches or exceeds the average performance of larger models. Compared to Gemini 2.5 Pro-based offline data augmentation, GenEnv achieves better performance while using 3.3× less data. By shifting from static supervision to adaptive simulation, GenEnv provides a data-efficient pathway for scaling agent capabilities.
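The abstract does not give the form of the α-Curriculum Reward, so the following is only one plausible shape under stated assumptions: a reward for the simulator that peaks when the agent's empirical success rate on a generated task batch equals a target α and decays as tasks become too easy or too hard. The Gaussian form, the `beta` sharpness parameter, and the function name are hypothetical.

```python
import math

def alpha_curriculum_reward(success_rate, alpha=0.5, beta=10.0):
    """Hypothetical difficulty-alignment reward for the task generator.

    success_rate: fraction of generated tasks the agent solved (0..1).
    alpha: target success rate (assumed meaning of the paper's alpha).
    beta: assumed sharpness of the penalty for missing the target.
    """
    return math.exp(-beta * (success_rate - alpha) ** 2)

# Tasks that are far too easy or far too hard earn the simulator little reward.
for rate in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(rate, round(alpha_curriculum_reward(rate), 3))
```

Any reward with a single maximum at the target success rate would express the same idea; the choice here is purely for readability.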

中文标题/摘要

标题：GenEnv：LLM代理与环境模拟器之间的难度对齐协同进化

训练强大的大型语言模型（LLM）代理受到现实世界交互数据的高成本和静态性质的严重瓶颈。我们通过引入GenEnv框架解决了这一问题，该框架在代理和可扩展的生成环境模拟器之间建立了一种难度对齐的协同进化游戏。与传统方法在静态数据集上进化模型不同，GenEnv 实现了一个数据演变过程：模拟器充当动态课程策略，不断生成专门针对代理“最近发展区”的任务。这一过程由一个简单但有效的$α$-课程奖励引导，该奖励将任务难度与代理当前的能力对齐。我们在API-Bank、ALFWorld、BFCL、Bamboogle和TravelPlanner五个基准上评估了GenEnv。在这些任务中，GenEnv相对7B基线将代理性能最多提高了40.3%，并达到或超过了更大模型的平均性能。与基于Gemini 2.5 Pro的离线数据增强相比，GenEnv在所用数据减少3.3倍的情况下实现了更好的性能。通过从静态监督转向适应性模拟，GenEnv为扩展代理能力提供了一种数据高效的途径。

Summary / 总结

GenEnv addresses the challenge of training capable LLM agents by introducing a difficulty-aligned co-evolutionary framework between an agent and a generative environment simulator. The simulator dynamically generates tasks tailored to the agent's current capabilities, guided by an α-Curriculum Reward. Across five benchmarks, GenEnv improves agent performance by up to 40.3% over 7B baselines, matches or exceeds the average performance of larger models, and outperforms Gemini 2.5 Pro-based offline data augmentation while using 3.3× less data.

GenEnv通过引入一个代理与生成环境模拟器之间的难度对齐协同进化框架来解决训练强大LLM代理的挑战。模拟器根据α-课程奖励动态生成针对代理当前能力的任务。GenEnv在五个基准测试中显著提高了代理性能，相比7B基线提高了高达40.3%，并且使用的数据比Gemini 2.5 Pro基线方法少。

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

Authors: Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang

First: 2025-10-10T17:54:24+00:00 · Latest: 2025-12-22T18:56:01+00:00

Abstract

Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations such as a lack of exceptionally challenging problems, insufficient test case coverage, and reliance on online platform APIs that limit accessibility. To address these issues, we introduce LiveOIBench, a comprehensive benchmark featuring 403 expert-curated Olympiad-level competitive programming problems, each with an average of 60 expert-designed test cases. The problems are sourced directly from 72 official contests of 14 Informatics Olympiads in different regions conducted between 2023 and 2025. LiveOIBench distinguishes itself through four key features: (1) meticulously curated high-quality tasks with detailed subtask rubrics and extensive private test cases; (2) direct integration of elite contestant performance data to enable informative comparison against top-performing humans; (3) planned continuous, contamination-free updates from newly released Olympiad problems; and (4) a self-contained evaluation system facilitating offline and easy-to-reproduce assessments. Benchmarking 34 popular general-purpose and reasoning LLMs, we find that GPT-5 achieves a notable 81.76th percentile, a strong result that nonetheless falls short of top human contestants, who usually place above the 90th percentile. In contrast, among open-weight reasoning models, GPT-OSS-120B achieves only the 60th percentile, underscoring significant capability disparities from frontier closed models. Detailed analyses indicate that robust reasoning models prioritize precise problem analysis over excessive exploration, suggesting future models should emphasize structured analysis and minimize unnecessary exploration. All data, code, and leaderboard results are publicly available on our website.
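For readers unfamiliar with how a model is placed on a human percentile scale, a minimal sketch follows. It assumes the percentile is the share of human contestants scoring strictly below the model in a single contest; the benchmark's exact definition and aggregation across contests are not stated in the abstract, and the scores below are made up.

```python
from bisect import bisect_left

def percentile_rank(model_score, human_scores):
    """Percentage of human contestants scoring strictly below the model."""
    ranked = sorted(human_scores)
    return 100.0 * bisect_left(ranked, model_score) / len(ranked)

# Hypothetical final scores of ten contestants in one contest.
humans = [100, 250, 300, 320, 410, 480, 500, 560, 590, 600]

print(percentile_rank(495, humans))  # 6 of 10 contestants score below 495
```

Ties and per-contest averaging are the main places where a real leaderboard could differ from this simplification.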

中文标题/摘要

标题：LiveOIBench：大型语言模型能否在信息学奥林匹克竞赛中超越人类参赛者？

由于其复杂性和易于验证，编程竞赛问题逐渐成为评估大型语言模型（LLMs）编码能力的重要基准。然而，当前的编码基准存在一些限制，如缺乏极富挑战性的问题、测试用例覆盖不足以及依赖于限制访问性的在线平台API。为解决这些问题，我们引入了LiveOIBench，这是一个包含403个专家精选的信息学奥林匹克级别编程竞赛问题的综合基准，每个问题平均有60个专家设计的测试用例。这些问题直接来源于2023年至2025年间不同地区14个信息学奥林匹克官方竞赛中的72个。LiveOIBench通过四个关键特性脱颖而出：（1）精心挑选的高质量任务，附有详细的子任务评分标准和广泛的私有测试用例；（2）直接整合顶尖参赛者的表现数据，以实现与顶级人类选手的对比；（3）计划从新发布的奥林匹克问题中持续、无污染地更新；（4）一个自包含的评估系统，便于离线和易于复现的评估。在对34个流行的通用和推理LLMs进行基准测试后，我们发现GPT-5达到了显著的第81.76百分位，这是一个强大的结果，但仍低于顶级人类参赛者，后者通常排名在第90百分位以上。相比之下，开源推理模型GPT-OSS-120B仅达到第60百分位，突显了与前沿封闭模型相比的巨大能力差距。详细分析表明，稳健的推理模型更倾向于精确的问题分析而非过度探索，这表明未来模型应强调结构化分析并尽量减少不必要的探索。所有数据、代码和排行榜结果均可在我们的网站上公开获取。

Summary / 总结

The paper introduces LiveOIBench, a benchmark consisting of 403 expert-curated Olympiad-level competitive programming problems, to evaluate the coding capabilities of large language models (LLMs). It finds that GPT-5 performs at an 81.76th percentile, outperforming some open-weight models but still falling short of top human contestants. The study highlights the need for models to prioritize structured analysis over excessive exploration.

论文介绍了LiveOIBench，这是一个包含403个专家精选的奥林匹克级别编程竞赛问题的基准，用于评估大型语言模型（LLMs）的编程能力。研究发现，GPT-5的表现为第81.76百分位，虽然优于一些开源模型，但仍不及顶级人类选手。研究强调，模型应优先进行结构化分析而非过度探索。

VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation

Authors: Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao

First: 2025-12-22T18:54:30+00:00 · Latest: 2025-12-22T18:54:30+00:00

Comments: 21 pages, 24 figures

Abstract

Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment leads to generated token sequences that may decode into low-quality images, without direct supervision from the pixel space. We propose VA-$π$, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-$π$ formulates the generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize under the discrete token space, VA-$π$ introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward is measured by how well the predicted token sequences can reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO serves as a natural regularizer, maintaining distributional consistency of tokens. VA-$π$ enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains in the text-to-image task on GenEval for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.

中文标题/摘要

标题：VA-$π$: 变分策略对齐以实现像素感知的自回归生成

自回归（AR）视觉生成依赖于分词器将图像映射为和从离散序列中重建。然而，分词器被训练以从真实标记重建干净的图像，而AR生成器仅优化标记的似然性。这种不一致导致生成的标记序列可能解码成低质量的图像，而没有来自像素空间的直接监督。我们提出VA-$π$，一种轻量级的后训练框架，直接优化AR模型以像素空间目标为原则。VA-$π$将生成器-分词器对齐形式化为变分优化，推导出证据下界（ELBO），统一像素重建和自回归建模。为了在离散标记空间中优化，VA-$π$引入了一种基于强化学习的对齐策略，将AR生成器视为策略，使用像素空间重建质量作为其固有奖励。奖励通过预测的标记序列在教师强迫下如何重建原始图像来衡量，从而为模型提供直接的像素级指导，而无需昂贵的自由运行采样。ELBO的正则化项作为自然的正则化器，保持标记的分布一致性。VA-$π$能够快速适应现有的AR生成器，无需重新训练分词器或外部奖励模型。仅使用1%的ImageNet-1K数据和25分钟的调优，它将FID从14.36降低到7.65，将IS从86.55提高到116.70，同时在LlamaGen-XXL上也提高了文本到图像任务在GenEval中的表现，对于视觉生成模型（LlamaGen：从0.306提高到0.339）和统一多模态模型（Janus-Pro：从0.725提高到0.744）。代码可在https://github.com/Lil-Shake/VA-Pi/获取。

Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918)

Authors: Niclas Griesshaber, Jochen Streb

First: 2025-12-22T18:53:03+00:00 · Latest: 2025-12-22T18:53:03+00:00

Abstract

We leverage multimodal large language models (LLMs) to construct a dataset of 306,070 German patents (1877-1918) from 9,562 archival image scans using our LLM-based pipeline powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite. Our benchmarking exercise provides tentative evidence that multimodal LLMs can create higher quality datasets than our research assistants, while also being more than 795 times faster and 205 times cheaper in constructing the patent dataset from our image corpus. About 20 to 50 patent entries are embedded on each page, arranged in a double-column format and printed in Gothic and Roman fonts. The font and layout complexity of our primary source material suggests to us that multimodal LLMs are a paradigm shift in how datasets are constructed in economic history. We open-source our benchmarking and patent datasets as well as our LLM-based data pipeline, which can be easily adapted to other image corpora using LLM-assisted coding tools, lowering the barriers for less technical researchers. Finally, we explain the economics of deploying LLMs for historical dataset construction and conclude by speculating on the potential implications for the field of economic history.

中文标题/摘要

标题：基于档案图像扫描的多模态LLM构建历史数据集：德国专利（1877-1918）

我们利用多模态大型语言模型（LLM）从9,562份档案图像扫描中构建了1877年至1918年间共计306,070份德国专利数据集，使用了由Gemini-2.5-Pro和Gemini-2.5-Flash-Lite驱动的基于LLM的流水线。我们的基准测试提供了初步证据，表明多模态LLM可以创建比我们的研究助理质量更高的数据集，同时在从图像语料库构建专利数据集时速度快795倍以上，成本低205倍。每页约包含20至50项专利条目，采用双栏格式，使用哥特体和罗马字体印刷。我们的原始资料的字体和布局复杂性使我们相信，多模态LLM是经济史中数据集构建范式的转变。我们开源了基准测试、专利数据集以及基于LLM的数据流水线，该流水线可以借助LLM辅助编码工具轻松适配其他图像语料库，降低了非技术研究人员的门槛。最后，我们解释了部署LLM构建历史数据集的经济性，并推测这对经济史领域的潜在影响。

Summary / 总结

This study uses multimodal large language models (LLMs) to construct a dataset of 306,070 German patents from 1877 to 1918 using 9,562 archival image scans. The pipeline, powered by Gemini-2.5-Pro and Gemini-2.5-Flash-Lite, provides tentative evidence that multimodal LLMs can create higher quality datasets than human research assistants while being more than 795 times faster and 205 times cheaper. The study suggests that multimodal LLMs represent a paradigm shift in how datasets are constructed in economic history. The benchmarking and patent datasets and the pipeline are open-sourced to facilitate broader use.

本研究利用多模态大型语言模型（LLMs）从1877年至1918年的9,562份档案图像扫描中构建了306,070项德国专利数据，使用了Gemini-2.5-Pro和Gemini-2.5-Flash-Lite驱动的流水线。研究显示，这些模型可以比人类研究助理更快（超过795倍）更便宜（超过205倍）地创建高质量的数据集。该数据集和基于LLM的数据流水线已开源，以促进经济史领域的类似项目。

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Authors: Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

First: 2025-12-22T18:51:48+00:00 · Latest: 2025-12-22T18:51:48+00:00

Comments: Preprint. Our code is available at https://github.com/Trae1ounG/BuPO

Abstract

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of internal policies, we find that: (a) Early layers keep high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) Llama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
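The idea of reading a policy off an intermediate layer can be sketched with toy numbers: add each layer's update to the residual stream, project the running hidden state through the unembedding matrix, apply softmax, and measure entropy. The matrices, dimensions, and function names below are invented for illustration, and the sketch omits the final normalization that real Transformers apply before unembedding.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def unembed(hidden, W_U):
    """Project a d_model vector to vocabulary logits; W_U is d_model x vocab."""
    vocab = len(W_U[0])
    return [sum(h * W_U[d][v] for d, h in enumerate(hidden)) for v in range(vocab)]

def internal_policy_entropies(embedding, layer_updates, W_U):
    """Entropy of softmax(h_l @ W_U) after each residual-stream update."""
    h = list(embedding)
    entropies = []
    for update in layer_updates:
        h = [a + b for a, b in zip(h, update)]  # residual connection
        entropies.append(entropy(softmax(unembed(h, W_U))))
    return entropies

# Toy model: d_model = 2, vocabulary of 3 tokens, three layers.
W_U = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0]]
embedding = [0.0, 0.0]
layer_updates = [[0.1, 0.0], [1.0, 0.0], [5.0, 0.0]]

ents = internal_policy_entropies(embedding, layer_updates, W_U)
print(ents)  # entropy shrinks as later layers commit to token 0
```

Splitting each `update` into separate attention and FFN contributions would give the per-module view in the same way, since the projection is linear in the residual stream.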

中文标题/摘要

标题：自底向上的策略优化：您的语言模型策略中隐含着内部策略

现有的强化学习（RL）方法将大型语言模型（LLMs）视为单一统一的策略，忽视了其内部机制。因此，理解策略在各层和模块中的演变对于实现更精确的优化和揭示复杂的推理机制至关重要。在本文中，我们通过利用Transformer残差流的内在分割以及隐藏状态组成与未嵌入矩阵之间的等价性来分解语言模型策略，揭示了内部层策略和内部模块策略。通过分析内部策略的熵，我们发现：(a) 早期层保持高熵以进行探索，顶层层收敛到接近零的熵以进行细化，不同模型系列的收敛模式不同。(b) LLama在最终层的预测空间迅速收敛，而Qwen系列模型，尤其是Qwen3，表现出更接近人类的、逐步结构化的推理模式。受这些发现的启发，我们提出了自底向上的策略优化（BuPO），这是一种新的RL范式，在早期训练中直接优化内部层策略。通过在较低层对训练目标进行对齐，BuPO重建了基础的推理能力并实现了更好的性能。在复杂的推理基准测试中的广泛实验表明了我们方法的有效性。我们的代码可在https://github.com/Trae1ounG/BuPO获取。

Summary / 总结

This paper addresses the limitation of treating large language models as a single unified policy in reinforcement learning, proposing a method to decompose the policy into internal layer policies and modular policies. By analyzing the entropy of these internal policies, the authors find that early layers maintain high entropy for exploration, while top layers converge to low entropy for refinement. They introduce Bottom-up Policy Optimization (BuPO), which directly optimizes the internal layer policy during early training, leading to improved performance on complex reasoning benchmarks.

本文解决了现有强化学习方法将大型语言模型视为单一统一策略的局限性。通过分解语言模型策略，作者揭示了内部层策略和内部模块策略。他们发现早期层保持高熵以进行探索，而顶层则收敛到接近零的熵以进行细化。基于这些发现，他们提出了自底向上的策略优化（BuPO），该方法在早期训练中直接优化内部层策略，从而在复杂推理基准测试中表现出色。

Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis

Authors: Argha Kamal Samanta, Harshika Goyal, Vasudha Joshi, Tushar Mungle, Pabitra Mitra

First: 2025-12-22T18:41:45+00:00 · Latest: 2025-12-22T18:41:45+00:00

Comments: 14 pages, 14 figures

Abstract

Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
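Recall@1, the retrieval metric quoted above, can be computed from a query-by-candidate similarity matrix as sketched below. The sketch assumes each text query has exactly one correct image at the same index, which is a common evaluation convention but is not confirmed by the abstract; the scores are made up.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose ground-truth item appears in the top k.

    similarity[i][j] is the score of text query i against image j;
    the ground-truth image for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(similarity)

# Hypothetical scores for three text queries against three images.
scores = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own image first
    [0.2, 0.8, 0.1],  # query 1 ranks its own image first
    [0.7, 0.2, 0.3],  # query 2 ranks image 0 first, its own image second
]

print(recall_at_k(scores, k=1))
print(recall_at_k(scores, k=2))
```

With many near-duplicate captions in a dataset, chance-level Recall@1 can be very low, which is worth keeping in mind when comparing the reported figures.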

中文标题/摘要

标题：超越CLIP：知识增强的多模态变换器在糖尿病视网膜病变诊断中的跨模态对齐

糖尿病视网膜病变（DR）是全球可预防失明的主要原因，需要准确的自动化诊断系统。虽然通用领域的视觉-语言模型如对比语言-图像预训练（CLIP）在自然图像任务上表现良好，但在医学领域的应用中却遇到困难，特别是在眼科图像的跨模态检索方面。我们提出了一种新颖的知识增强联合嵌入框架，通过多模态变换器架构将视网膜底片图像、临床文本和结构化患者数据结合起来，以解决医学图像-文本对齐的关键差距。我们的方法为每种模态使用单独的编码器：用于视网膜图像的视觉变换器（ViT-B/16），用于临床叙述的Bio-ClinicalBERT，以及用于结构化人口统计和临床特征的多层感知器。这些模态通过具有模态特定嵌入的联合变换器融合，使用包括模态对之间的对比损失、图像和文本的重构损失以及根据ICDR和SDRG方案的DR严重程度分类损失的多个目标进行训练。在巴西多标签眼科数据集（BRSET）上的实验结果表明，与基线模型相比有显著改进。我们的框架在文本到图像检索性能上达到99.94%的召回率@1，而微调后的CLIP仅为1.29%，同时保持SDRG分类准确率为97.05%，ICDR分类准确率为97.97%。此外，对未见过的DeepEyeNet数据集的零样本评估验证了其强大的泛化能力，召回率@1为93.95%，而微调后的CLIP仅为0.22%。这些结果表明，我们的多模态训练方法有效地捕捉了医学领域的跨模态关系，建立了卓越的检索能力和稳健的诊断性能。

Summary / 总结

This study addresses the challenge of accurate automated diagnosis of diabetic retinopathy (DR) by proposing a knowledge-enhanced joint embedding framework using a multimodal transformer architecture. The framework integrates retinal fundus images, clinical text, and structured patient data through separate encoders and a joint transformer, trained with multiple objectives. Experimental results on the BRSET dataset show significant improvements over baseline models, achieving near-perfect text-to-image retrieval performance and maintaining state-of-the-art classification accuracy for DR severity grading. Zero-shot evaluation on the DeepEyeNet dataset further validates the model's generalizability.

研究旨在通过解决通用视觉-语言模型在医疗应用中的局限性，提高糖尿病视网膜病变的自动化诊断系统。作者提出了一种知识增强的联合嵌入框架，使用多模态变压器架构整合视网膜图像、临床文本和结构化患者数据。该框架在基准模型上取得了显著改进，实现了近乎完美的文本到图像检索性能和糖尿病视网膜病变严重程度分级的最新分类准确率。未见过的数据集上的零样本评估进一步验证了其强大的泛化能力。

Over++: Generative Video Compositing for Layer Interaction Effects

Authors: Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta, Dan B Goldman

First: 2025-12-22T18:39:58+00:00 · Latest: 2025-12-22T18:39:58+00:00

Comments: Project page: https://overplusplus.github.io/

Abstract

In professional video compositing workflows, artists must manually create environmental interactions (such as shadows, reflections, dust, and splashes) between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.
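For context on layer-based compositing, the sketch below shows the classical "over" operator for a single pixel, which blends a semi-transparent foreground onto an opaque background. This is standard alpha compositing, not the Over++ method; the function name and the non-premultiplied color convention are assumptions chosen for simplicity.

```python
def over(fg_rgb, fg_alpha, bg_rgb):
    """Composite a non-premultiplied foreground pixel over an opaque background."""
    return tuple(fg_alpha * f + (1.0 - fg_alpha) * b
                 for f, b in zip(fg_rgb, bg_rgb))

# A 50%-opaque black shadow layered over a warm background pixel.
shadow_rgb, shadow_alpha = (0.0, 0.0, 0.0), 0.5
background = (1.0, 0.8, 0.6)

print(over(shadow_rgb, shadow_alpha, background))
```

Effects such as shadows or dust are semi-transparent in exactly this sense: they modulate the background rather than replace it, which is why a plain foreground-over-background composite lacks them unless an artist or a model supplies the extra layer.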

中文标题/摘要

标题：Over++：生成视频合成以实现图层交互效果

在专业的视频合成工作流程中，艺术家必须手动创建前景主体与背景图层之间的环境交互，如阴影、反射、灰尘和水花。现有的视频生成模型难以在保留输入视频的同时添加这些效果，而当前的视频修补方法要么需要昂贵的逐帧掩码，要么产生不切实际的结果。我们引入了增强合成这一新任务，该任务根据文本提示和输入视频图层合成现实且半透明的环境效果，同时保留原始场景。为了解决这一任务，我们提出了Over++，这是一种无需假设摄像机姿态、场景静止或深度监督的视频效果生成框架。我们构建了一个针对此任务的配对效果数据集，并引入了一种无配对增强策略，以保留文本驱动的可编辑性。我们的方法还支持可选的掩码控制和关键帧指导，而无需密集标注。尽管训练数据有限，Over++仍能生成多样且现实的环境效果，并在效果生成和场景保留方面优于现有基线。

Summary / 总结

The research aims to automate the creation of environmental interactions in video compositing, such as shadows and reflections, which are typically done manually. Over++ is a framework that generates realistic, semi-transparent effects without requiring dense annotations or depth information. The method uses an unpaired augmentation strategy and can support optional mask control and keyframe guidance. Experimental results show that Over++ can produce diverse and realistic effects while preserving the original scene better than existing methods.

研究旨在自动化视频合成中的环境交互效果创建，如阴影和反射，这些通常需要手动完成。Over++ 是一个不需要密集标注或深度信息的框架，能够生成现实且半透明的效果。该方法使用未配对的增强策略，并支持可选的掩码控制和关键帧指导。实验结果表明，Over++ 能够生成多样且现实的效果，同时在场景保留方面优于现有方法。

CodeTF: One-stop Transformer Library for State-of-the-art Code LLMs

Authors: Nghi D. Q. Bui, Hung Le, Yue Wang, Junnan Li, Akhilesh Deepak Gotmare, Steven C. H. Hoi

First: 2023-05-31T05:24:48+00:00 · Latest: 2025-12-22T18:29:22+00:00

Abstract

Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling code intelligence tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier to model adoption. In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Following the principles of modular design and extensible framework, we design CodeTF with a unified interface to enable rapid access and development across different types of models, datasets and tasks. Our library supports a collection of pretrained Code LLM models and popular code benchmarks, including a standardized interface to train and serve code LLMs efficiently, and data features such as language-specific parsers and utility functions for extracting code attributes. We describe the design principles, the architecture, key modules and components, and compare with other related library tools. Finally, we hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive open-source solution for developers, researchers, and practitioners.

中文标题/摘要

标题：CodeTF：最先进的代码LLM的即用型Transformer库

代码智能在现代软件工程的转型中扮演着关键角色。近年来，基于深度学习的模型，尤其是基于Transformer的大型语言模型（LLMs），通过利用海量的开源代码数据和编程语言特性，在处理这些任务方面展现了显著的潜力。然而，开发和部署此类模型通常需要同时具备机器学习和软件工程的专业知识，这成为模型采用的障碍。在本文中，我们介绍了CodeTF，一个开源的基于Transformer的库，用于最先进的代码LLM和代码智能。遵循模块化设计和可扩展框架的原则，我们设计了CodeTF，提供统一接口，以实现不同类型模型、数据集和任务的快速访问和开发。我们的库支持一系列预训练的代码LLM模型和流行的代码基准，包括标准化接口以高效训练和提供代码LLM服务，以及语言特定的解析器和用于提取代码属性的实用函数。在本文中，我们描述了设计原则、架构、关键模块和组件，并与其他相关库工具进行了比较。最后，我们希望CodeTF能够弥合机器学习/生成式AI与软件工程之间的差距，为开发者、研究人员和实践者提供一个全面的开源解决方案。

Summary / 总结

CodeTF is an open-source Transformer-based library designed to facilitate the development and deployment of state-of-the-art code large language models (LLMs) for code intelligence tasks. It follows modular design principles to provide a unified interface for accessing and developing different models, datasets, and tasks. Key features include support for pretrained Code LLM models, popular code benchmarks, and standardized interfaces for training and serving these models, along with language-specific parsers and utility functions. The library aims to bridge the gap between machine learning and software engineering, offering a comprehensive solution for developers and researchers.

本文介绍了CodeTF，一个开源的Transformer库，旨在简化先进代码大语言模型（LLM）的开发和部署。CodeTF的设计动机是弥合机器学习与软件工程之间的差距，提供一个统一的接口，以便快速访问和开发不同的模型、数据集和任务。主要功能包括支持预训练的Code LLM模型和流行的代码基准、用于高效训练和部署模型的标准化接口，以及语言特定的解析器和提取代码属性的实用函数。

Exploring Zero-Shot ACSA with Unified Meaning Representation in Chain-of-Thought Prompting

Authors: Filippos Ventirozos, Peter Appleby, Matthew Shardlow

First: 2025-12-22T18:23:37+00:00 · Latest: 2025-12-22T18:23:37+00:00

Comments: 9 pages, 3 figures, 3 tables

Abstract

Aspect-Category Sentiment Analysis (ACSA) provides granular insights by identifying specific themes within reviews and their associated sentiment. While supervised learning approaches dominate this field, the scarcity and high cost of annotated data for new domains present significant barriers. We argue that leveraging large language models (LLMs) in a zero-shot setting is a practical alternative where resources for data annotation are limited. In this work, we propose a novel Chain-of-Thought (CoT) prompting technique that utilises an intermediate Unified Meaning Representation (UMR) to structure the reasoning process for the ACSA task. We evaluate this UMR-based approach against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four diverse datasets. Our findings suggest that UMR effectiveness may be model-dependent. Whilst preliminary results indicate comparable performance for mid-sized models such as Qwen3-8B, these observations warrant further investigation, particularly regarding the potential applicability to smaller model architectures. Further research is required to establish the generalisability of these findings across different model scales.

中文标题/摘要

标题：探索在思维链提示中使用统一意义表示的零样本ACSA

方面-类别情感分析（ACSA）通过识别评论中的特定主题及其相关情感提供了详细的见解。尽管监督学习方法主导了这一领域，但新领域标注数据的稀缺性和高成本构成了重大障碍。我们认为，在数据标注资源有限的情况下，利用大型语言模型（LLMs）在零样本设置中是一种实用的替代方案。在本文中，我们提出了一种新颖的链式思考（CoT）提示技术，利用中间的统一意义表示（UMR）来结构化ACSA任务的推理过程。我们使用Qwen3-4B、Qwen3-8B和Gemini-2.5-Pro三种模型和四个不同数据集对基于UMR的方法与标准CoT基线进行了评估。我们的研究结果表明，UMR的有效性可能依赖于模型。虽然初步结果显示中型模型如Qwen3-8B的性能相当，但这些观察结果需要进一步研究，特别是关于其在较小模型架构中的潜在适用性。进一步的研究是必要的，以确定这些发现是否适用于不同模型规模。

Summary / 总结

The paper explores a Chain-of-Thought (CoT) prompting technique that uses an intermediate Unified Meaning Representation (UMR) for zero-shot Aspect-Category Sentiment Analysis (ACSA), motivated by the scarcity and cost of annotated data for new domains. The study evaluates this UMR-based method against a standard CoT baseline across three models (Qwen3-4B, Qwen3-8B, and Gemini-2.5-Pro) and four datasets. The findings suggest that UMR effectiveness may be model-dependent: preliminary results indicate comparable performance to the standard CoT baseline for mid-sized models such as Qwen3-8B, and the authors call for further work on other model scales.

研究旨在通过使用大型语言模型（LLMs）提出零样本方法来解决新领域中注释数据有限的问题，以应对方面-类别情感分析（ACSA）的挑战。方法包括使用中间统一意义表示（UMR）的链式思考（CoT）提示技术来结构化推理过程。在三个模型和四个数据集上的评估表明，UMR的有效性因模型大小而异，中型模型如Qwen3-8B的性能与标准CoT基线相当，但需要进一步研究以适用于较小的模型架构。

AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models

Authors: Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang

First: 2025-09-16T06:16:05+00:00 · Latest: 2025-12-22T18:22:20+00:00

Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.

中文标题/摘要

标题：AsyMoE：利用模态不对称性增强大型视觉-语言模型专家专业化

大型视觉-语言模型（LVLMs）通过扩展架构和大量训练，在多模态任务中表现出色。然而，现有的混合专家（MoE）方法由于视觉和语言处理之间的不对称性而面临挑战。视觉信息是空间上完整的，而语言需要保持顺序上下文。因此，MoE模型难以平衡模态特定特征和跨模态交互。通过系统分析，我们观察到，语言专家在更深的层中逐渐失去上下文基础，更多依赖参数知识，而不是利用提供的视觉和语言信息。为了解决这个问题，我们提出了一种新的AsyMoE架构，该架构使用三个专门的专家组来建模这种不对称性。我们设计了模态内专家进行模态特定处理，双曲跨模态专家进行分层跨模态交互，并设计了证据优先语言专家以抑制参数偏差并保持上下文基础。广泛的实验表明，与vanilla MoE和模态特定MoE相比，AsyMoE分别实现了26.58%和15.45%的准确率提升，且激活的参数比密集模型少25.45%。

Summary / 总结

The paper addresses the challenges faced by existing Mixture of Experts (MoE) approaches in large Vision-Language Models (LVLMs) due to the asymmetry between visual and linguistic processing. It proposes AsyMoE, which models this asymmetry using three specialized expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to maintain contextual grounding. Despite the promising results, the authors withdrew the submission due to a fundamental error in the methodology that affects the validity of the main results.

论文针对现有混合专家（MoE）方法在大型视觉-语言模型（LVLM）中面临的视觉和语言处理不对称性挑战。它提出了AsyMoE，该方法使用三种专门的专家组：模态内专家进行模态特定处理，双曲跨模态专家进行分层跨模态交互，以及证据优先语言专家以保持上下文基础。尽管取得了有希望的结果，但由于方法中的根本错误影响了主要结果的有效性，作者撤回了提交。

GraphGeo: Multi-Agent Debate Framework for Visual Geo-localization with Heterogeneous Graph Neural Networks

Authors: Heng Zheng, Yuling Shi, Xiaodong Gu, Haochen You, Zijian Zhang, Lubin Gan, Hao Zhang, Wenjun Huang, Jin Huang

First: 2025-11-02T11:58:55+00:00 · Latest: 2025-12-22T18:21:18+00:00

Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

Abstract

Visual geo-localization requires extensive geographic knowledge and sophisticated reasoning to determine image locations without GPS metadata. Traditional retrieval methods are constrained by database coverage and quality. Recent Large Vision-Language Models (LVLMs) enable direct location reasoning from image content, yet individual models struggle with diverse geographic regions and complex scenes. Existing multi-agent systems improve performance through model collaboration but treat all agent interactions uniformly. They lack mechanisms to handle conflicting predictions effectively. We propose GraphGeo, a multi-agent debate framework using heterogeneous graph neural networks for visual geo-localization. Our approach models diverse debate relationships through typed edges, distinguishing supportive collaboration, competitive argumentation, and knowledge transfer. We introduce a dual-level debate mechanism combining node-level refinement and edge-level argumentation modeling. A cross-level topology refinement strategy enables co-evolution between graph structure and agent representations. Experiments on multiple benchmarks demonstrate GraphGeo significantly outperforms state-of-the-art methods. Our framework transforms cognitive conflicts between agents into enhanced geo-localization accuracy through structured debate.

中文标题/摘要

标题：GraphGeo：基于异构图神经网络的多智能体视觉地理定位框架

视觉地理定位需要广泛的地理知识和复杂的推理来确定图像位置，而不依赖GPS元数据。传统的检索方法受限于数据库的覆盖范围和质量。最近的大规模视觉-语言模型（LVLMs）能够直接从图像内容进行位置推理，但单个模型难以处理多样的地理区域和复杂的场景。现有的多智能体系统通过模型协作来提高性能，但对所有智能体交互一视同仁。它们缺乏有效处理相互矛盾预测的机制。我们提出 GraphGeo，一种使用异构图神经网络的多智能体辩论框架，用于视觉地理定位。我们的方法通过类型化的边来建模多样的辩论关系，区分支持性合作、竞争性论辩和知识转移。我们引入了一种双层辩论机制，结合节点级细化和边级论辩建模。跨层拓扑细化策略使图结构和智能体表示能够共同进化。在多个基准上的实验表明，GraphGeo 显著优于现有最佳方法。我们的框架通过结构化的辩论将智能体之间的认知冲突转化为增强的地理定位准确性。

Summary / 总结

GraphGeo is a multi-agent debate framework for visual geo-localization using heterogeneous graph neural networks. It models diverse debate relationships and introduces a dual-level debate mechanism for node-level refinement and edge-level argumentation. Experiments show that GraphGeo significantly outperforms state-of-the-art methods. However, the submission was withdrawn due to a fundamental error in the methodology that affects the validity of the results.

GraphGeo 是一种使用异构图神经网络的多代理辩论框架，用于视觉地理定位。它建模了多样化的辩论关系，并引入了双层辩论机制以增强推理。实验表明，GraphGeo 显著优于现有最佳方法。然而，由于方法中的根本错误，该提交已被作者撤回，这影响了主要结果的有效性。

GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs

Authors: Heng Zhang, Tianyi Zhang, Yuling Shi, Xiaodong Gu, Yaomin Shen, Haochen You, Zijian Zhang, Yilei Yuan, Jin Huang

First: 2025-10-14T02:48:50+00:00 · Latest: 2025-12-22T18:20:12+00:00

Comments: This submission has been withdrawn by the authors due to a fundamental error in the methodology that affects the validity of the main results

Abstract

Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topological patterns converge, with accuracy losses exceeding 20 percentage points. This issue arises from a key limitation: current methods assume all graph structures can be encoded within a single Euclidean space. In reality, tree structures require hyperbolic geometry to preserve hierarchical branching, while cyclic patterns depend on spherical geometry for closure properties. At structural boundaries, nodes experience conflicting geometric constraints that uniform encoding spaces cannot resolve. This raises a crucial challenge: Can alignment frameworks be designed to respect the intrinsic geometric diversity of graph structures? We introduce GraphShaper, a geometry-aware framework that enhances graph encoding through multi-geometric specialization. Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties based on local structural characteristics. This adaptive fusion preserves structural integrity before alignment with text embeddings. Extensive experiments demonstrate that GraphShaper achieves 9.47% accuracy improvements on citation networks and 7.63% on social networks in zero-shot settings.

中文标题/摘要

标题：GraphShaper：几何感知对齐以提高文本标注图的迁移学习

图基础模型代表了一种变革性的范式，用于在多种图领域中学习可迁移的表示。最近的方法利用大型语言模型通过对比学习将图和文本模态统一到共享表示空间中。然而，系统评估表明，在不同拓扑模式交汇的结构边界处，性能显著下降，准确率损失超过20个百分点。这一问题源于一个关键限制：当前方法假设所有图结构都可以编码在一个单一的欧几里得空间内。实际上，树结构需要双曲几何来保持层次分支，而循环模式则依赖于球面几何以保持闭合性质。在结构边界处，节点会受到冲突的几何约束，而统一的编码空间无法解决这一问题。这提出了一个关键挑战：能否设计对齐框架以尊重图结构的内在几何多样性？我们引入了GraphShaper，这是一种几何感知框架，通过多几何专业化增强图编码。我们的方法采用针对不同几何空间定制的专家网络，动态计算融合权重，根据局部结构特征自适应地整合几何属性。这种自适应融合在对齐文本嵌入之前保留了结构完整性。大量实验表明，在零样本设置下，GraphShaper在引用网络中实现了9.47%的准确率提升，在社交网络中实现了7.63%的提升。

Summary / 总结

GraphShaper is a geometry-aware framework designed to improve transfer learning in text-attributed graphs by addressing the limitations of current methods that assume a single Euclidean space for all graph structures. It uses expert networks for different geometric spaces and dynamically computes fusion weights to adaptively integrate geometric properties based on local structural characteristics. The experiments report accuracy improvements of 9.47% on citation networks and 7.63% on social networks in zero-shot settings. However, the authors have withdrawn the submission due to a fundamental error in the methodology that affects the validity of the main results.

GraphShaper 是一个几何感知框架，旨在通过解决现有方法假设所有图结构都能在一个单一欧几里得空间中编码的问题来提高文本标注图的迁移学习效果。它使用针对不同几何空间的专家网络，并动态计算融合权重以根据局部结构特征自适应地整合几何属性。实验报告称，GraphShaper 在零样本设置下将引文网络的准确率提高了 9.47%，社交网络提高了 7.63%。然而，由于方法中存在影响主要结果有效性的根本错误，该提交已被作者撤回。

WANDER: An Explainable Decision-Support Framework for HPC

Authors: Ankur Lahiry, Banooqa Banday, Tanzima Z. Islam

First: 2025-06-04T15:15:23+00:00 · Latest: 2025-12-22T18:19:18+00:00

Abstract

High-performance computing (HPC) systems expose many interdependent configuration knobs that impact runtime, resource usage, power, and variability. Existing predictive tools model these outcomes, but do not support structured exploration, explanation, or guided reconfiguration. We present WANDER, a decision-support framework that synthesizes alternate configurations using counterfactual analysis aligned with user goals and constraints. We introduce a composite trade-off score that ranks suggestions based on prediction uncertainty, consistency between feature-target relationships using causal models, and similarity between feature distributions against historical data. To our knowledge, WANDER is the first such system to unify prediction, exploration, and explanation for HPC tuning under a common query interface. Across multiple datasets, WANDER generates interpretable, trustworthy, human-readable alternatives that guide users to achieve their performance objectives.

中文标题/摘要

标题：WANDER：一种面向HPC的可解释决策支持框架

高性能计算（HPC）系统暴露了许多相互依赖的配置旋钮，这些旋钮影响运行时、资源使用、功耗和变异性。现有的预测工具可以建模这些结果，但不支持结构化的探索、解释或引导式重新配置。我们提出了WANDER，这是一种决策支持框架，通过基于用户目标和约束的反事实分析综合替代配置。我们引入了一个综合权衡分数，该分数根据预测不确定性、因果模型中特征-目标关系的一致性以及与历史数据中特征分布的相似性对建议进行排名。据我们所知，WANDER是第一个将预测、探索和解释统一到一个查询接口中的HPC调优系统。在多个数据集上，WANDER生成可解释和可靠的、易于理解的替代方案，引导用户实现其性能目标。

Summary / 总结

WANDER is a decision-support framework for HPC that uses counterfactual analysis to synthesize alternate configurations based on user goals and constraints. It ranks suggestions using a composite score that considers prediction uncertainty, causal consistency, and historical data similarity. WANDER generates interpretable and trustworthy alternatives that help users achieve their performance objectives across various datasets.

研究旨在通过开发WANDER框架解决HPC系统配置的复杂性，该框架利用反事实分析提出与用户目标一致的配置建议。WANDER结合预测不确定性、因果一致性以及历史数据相似性来评估建议，生成可解释且可靠的替代方案，帮助用户实现性能目标。WANDER是首个在单一查询接口中统一预测、探索和解释的HPC调优系统。

InterPose: Learning to Generate Human-Object Interactions from Large-Scale Web Videos

Authors: Yangsong Zhang, Abdul Ahad Butt, Gül Varol, Ivan Laptev

First: 2025-08-31T09:38:59+00:00 · Latest: 2025-12-22T18:14:12+00:00

Comments: Accepted to 3DV 2026. Project page: https://mael-zys.github.io/InterPose/

Abstract

Human motion generation has shown great advances thanks to recent diffusion models trained on large-scale motion capture data. Most existing works, however, target animation of isolated people in empty scenes. Meanwhile, synthesizing realistic human-object interactions in complex 3D scenes remains a critical challenge in computer graphics and robotics. One obstacle to generating versatile high-fidelity human-object interactions is the lack of large-scale datasets with diverse object manipulations. Indeed, existing motion capture data is typically restricted to single people and manipulations of limited sets of objects. To address this issue, we propose an automatic motion extraction pipeline and use it to collect interaction-rich human motions. Our new dataset InterPose contains 73.8K sequences of 3D human motions and corresponding text captions automatically obtained from 45.8K videos with human-object interactions. We perform extensive experiments and demonstrate that InterPose brings significant improvements to state-of-the-art methods for human motion generation. Moreover, using InterPose we develop an LLM-based agent enabling zero-shot animation of people interacting with diverse objects and scenes.

中文标题/摘要

标题：InterPose：从大规模网络视频中学习生成人体-物体交互

人体运动生成得益于最近在大规模运动捕捉数据上训练的扩散模型取得了巨大进展。然而，现有的大多数工作目前主要针对空旷场景中孤立人物的动画生成。同时，在复杂3D场景中合成逼真的人体-物体交互仍然是计算机图形学和机器人学中的一个关键挑战。生成多样化人体-物体交互的一个障碍是缺乏大规模包含多种物体操作的数据集。事实上，现有的运动捕捉数据通常仅限于单个人和有限种类物体的操作。为了解决这一问题，我们提出了一种自动运动提取流水线，并使用它来收集富含交互的人体运动。我们的新数据集InterPose包含来自45,800个包含人体-物体交互的视频中自动获取的73,800个3D人体运动序列及其对应的文本描述。我们进行了广泛的实验，并证明InterPose能够显著提高人体运动生成的最新方法的效果。此外，我们使用InterPose开发了一个基于LLM的代理，使其能够零样本动画生成与多种物体和场景互动的人。

Summary / 总结

The research aims to generate realistic human-object interactions in complex 3D scenes by addressing the lack of large-scale datasets with diverse object manipulations. The authors propose an automatic motion extraction pipeline to collect 73.8K sequences of 3D human motions from 45.8K videos, creating a new dataset called InterPose. Experiments show that InterPose significantly improves state-of-the-art methods for human motion generation and enables zero-shot animation of people interacting with various objects and scenes.

研究旨在通过解决缺乏包含多样化物体操作的大规模数据集问题，生成复杂3D场景中的真实人类物体交互。作者提出了一种自动动作提取管道，从45.8K包含人类物体交互的视频中收集了73.8K个3D人类动作序列，创建了InterPose数据集。实验表明，该数据集显著提升了最先进的动作生成方法，并使人们能够零样本动画与各种物体和场景的交互。

Generative diffusion models for agricultural AI: plant image generation, indoor-to-outdoor translation, and expert preference alignment

Authors: Da Tan, Michael Beck, Christopher P. Bidinosti, Robert H. Gulden, Christopher J. Henry

First: 2025-12-22T18:07:08+00:00 · Latest: 2025-12-22T18:07:08+00:00

Abstract

The success of agricultural artificial intelligence depends heavily on large, diverse, and high-quality plant image datasets, yet collecting such data in real field conditions is costly, labor-intensive, and seasonally constrained. This paper investigates diffusion-based generative modeling to address these challenges through plant image synthesis, indoor-to-outdoor translation, and expert-preference-aligned fine-tuning. First, a Stable Diffusion model is fine-tuned on captioned indoor and outdoor plant imagery to generate realistic, text-conditioned images of canola and soybean. Evaluation using Inception Score, Fréchet Inception Distance, and downstream phenotype classification shows that synthetic images effectively augment training data and improve accuracy. Second, we bridge the gap between high-resolution indoor datasets and limited outdoor imagery using DreamBooth-based text inversion and image-guided diffusion, generating translated images that enhance weed detection and classification with YOLOv8. Finally, a preference-guided fine-tuning framework trains a reward model on expert scores and applies reward-weighted updates to produce more stable and expert-aligned outputs. Together, these components demonstrate a practical pathway toward data-efficient generative pipelines for agricultural AI.

中文标题/摘要

标题：农业AI中的生成扩散模型：植物图像生成、室内到室外转换和专家偏好对齐

农业人工智能的成功高度依赖于大量、多样且高质量的植物图像数据集，但在实际田间条件下收集此类数据成本高、劳动密集且受季节限制。本文通过植物图像合成、室内到室外转换和专家偏好对齐微调，研究基于扩散的生成建模以应对这些挑战。首先，使用带有描述的室内和室外植物图像对Stable Diffusion模型进行微调，生成符合文本条件的油菜和大豆的逼真图像。使用Inception Score、Fréchet Inception Distance和下游表型分类评估显示，合成图像有效地扩充了训练数据并提高了准确性。其次，通过基于DreamBooth的文本反转和图像引导扩散，弥合高分辨率室内数据集与有限的室外图像之间的差距，生成的转换图像提升了基于YOLOv8的杂草检测和分类效果。最后，一个偏好引导的微调框架在专家评分上训练奖励模型，并应用奖励加权更新以产生更稳定和专家对齐的输出。这些组件共同展示了农业AI中数据高效生成管道的实用途径。

Summary / 总结

This paper addresses the challenges of collecting large, diverse, and high-quality plant image datasets for agricultural AI by using diffusion-based generative models. It fine-tunes a Stable Diffusion model on indoor and outdoor plant images to generate realistic images of canola and soybean, improving phenotype classification accuracy. The study also bridges the gap between indoor and outdoor imagery using DreamBooth-based text inversion and image-guided diffusion, enhancing weed detection and classification. Additionally, a preference-guided fine-tuning framework aligns model outputs with expert preferences, producing more stable and aligned results. These methods collectively offer a practical solution for data-efficient generative pipelines in agricultural AI.

本文通过使用基于扩散的生成模型来解决农业人工智能中收集大量、多样且高质量植物图像数据的挑战。作者通过在室内和室外植物图像上微调Stable Diffusion模型，生成了真实且文本条件化的油菜和大豆图像，从而提高了表型分类的准确性。他们还使用DreamBooth基于文本的反转和图像引导的扩散，将高分辨率的室内图像转换为室外图像，从而增强了杂草的检测和分类。此外，还开发了一种偏好导向的微调框架，根据专家评分训练奖励模型，并应用奖励加权更新，以生成更稳定且与专家偏好一致的结果。
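
The abstract above reports evaluation with the Fréchet Inception Distance. As a rough illustration of what that metric computes, the sketch below implements the standard closed-form Fréchet distance between two Gaussians given their means and covariances. It assumes the feature statistics have already been extracted (in practice from Inception-v3 activations, which this sketch does not do), and the helper name `frechet_distance` is hypothetical rather than taken from the paper.

```python
import numpy as np


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    Formula: ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2)).
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    diff = mu1 - mu2
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of sigma1 @ sigma2. For positive semi-definite
    # covariances these are real and non-negative up to numerical noise,
    # so we drop tiny imaginary parts and clip small negatives to zero.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


# Toy statistics standing in for real and synthetic image features.
identity = np.eye(2)
same = frechet_distance([0.0, 0.0], identity, [0.0, 0.0], identity)
shifted = frechet_distance([0.0, 0.0], identity, [1.0, 2.0], identity)
print(same, shifted)
```

With equal covariances the distance reduces to the squared distance between the means, which makes the toy case easy to check by hand. Lower values indicate that the two feature distributions are closer.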

Diacritic Restoration for Low-Resource Indigenous Languages: Case Study with Bribri and Cook Islands Māori

Authors: Rolando Coto-Solano, Daisy Li, Manoela Teleginski Ferraz, Olivia Sasse, Cha Krupka, Sharid Loáiciga, Sally Akevai Tenamu Nicholas

First: 2025-12-22T18:04:24+00:00 · Latest: 2025-12-22T18:04:24+00:00

Abstract

We present experiments on diacritic restoration, a form of text normalization essential for natural language processing (NLP) tasks. Our study focuses on two extremely under-resourced languages: Bribri, a Chibchan language spoken in Costa Rica, and Cook Islands Māori, a Polynesian language spoken in the Cook Islands. Specifically, this paper: (i) compares algorithms for diacritics restoration in under-resourced languages, including tonal diacritics, (ii) examines the amount of data required to achieve target performance levels, (iii) contrasts results across varying resource conditions, and (iv) explores the related task of diacritic correction. We find that fine-tuned, character-level LLMs perform best, likely due to their ability to decompose complex characters into their UTF-8 byte representations. In contrast, massively multilingual models perform less effectively given our data constraints. Across all models, reliable performance begins to emerge with data budgets of around 10,000 words. Zero-shot approaches perform poorly in all cases. This study responds both to requests from the language communities and to broader NLP research questions concerning model performance and generalization in under-resourced contexts.

中文标题/摘要

标题：低资源土著语言的重音符号恢复：布里布里语和库克群岛毛利语案例研究

我们介绍了重音符号恢复实验，这是一种对于自然语言处理（NLP）任务至关重要的文本规范化形式。本研究集中于两种极度资源匮乏的语言：布里布里语，一种在哥斯达黎加使用的奇布查语系语言；以及库克群岛毛利语，一种在库克群岛使用的波利尼西亚语系语言。具体而言，本文：(i) 比较了在资源匮乏语言中进行重音符号恢复的算法，包括声调重音符号；(ii) 考察了达到目标性能水平所需的数据量；(iii) 在不同资源条件下对比了结果；(iv) 探讨了相关任务——重音符号修正。我们发现微调的字符级LLM表现最佳，这可能是因为它们能够将复杂字符分解为其UTF-8字节表示。相比之下，大规模多语言模型在我们的数据限制下表现较差。在所有模型中，可靠性能开始出现的数据预算约为10,000个词。零样本方法在所有情况下表现不佳。本研究既回应了语言社区的要求，也回应了更广泛的NLP研究问题，即在资源匮乏环境中模型性能和泛化能力。

Summary / 总结

This paper investigates diacritic restoration for two low-resource languages, Bribri and Cook Islands Māori, comparing various algorithms and data requirements. It finds that fine-tuned character-level LLMs perform best, while massively multilingual models are less effective under the study's data constraints and zero-shot approaches perform poorly in all cases. Reliable performance starts to appear with data budgets of around 10,000 words.

本研究探讨了两种低资源语言Bribri和Cook Islands Māori的重音符号恢复，比较了不同算法和数据需求。研究发现，微调的字符级LLM表现最佳，大规模多语言模型在数据受限的情况下效果较差，而零样本方法效果不佳。可靠的性能在大约10,000个词的数据预算下开始显现。
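
To make the task and the abstract's remark about UTF-8 byte representations more concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It shows how a diacritic-free input can be paired with its original form as a restoration target, and how one accented character appears as UTF-8 bytes. The helper names (`strip_diacritics`, `make_pair`, `utf8_bytes`) are hypothetical, and this is not the authors' preprocessing pipeline.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. a-with-macron becomes plain 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def make_pair(text):
    """Build one (input, target) example for a restoration model."""
    return strip_diacritics(text), text


def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of integer byte values."""
    return list(text.encode("utf-8"))


word = "M\u0101ori"  # "Māori", written with an escape for U+0101 (a with macron)
print(make_pair(word))

# Precomposed form: a single code point encoded as two bytes.
print(utf8_bytes("\u0101"))
# Decomposed form: base letter 'a' followed by the combining macron U+0304.
print(utf8_bytes(unicodedata.normalize("NFD", "\u0101")))
```

A byte-level or character-level model sees the accented letter as a short sequence of units rather than one opaque token, which is one plausible reading of why such models suit this task. Diacritic-like characters that are not Unicode combining marks would need separate handling that this sketch does not attempt.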

Source-Optimal Training is Transfer-Suboptimal

Authors: C. Evans Hedges

First: 2025-11-11T16:16:10+00:00 · Latest: 2025-12-22T17:58:14+00:00

Abstract

We prove that training a source model optimally for its own task is generically suboptimal when the objective is downstream transfer. We study the source-side optimization problem in L2-SP ridge regression and show a fundamental mismatch between the source-optimal and transfer-optimal source regularization: outside of a measure-zero set, $\tau_0^* \neq \tau_S^*$. We characterize the transfer-optimal source penalty $\tau_0^*$ as a function of task alignment and identify an alignment-dependent reversal: with imperfect alignment ($0<\rho<1$), transfer benefits from stronger source regularization, while in super-aligned regimes ($\rho>1$), transfer benefits from weaker regularization. In isotropic settings, the decision of whether transfer helps is independent of the target sample size and noise, depending only on task alignment and source characteristics. We verify the linear predictions in a synthetic ridge regression experiment, and we present CIFAR-10 experiments as evidence that the source-optimal versus transfer-optimal mismatch can persist in nonlinear networks.

中文标题/摘要

标题：源数据最优训练是迁移亚最优的

我们证明，当目标是下游迁移时，对源模型在其自身任务上进行最优训练通常是亚最优的。我们研究了L2-SP岭回归中的源侧优化问题，并展示了源最优和迁移最优源正则化之间的根本性不匹配：在测度零集之外，$\tau_0^* \neq \tau_S^*$。我们将迁移最优的源惩罚$\tau_0^*$表示为任务对齐的函数，并识别出一种依赖对齐的反转：在不完全对齐的情况下（$0<\rho<1$），迁移从更强的源正则化中受益，而在超对齐的区域（$\rho>1$），迁移从较弱的正则化中受益。在各向同性设置中，迁移是否有益的决定因素与目标样本大小和噪声无关，仅取决于任务对齐和源特性。我们在合成岭回归实验中验证了线性预测，并通过CIFAR-10实验展示了源最优与迁移最优之间的不匹配可以持续存在于非线性网络中。

Summary / 总结

The paper demonstrates that optimizing a source model for its own task leads to suboptimal performance when the goal is to transfer to a downstream task. It analyzes the source-side optimization problem in L2-SP ridge regression and finds that the optimal source regularization for transfer learning differs from the optimal source regularization for the source task. The study identifies that with imperfect task alignment, stronger source regularization benefits transfer, whereas in super-aligned regimes, weaker regularization is better. In isotropic settings, the decision on whether transfer helps is determined by task alignment and source characteristics, not by target sample size or noise. The findings are supported by both synthetic ridge regression experiments and CIFAR-10 experiments with nonlinear networks.

论文表明，针对源任务进行最优训练通常不利于下游迁移任务。它在L2-SP岭回归中分析了源侧优化问题，并发现源最优和迁移最优的源正则化存在差异。迁移最优的源正则化取决于任务对齐情况，对于不完全对齐的情况，更强的正则化更有益，而对于超对齐的情况，更弱的正则化更有益。这些发现通过合成岭回归实验和CIFAR-10实验得到了验证，显示了这种差异在非线性网络中依然存在。
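
For readers unfamiliar with the setup, the following sketch illustrates a generic two-stage L2-SP ridge pipeline: a source model is fit with ridge penalty `tau`, and the target model is then fit with a penalty `lam` that pulls its weights toward the source solution. The closed forms are the standard ones for these objectives. The synthetic data, the parameter names, and the specific values are assumptions for illustration only; the paper's exact parametrization of $\tau_0^*$, $\tau_S^*$, and $\rho$ is not reproduced here.

```python
import numpy as np


def ridge_fit(X, y, tau):
    """Source stage: minimize ||y - X w||^2 + tau * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + tau * np.eye(d), X.T @ y)


def l2sp_fit(X, y, w_source, lam):
    """Target stage (L2-SP): minimize ||y - X w||^2 + lam * ||w - w_source||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_source)


# Hypothetical synthetic tasks: the target weights are a noisy, rescaled
# copy of the source weights, so the two tasks are only partly aligned.
rng = np.random.default_rng(0)
d = 5
w_src_true = rng.normal(size=d)
w_tgt_true = 0.8 * w_src_true + 0.2 * rng.normal(size=d)

X_s = rng.normal(size=(200, d))
y_s = X_s @ w_src_true + 0.1 * rng.normal(size=200)
X_t = rng.normal(size=(10, d))
y_t = X_t @ w_tgt_true + 0.1 * rng.normal(size=10)

# Sweep the source penalty and report both source and transfer error, so the
# two can be compared as functions of tau.
for tau in (0.0, 1.0, 10.0, 100.0):
    w_s = ridge_fit(X_s, y_s, tau)
    w_t = l2sp_fit(X_t, y_t, w_s, lam=1.0)
    src_err = float(np.sum((w_s - w_src_true) ** 2))
    tgt_err = float(np.sum((w_t - w_tgt_true) ** 2))
    print(f"tau={tau:6.1f}  source_err={src_err:.4f}  target_err={tgt_err:.4f}")
```

The sweep only shows how one might compare the two error curves; whether the minimizing `tau` differs between them in a given run depends on the data and is not asserted here.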

Exploring the features used for summary evaluation by Human and GPT

Authors: Zahra Sadeghi, Evangelos Milios, Frank Rudzicz

First: 2025-12-22T17:54:49+00:00 · Latest: 2025-12-22T17:54:49+00:00

Abstract

Summary assessment involves evaluating how well a generated summary reflects the key ideas and meaning of the source text, requiring a deep understanding of the content. Large Language Models (LLMs) have been used to automate this process, acting as judges to evaluate summaries with respect to the original text. While previous research investigated the alignment between LLM and human responses, it is not yet well understood what properties or features LLMs exploit when asked to evaluate along a particular quality dimension, and little attention has been paid to mapping between evaluation scores and metrics. In this paper, we address this issue and discover features aligned with human and Generative Pre-trained Transformer (GPT) responses by studying statistical and machine learning metrics. Furthermore, we show that instructing GPTs to employ metrics used by humans can improve their judgment and bring it into closer agreement with human responses.

中文标题/摘要

标题：探索用于摘要评估的人类和GPT所使用的特点

摘要评估涉及评估生成的摘要如何反映源文本的关键思想和意义，需要对内容有深刻的理解。大型语言模型（LLMs）已被用于自动化这一过程，作为裁判来根据原始文本评估摘要。尽管之前的研究调查了LLMs与人类反应的一致性，但尚不清楚在评估特定质量维度时，它们利用了哪些属性或特征，也没有太多关注评估分数与指标之间的映射。在本文中，我们解决了这一问题，并通过研究统计和机器学习指标发现与人类和生成式预训练变换器（GPTs）反应对齐的特征。此外，我们展示了指示GPTs使用人类使用的指标可以提高它们的判断力，并使它们更好地与人类反应一致。

Summary / 总结

This paper explores the features used by humans and GPTs to evaluate summaries, aiming to understand the alignment between human and machine evaluations. The authors use statistical and machine learning methods to identify these features and find that instructing GPTs to use metrics similar to human judgments can improve their evaluation accuracy and better align with human responses.

本文探讨了人类和GPT在评估摘要时使用的特点。通过使用统计和机器学习方法，研究发现了与人类和GPT响应对齐的特征，并展示了指导GPT使用人类使用的度量标准可以提高其评估质量，使其响应更加符合人类的判断。

From Easy to Hard: Progressive Active Learning Framework for Infrared Small Target Detection with Single Point Supervision

Authors: Chuang Yu, Jinmiao Zhao, Yunpeng Liu, Sicheng Zhao, Yimian Dai, Xiangyu Yue

Venue: ICCV 2025

First: 2024-12-15T11:08:49+00:00 · Latest: 2025-12-22T17:52:35+00:00

Comments: Accepted by ICCV 2025

Abstract

Recently, single-frame infrared small target (SIRST) detection with single point supervision has drawn widespread attention. However, the latest label evolution with single point supervision (LESPS) framework suffers from instability, excessive label evolution, and difficulty in exploiting the full performance of the embedded network. Inspired by organisms gradually adapting to their environment and continuously accumulating knowledge, we construct an innovative Progressive Active Learning (PAL) framework, which drives existing SIRST detection networks to progressively and actively recognize and learn harder samples. Specifically, to prevent an early, low-performance model from selecting the wrong hard samples, we propose a model pre-start concept, which automatically selects a portion of easy samples to give the model basic task-specific learning capabilities. Meanwhile, we propose a refined dual-update strategy, which promotes reasonable learning of harder samples and continuous refinement of pseudo-labels. In addition, to alleviate the risk of excessive label evolution, a decay factor is introduced, which helps to achieve a dynamic balance between the expansion and contraction of target annotations. Extensive experiments show that existing SIRST detection networks equipped with our PAL framework achieve state-of-the-art (SOTA) results on multiple public datasets. Furthermore, our PAL framework can build an efficient and stable bridge between full supervision and single point supervision tasks. Our code is available at https://github.com/YuChuang1205/PAL

中文标题/摘要

标题：从易到难：基于单点监督的渐进式主动学习框架用于红外小目标检测

近年来，单帧红外小目标（SIRST）检测在单点监督下引起了广泛关注。然而，最新的基于单点监督的标签演化（LESPS）框架存在不稳定、过度标签演化以及难以发挥嵌入网络性能的问题。受生物体逐渐适应环境并不断积累知识的启发，我们构建了一个创新的渐进式主动学习（PAL）框架，该框架能够逐步且主动地识别和学习更难的样本。具体而言，为了避免早期低性能模型导致错误选择难样本，我们提出了一个模型预启动概念，该概念侧重于自动选择一部分简单样本，帮助模型获得基本的任务特定学习能力。同时，我们提出了一个改进的双重更新策略，可以促进对更难样本的合理学习和伪标签的持续精炼。此外，为了缓解过度标签演化的风险，合理引入了衰减因子，这有助于在目标注释的扩展和收缩之间实现动态平衡。广泛的实验表明，配备我们PAL框架的现有SIRST检测网络在多个公开数据集上取得了最先进的（SOTA）结果。此外，我们的PAL框架可以建立从全监督任务到单点监督任务的高效且稳定的桥梁。我们的代码可在https://github.com/YuChuang1205/PAL获取

Summary / 总结

This paper addresses the instability and excessive label evolution issues in single-point supervision for infrared small target detection. It introduces a Progressive Active Learning (PAL) framework that helps the model learn progressively by first focusing on easy samples to build basic task-specific learning capabilities. The framework also includes a dual-update strategy and a decay factor to refine pseudo-labels and maintain a balance in label evolution. Experiments show that this approach achieves state-of-the-art results on multiple public datasets and provides a stable bridge between full and single-point supervision tasks.

本文提出了一种渐进式主动学习（PAL）框架，以解决单帧红外小目标检测中单点监督带来的挑战。该框架旨在逐步和主动地识别和学习更难的样本，克服现有LESPS框架中的不稳定性和过度标签演化问题。关键组件包括一种模型预启动概念，用于选择容易样本，以及一种双重更新策略来细化伪标签。实验表明，PAL框架在多个公开数据集上达到了最先进的结果，并提供了一种从全监督任务到单点监督任务的稳定桥梁。

MauBERT: Universal Phonetic Inductive Biases for Few-Shot Acoustic Units Discovery

Authors: Angelo Ortiz Tandazo, Manel Khentout, Youssef Benchekroun, Thomas Hueber, Emmanuel Dupoux

First: 2025-12-22T17:47:49+00:00 · Latest: 2025-12-22T17:47:49+00:00

Abs · PDF · Code1 · Code2

Abstract

This paper introduces MauBERT, a multilingual extension of HuBERT that leverages articulatory features for robust cross-lingual phonetic representation learning. We continue HuBERT pre-training with supervision based on a phonetic-to-articulatory feature mapping in 55 languages. Our models learn from multilingual data to predict articulatory features or phones, resulting in language-independent representations that capture multilingual phonetic properties. Through comprehensive ABX discriminability testing, we show MauBERT models produce more context-invariant representations than state-of-the-art multilingual self-supervised learning models. Additionally, the models effectively adapt to unseen languages and casual speech with minimal self-supervised fine-tuning (10 hours of speech). This establishes an effective approach for instilling linguistic inductive biases in self-supervised speech models.

中文标题/摘要

标题：MauBERT：面向少样本声学单元发现的通用语音归纳偏置

本文介绍了MauBERT，这是HuBERT的多语言扩展，利用发音特征进行稳健的跨语言语音表示学习。我们基于55种语言中从音素到发音特征的映射，以有监督的方式继续HuBERT的预训练。我们的模型从多语言数据中学习预测发音特征或音素，从而得到与语言无关的表示，能够捕捉多语言的语音特性。通过全面的ABX可区分性测试，我们表明MauBERT模型产生的表示比最先进的多语言自监督学习模型更具上下文不变性。此外，这些模型只需少量自监督微调（10小时语音）即可有效适应未见过的语言和日常口语。这为在自监督语音模型中注入语言学归纳偏置提供了一种有效的方法。

Summary / 总结

MauBERT is a multilingual extension of HuBERT that uses articulatory features to learn robust cross-lingual phonetic representations. By pre-training with phonetic-to-articulatory feature mapping in 55 languages, MauBERT models generate language-independent representations that capture multilingual phonetic properties. Experimental results show that MauBERT produces more context-invariant representations than existing models and can effectively adapt to unseen languages and casual speech with minimal fine-tuning.

MauBERT 是 HuBERT 的多语言扩展，利用发音特征来学习跨语言的稳健音素表示。通过在 55 种语言上进行基于音素到发音特征映射的预训练，MauBERT 模型生成了语言独立的表示，能够捕捉多语言音素特性。实验结果表明，MauBERT 模型生成的表示比现有自监督学习模型更具上下文不变性，并且可以通过少量的自监督微调（10 小时语音）适应未见过的语言和非正式语音。
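The ABX discriminability test mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' evaluation code: it assumes each speech token is a single fixed-length vector and uses Euclidean distance, whereas real ABX evaluations on speech typically compare variable-length frame sequences. The function name `abx_error_rate` and the sample vectors are hypothetical.

```python
import math


def abx_error_rate(category_a, category_b):
    """Fraction of (A, B, X) triples where X lies closer to B than to A.

    A and X are distinct items from category_a; B comes from category_b.
    Ties count as half an error. Lower is better.
    """
    errors = 0.0
    trials = 0
    for i, a in enumerate(category_a):
        for j, x in enumerate(category_a):
            if i == j:
                continue  # A and X must be different tokens of the same category
            for b in category_b:
                d_ax = math.dist(a, x)
                d_bx = math.dist(b, x)
                if d_ax > d_bx:
                    errors += 1.0
                elif d_ax == d_bx:
                    errors += 0.5
                trials += 1
    if trials == 0:
        raise ValueError("need at least two items in category_a and one in category_b")
    return errors / trials


# Hypothetical 2-D embeddings for two phone categories.
well_separated = abx_error_rate([(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0)])
confusable = abx_error_rate([(0.0, 0.0), (10.0, 0.0)], [(9.0, 0.0)])
```

A representation that keeps tokens of the same category close together regardless of context yields a low error rate, which is the sense in which the abstract calls representations "context-invariant".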

MapTrace: Scalable Data Generation for Route Tracing on Maps

Authors: Artemis Panagopoulou, Aveek Purohit, Achin Kulshrestha, Soroosh Yazdani, Mohit Goyal

First: 2025-12-22T17:45:39+00:00 · Latest: 2025-12-22T17:45:39+00:00

Abs · PDF · Code1 · Code2

Abstract

While Multimodal Large Language Models have achieved human-like performance on many visual and textual reasoning tasks, their proficiency in fine-grained spatial understanding, such as route tracing on maps, remains limited. Unlike humans, who can quickly learn to parse and navigate maps, current models often fail to respect fundamental path constraints, in part due to the prohibitive cost and difficulty of collecting large-scale, pixel-accurate path annotations. To address this, we introduce a scalable synthetic data generation pipeline that leverages synthetic map images and pixel-level parsing to automatically produce precise annotations for this challenging task. Using this pipeline, we construct a fine-tuning dataset of 23k path samples across 4k maps, enabling models to acquire more human-like spatial capabilities. Using this dataset, we fine-tune both open-source and proprietary MLLMs. Results on MapBench show that fine-tuning substantially improves robustness, raising success rates by up to 6.4 points, while also reducing path-tracing error (NDTW). These gains highlight that fine-grained spatial reasoning, absent in pretrained models, can be explicitly taught with synthetic supervision.

中文标题/摘要

标题：MapTrace：地图路线跟踪的可扩展数据生成

尽管多模态大型语言模型在许多视觉和文本推理任务上已经达到了人类水平的表现，但在地图上的细粒度空间理解，如路线跟踪方面的能力仍然有限。与人类能够快速学习解析和导航地图不同，当前的模型往往未能遵守基本的路径约束，部分原因是大规模、像素级准确路径注解的收集成本高昂且难度大。为了解决这一问题，我们引入了一种可扩展的合成数据生成流水线，该流水线利用合成地图图像和像素级解析来自动产生这一具有挑战性任务的精确注解。使用此流水线，我们构建了一个包含4000张地图上23000条路径样本的微调数据集，使模型能够获得更接近人类的空间能力。使用此数据集，我们对开源和专有MLLM进行了微调。MapBench上的结果显示，微调显著提高了鲁棒性，成功率提高了多达6.4个百分点，同时减少了路径跟踪误差（NDTW）。这些增益表明，预训练模型中缺乏的细粒度空间推理能力可以通过合成监督明确地进行教学。

Summary / 总结

The research aims to improve the spatial reasoning capabilities of Multimodal Large Language Models (MLLMs) in tasks such as route tracing on maps. To address the challenge of collecting pixel-accurate path annotations, the authors developed a scalable synthetic data generation pipeline. This pipeline produced a fine-tuning dataset of 23,000 path samples across 4,000 maps, which was used to fine-tune both open-source and proprietary MLLMs. The results showed that fine-tuning improved robustness, increasing success rates by up to 6.4 points and reducing path-tracing error (NDTW).

本研究旨在提升多模态大型语言模型（MLLM）的细粒度空间理解能力，特别是地图上的路线跟踪能力。针对像素级精确路径标注难以大规模收集的问题，作者开发了一个可扩展的合成数据生成流水线，利用合成地图图像和像素级解析自动生成精确标注。该流水线生成了覆盖4k张地图、包含23k条路径样本的微调数据集，并用于微调开源和专有模型。MapBench上的结果显示，微调显著提高了鲁棒性，成功率最多提升6.4个百分点，同时降低了路径跟踪误差（NDTW）。
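The abstract reports path-tracing error as NDTW without defining it. The sketch below shows one common way to compute a length-normalized dynamic time warping distance between a predicted and a reference path. The normalization (dividing by the reference length) and the function names are assumptions for illustration and may differ from the definition used in the paper.

```python
import math


def dtw_distance(pred, ref):
    """Classic dynamic time warping cost between two point sequences."""
    if not pred or not ref:
        raise ValueError("both paths must contain at least one point")
    n, m = len(pred), len(ref)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning pred[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pred[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def normalized_dtw(pred, ref):
    """DTW cost divided by the reference length (assumed normalization)."""
    return dtw_distance(pred, ref) / len(ref)


# Hypothetical pixel coordinates of a reference route and a prediction
# that runs parallel to it, one pixel off.
reference = [(0, 0), (1, 0), (2, 0)]
prediction = [(0, 1), (1, 1), (2, 1)]
error = normalized_dtw(prediction, reference)
```

Under this definition a prediction identical to the reference scores 0, and the error grows with the average deviation along the best alignment.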

SoK: Are Watermarks in LLMs Ready for Deployment?

Authors: Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen, Ruoming Jin, Abdallah Khreishah, My T. Thai

First: 2025-06-05T21:12:51+00:00 · Latest: 2025-12-22T17:37:53+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) have transformed natural language processing, demonstrating impressive capabilities across diverse tasks. However, deploying these models introduces critical risks related to intellectual property violations and potential misuse, particularly as adversaries can imitate these models to steal services or generate misleading outputs. We specifically focus on model stealing attacks, as they are highly relevant to proprietary LLMs and pose a serious threat to their security, revenue, and ethical deployment. While various watermarking techniques have emerged to mitigate these risks, it remains unclear how far the community and industry have progressed in developing and deploying watermarks in LLMs. To bridge this gap, we aim to develop a comprehensive systematization for watermarks in LLMs by 1) presenting a detailed taxonomy for watermarks in LLMs, 2) proposing a novel intellectual property classifier to explore the effectiveness and impacts of watermarks on LLMs under both attack and attack-free environments, 3) analyzing the limitations of existing watermarks in LLMs, and 4) discussing practical challenges and potential future directions for watermarks in LLMs. Through extensive experiments, we show that despite promising research outcomes and significant attention from leading companies and the community in deploying watermarks, these techniques have yet to reach their full potential in real-world applications due to their unfavorable impacts on the model utility of LLMs and on downstream tasks. Our findings provide an insightful understanding of watermarks in LLMs, highlighting the need for practical watermark solutions tailored to LLM deployment.

中文标题/摘要

标题：SoK：LLM中的水印是否已准备好部署？

大型语言模型（LLM）已经改变了自然语言处理，展示了在各种任务中令人印象深刻的性能。然而，在部署这些模型时，存在与知识产权侵权和潜在滥用相关的关键风险，特别是对手可以模仿这些模型以窃取服务或生成误导性输出。我们特别关注模型窃取攻击，因为它们对专有LLM具有高度相关性，并且对它们的安全性、收入和伦理部署构成了严重威胁。虽然已经出现了各种水印技术来缓解这些风险，但尚不清楚社区和行业在开发和部署LLM中的水印方面取得了多大进展。为了弥合这一差距，我们旨在通过1）提出LLM中水印的详细分类体系，2）提出一种新颖的知识产权分类器来探索水印在攻击和无攻击环境下的有效性和影响，3）分析现有LLM中水印的局限性，以及4）讨论水印在LLM中的实际挑战和潜在未来方向，来开发LLM中水印的全面系统化。通过广泛的实验，我们表明，尽管在部署水印方面取得了有希望的研究成果，并且领先公司和社区给予了大量关注，但由于这些技术对LLM模型实用性和下游任务的不利影响，它们尚未在实际应用中达到其全部潜力。我们的研究结果为LLM中的水印提供了深入的理解，突显了需要针对LLM部署定制的实际水印解决方案的必要性。

Summary / 总结

This paper aims to address the risks associated with deploying Large Language Models (LLMs), particularly focusing on model stealing attacks. The authors develop a comprehensive taxonomy for watermarks in LLMs and propose a novel intellectual property classifier to evaluate the effectiveness of watermarks. Despite promising research, the paper finds that current watermarking techniques have not fully realized their potential due to negative impacts on model utility and downstream tasks. The study highlights the need for practical watermarking solutions for LLM deployment.

论文旨在通过开发大型语言模型（LLM）中水印的综合系统化来应对部署风险，特别是模型窃取攻击。作者提出了详细分类法，提出了一种新的知识产权分类器，并分析了现有水印的局限性。通过大量实验表明，尽管水印显示出潜力，但它们目前对模型实用性和下游任务有不利影响，表明需要针对LLM部署的实际解决方案。

InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback

Authors: Boyuan Chen, Donghai Hong, Jiaming Ji, Jiacheng Zheng, Bowen Dong, Jiayi Zhou, Kaile Wang, Juntao Dai, Xuyao Wang, Wenqi Chen, Qirui Zheng, Wenxin Li, Sirui Han, Yike Guo, Yaodong Yang

First: 2025-05-29T19:00:42+00:00 · Latest: 2025-12-22T17:36:54+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

As multimodal large models (MLLMs) continue to advance across challenging tasks, a key question emerges: What essential capabilities are still missing? A critical aspect of human learning is continuous interaction with the environment -- not limited to language, but also involving multimodal understanding and generation. To move closer to human-level intelligence, models must similarly support multi-turn, multimodal interaction. In particular, they should comprehend interleaved multimodal contexts and respond coherently in ongoing exchanges. In this work, we present an initial exploration through InterMT -- the first preference dataset for multi-turn multimodal interaction, grounded in real human feedback. In this exploration, we particularly emphasize the importance of human oversight, introducing expert annotations to guide the process, motivated by the fact that current MLLMs lack such complex interactive capabilities. InterMT captures human preferences at both global and local levels across nine sub-dimensions and consists of 15.6k prompts, 52.6k multi-turn dialogue instances, and 32.4k human-labeled preference pairs. To compensate for the lack of capability for multi-modal understanding and generation, we introduce an agentic workflow that leverages tool-augmented MLLMs to construct multi-turn QA instances. To further this goal, we introduce InterMT-Bench to assess the ability of MLLMs in assisting judges with multi-turn, multimodal tasks. We demonstrate the utility of InterMT through applications such as judge moderation and further reveal the multi-turn scaling law of judge models. We hope that open-sourcing our data can help facilitate further research on aligning current MLLMs to the next step. Our project website can be found at https://pku-intermt.github.io.

中文标题/摘要

标题：InterMT：基于人类反馈的多轮交错偏好对齐

随着多模态大型模型（MLLMs）在各种挑战性任务中不断进步，一个关键问题出现了：还缺少哪些基本能力？人类学习的一个关键方面是与环境进行持续的互动——不仅限于语言，还包括多模态的理解和生成。为了更接近人类级别的智能，模型必须同样支持多轮、多模态的互动。特别是，它们应该理解交错的多模态上下文，并在持续的交流中做出连贯的回应。在这项工作中，我们通过InterMT进行初步探索——这是第一个用于多轮多模态互动的偏好数据集，基于真实的人类反馈。在这项探索中，我们特别强调了人类监督的重要性，引入了专家注释来指导过程，因为当前的MLLMs缺乏这种复杂的互动能力。InterMT在九个子维度上捕捉了人类在全局和局部的偏好，包含15600个提示、52600个多轮对话实例和32400个人类标注的偏好对。为了弥补多模态理解和生成能力的不足，我们引入了一种代理工作流，利用工具增强的MLLMs构建多轮问答实例。为了进一步实现这一目标，我们引入了InterMT-Bench来评估MLLMs在多轮、多模态任务中协助裁判的能力。我们通过诸如裁判调节等应用展示了InterMT的实用性，并进一步揭示了裁判模型的多轮扩展规律。我们希望开源数据能够帮助促进进一步研究，使当前的MLLMs向下一步迈进。我们的项目网站可以在https://pku-intermt.github.io 查看。

Summary / 总结

This work addresses the need for multimodal large language models (MLLMs) to support multi-turn, multimodal interaction, a capability crucial for achieving human-level intelligence. The authors introduce InterMT, a preference dataset for multi-turn multimodal interaction, grounded in real human feedback. InterMT includes 15.6k prompts, 52.6k dialogue instances, and 32.4k preference pairs, capturing human preferences at both global and local levels. The authors also introduce an agentic workflow and InterMT-Bench to evaluate MLLMs' ability to assist in multi-turn, multimodal tasks, demonstrating the utility of InterMT in applications such as judge moderation and revealing the multi-turn scaling law of judge models.

研究旨在解决多模态大型语言模型（MLLMs）在多轮多模态交互方面的能力缺失。为此，作者引入了InterMT，这是一个基于真实人类反馈的多轮多模态交互偏好数据集。该数据集包含15.6k个提示、52.6k个对话实例和32.4k个偏好对，在全局和局部两个层面捕捉人类偏好。作者还提出了一种利用工具增强的MLLMs构建多轮问答实例的代理工作流，并引入了InterMT-Bench来评估MLLMs在多轮多模态任务中协助裁判的能力。研究展示了InterMT在裁判调节等应用中的实用性，并揭示了裁判模型的多轮扩展规律。

No Data? No Problem: Robust Vision-Tabular Learning with Missing Values

Authors: Marta Hasny, Laura Daza, Keno Bressem, Maxime Di Folco, Julia Schnabel

First: 2025-12-22T17:35:32+00:00 · Latest: 2025-12-22T17:35:32+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large-scale medical biobanks provide imaging data complemented by extensive tabular information, such as demographics or clinical measurements. However, this abundance of tabular attributes does not reflect real-world datasets, where only a subset of attributes may be available. This discrepancy calls for methods that can leverage all the tabular data during training while remaining robust to missing values at inference. To address this challenge, we propose RoVTL (Robust Vision-Tabular Learning), a framework designed to handle any level of tabular data availability, from 0% to 100%. RoVTL comprises two key stages: contrastive pretraining, where we introduce tabular attribute missingness as data augmentation to promote robustness, and downstream task tuning using a gated cross-attention module for multimodal fusion. During fine-tuning, we employ a novel Tabular More vs. Fewer loss that ranks performance based on the amount of available tabular data. Combined with disentangled gradient learning, this enables consistent performance across all tabular data completeness scenarios. We evaluate RoVTL on cardiac MRI scans from the UK Biobank, demonstrating superior robustness to missing tabular data compared to prior methods. Furthermore, RoVTL successfully generalizes to an external cardiac MRI dataset for multimodal disease classification, and extends to the natural images domain, achieving robust performance on a car advertisements dataset. The code is available at https://github.com/marteczkah/RoVTL.

中文标题/摘要

标题：无数据？没问题：具有缺失值的稳健视觉-表格学习

大规模医学生物样本库提供影像数据，并辅以丰富的表格信息（如人口统计学数据或临床测量值）。然而，如此丰富的表格属性并不能反映现实世界的数据集，后者往往只有部分属性可用。这一差异要求方法既能在训练时利用全部表格数据，又能在推理时对缺失值保持稳健。为应对这一挑战，我们提出了RoVTL（Robust Vision-Tabular Learning，稳健视觉-表格学习），该框架旨在处理从0%到100%的任意表格数据可用程度。RoVTL包含两个关键阶段：对比预训练阶段，我们将表格属性缺失作为数据增强引入以提升鲁棒性；以及下游任务微调阶段，使用门控交叉注意力模块进行多模态融合。在微调期间，我们采用一种新颖的Tabular More vs. Fewer损失，根据可用表格数据的多少对性能进行排序。结合解耦梯度学习，这使得模型在所有表格数据完整度场景下都能保持一致的性能。我们在UK Biobank的心脏MRI扫描上评估了RoVTL，结果表明其对缺失表格数据的鲁棒性优于先前方法。此外，RoVTL成功泛化到一个外部心脏MRI数据集上的多模态疾病分类任务，并可扩展到自然图像领域，在汽车广告数据集上取得了稳健的性能。代码可在 https://github.com/marteczkah/RoVTL 获取。

Summary / 总结

The paper addresses the challenge of leveraging tabular data in vision tasks when some attributes are missing. It introduces RoVTL (Robust Vision-Tabular Learning), which includes contrastive pretraining with data augmentation for missing tabular attributes and a gated cross-attention module for multimodal fusion. The framework uses a novel loss function to rank performance based on available tabular data and disentangled gradient learning to maintain consistent performance. Experiments on cardiac MRI scans and car advertisements show that RoVTL outperforms previous methods in handling missing tabular data and generalizes well to different domains.

论文提出了RoVTL（稳健的视觉-表格学习）框架，该框架包括对比预训练和下游任务调优两个阶段，其中预训练阶段通过将表格属性缺失作为数据增强来提高鲁棒性，调优阶段使用门控跨注意力模块进行多模态融合。该框架在心脏MRI扫描等不同数据集和领域中展示了对缺失表格数据的优越鲁棒性和成功泛化能力。
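The idea of using tabular attribute missingness as data augmentation can be sketched as follows. This is an illustrative assumption, not the RoVTL implementation: it draws a missing rate uniformly from [0, 1] per sample and drops each attribute independently. The attribute names and the functions `mask_attributes` and `augment` are hypothetical.

```python
import random


def mask_attributes(row, missing_rate, rng):
    """Drop each attribute independently with probability missing_rate.

    Returns the masked row (dropped values become None) and a dict
    marking which attributes are still observed.
    """
    masked = {}
    observed = {}
    for name, value in row.items():
        keep = rng.random() >= missing_rate
        masked[name] = value if keep else None
        observed[name] = keep
    return masked, observed


def augment(row, rng):
    """Sample a missing rate in [0, 1) so training covers sparse to complete rows."""
    missing_rate = rng.random()
    return mask_attributes(row, missing_rate, rng)


rng = random.Random(0)
patient = {"age": 63, "bmi": 27.1, "systolic_bp": 130}  # hypothetical attributes
masked_row, observed_flags = augment(patient, rng)
```

Passing the observed flags alongside the masked values lets a downstream encoder distinguish a missing attribute from a genuine zero.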

Shape it Up! Restoring LLM Safety during Finetuning

Authors: ShengYun Peng, Pin-Yu Chen, Jianfeng Chi, Seongmin Lee, Duen Horng Chau

Venue: NeurIPS

First: 2025-05-22T18:05:16+00:00 · Latest: 2025-12-22T17:30:15+00:00

Comments: NeurIPS'25

Abs · PDF · Code1 · Code2 · Code3

Abstract

Finetuning large language models (LLMs) enables user-specific customization but introduces critical safety risks: even a few harmful examples can compromise safety alignment. A common mitigation strategy is to update the model more strongly on examples deemed safe, while downweighting or excluding those flagged as unsafe. However, because safety context can shift within a single example, updating the model equally on both harmful and harmless parts of a response is suboptimal, a coarse treatment we term static safety shaping. In contrast, we propose dynamic safety shaping (DSS), a framework that uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. To enable such fine-grained control during finetuning, we introduce a key insight: guardrail models, traditionally used for filtering, can be repurposed to evaluate partial responses, tracking how safety risk evolves throughout the response, segment by segment. This leads to the Safety Trajectory Assessment of Response (STAR), a token-level signal that enables shaping to operate dynamically over the training sequence. Building on this, we present STAR-DSS, guided by STAR scores, that robustly mitigates finetuning risks and delivers substantial safety improvements across diverse threats, datasets, and model families, all without compromising capability on intended tasks. We encourage future safety research to build on dynamic shaping principles for stronger mitigation against evolving finetuning risks. Our code is publicly available at https://github.com/poloclub/star-dss.

中文标题/摘要

标题：塑形！在微调期间恢复大语言模型的安全性

对大型语言模型（LLMs）进行微调可以实现用户特定的定制，但会引入关键的安全风险：即使是一些有害示例也可能破坏安全对齐。一种常见的缓解策略是更强烈地更新被认定为安全的示例，同时减少或排除标记为不安全的示例。然而，由于安全上下文在一个示例内部可能会发生变化，因此对响应中的有害和无害部分进行同等更新是不理想的——我们称之为静态安全性塑形。相反，我们提出了一种动态安全性塑形（DSS）框架，该框架利用细粒度的安全信号来强化从响应的安全部分中学习，同时抑制不安全的内容。为了在微调期间实现这种细粒度的控制，我们引入了一个关键见解：通常用于过滤的护栏模型可以重新用于评估部分响应，跟踪响应中安全风险如何逐段演变。这导致了响应安全性轨迹评估（STAR），一种标记级信号，使塑形能够在训练序列中动态地进行。在此基础上，我们提出了STAR-DSS，它根据STAR分数进行引导，能够稳健地缓解微调风险，并在各种威胁、数据集和模型家族中实现显著的安全改进，而不会牺牲在预期任务上的能力。我们鼓励未来的安全研究建立在动态塑形原则之上，以更有效地应对不断变化的微调风险。我们的代码已公开发布在https://github.com/poloclub/star-dss。

Summary / 总结

The paper addresses the safety risks during the finetuning of large language models (LLMs) by proposing a dynamic safety shaping (DSS) framework. This framework uses fine-grained safety signals to reinforce learning from safe segments of a response while suppressing unsafe content. The authors introduce a token-level signal called Safety Trajectory Assessment of Response (STAR) to enable dynamic shaping during finetuning. The results show that STAR-DSS robustly mitigates finetuning risks and improves safety across various threats, datasets, and model families without compromising the models' capability for intended tasks.

研究旨在通过提出动态安全塑形（DSS）方法来解决大型语言模型（LLMs）微调中的安全风险，该方法利用细粒度的安全信号，有选择地强化对响应中安全部分的学习，同时抑制不安全内容。关键方法是将护栏模型重新用于评估部分响应，从而得到词元级的响应安全轨迹评估（STAR）信号，以在微调过程中实现动态塑形。实验结果表明，STAR-DSS能够稳健地缓解微调风险，并在各种威胁、数据集和模型家族中显著提高安全性，同时不影响预期任务上的能力。
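The contrast between static and dynamic safety shaping can be illustrated with a toy loss. This sketch is an assumption made for illustration and is not the STAR-DSS objective: it simply weights each token's negative log-likelihood by a hypothetical per-token safety score in [0, 1], whereas the static variant applies one weight to the whole example. It only down-weights unsafe tokens; the paper's method also suppresses unsafe content, which this sketch does not model.

```python
import math


def static_shaped_nll(token_probs, example_weight):
    """One weight for the whole response (static shaping)."""
    nll = [-math.log(p) for p in token_probs]
    return example_weight * sum(nll) / len(nll)


def dynamic_shaped_nll(token_probs, token_safety):
    """One weight per token, e.g. from a guardrail model scoring partial responses."""
    if len(token_probs) != len(token_safety):
        raise ValueError("need one safety score per token")
    nll = [-math.log(p) for p in token_probs]
    return sum(s * l for s, l in zip(token_safety, nll)) / len(nll)


# Hypothetical response whose second half turns unsafe.
probs = [0.5, 0.5, 0.5, 0.5]     # model probability of each target token
safety = [1.0, 1.0, 0.0, 0.0]    # assumed per-token safety scores

static_loss = static_shaped_nll(probs, example_weight=1.0)
dynamic_loss = dynamic_shaped_nll(probs, safety)
```

With a single example-level weight the unsafe second half contributes as much as the safe first half; with per-token weights only the safe tokens drive the update.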

Enhancing Multi-Agent Collaboration with Attention-Based Actor-Critic Policies

Authors: Hugo Garrido-Lestache Belinchon, Jeremy Kedziora

First: 2025-07-30T15:48:38+00:00 · Latest: 2025-12-22T17:22:59+00:00

Comments: 11 pages

Abs · PDF · Code1 · Code2

Abstract

This paper introduces Team-Attention-Actor-Critic (TAAC), a reinforcement learning algorithm designed to enhance multi-agent collaboration in cooperative environments. TAAC employs a Centralized Training/Centralized Execution scheme incorporating multi-headed attention mechanisms in both the actor and critic. This design facilitates dynamic, inter-agent communication, allowing agents to explicitly query teammates, thereby efficiently managing the exponential growth of joint-action spaces while ensuring a high degree of collaboration. We further introduce a penalized loss function which promotes diverse yet complementary roles among agents. We evaluate TAAC in a simulated soccer environment against benchmark algorithms representing other multi-agent paradigms, including Proximal Policy Optimization and Multi-Agent Actor-Attention-Critic. We find that TAAC exhibits superior performance and enhanced collaborative behaviors across a variety of metrics (win rates, goal differentials, Elo ratings, inter-agent connectivity, balanced spatial distributions, and frequent tactical interactions such as ball possession swaps).

中文标题/摘要

标题：基于注意力机制的演员-评论家策略增强多智能体协作

本文介绍了团队注意力-演员-评论家（TAAC），这是一种用于增强合作环境中多智能体协作的强化学习算法。TAAC采用集中训练/集中执行方案，并在演员和评论家中引入多头注意力机制。这种设计促进了智能体之间的动态通信，使智能体能够明确查询队友，从而有效管理联合动作空间的指数增长，同时确保高度的协作。我们还引入了一种惩罚性损失函数，以促进智能体之间多样而互补的角色。我们在模拟足球环境中将TAAC与代表其他多智能体范式的基准算法（包括近端策略优化和多智能体演员-注意力-评论家）进行评估。我们发现，TAAC在多种指标（胜率、进球差、Elo评分、智能体间连接性、均衡的空间分布以及频繁的战术互动，如球权交换）上表现出更优的性能和增强的协作行为。