arXiv Paper Digest

VideoMaMa: Mask-Guided Video Matting via Generative Prior

Authors: Sangbeom Lim, Seoung Wug Oh, Jiahui Huang, Heeji Yoon, Seungryong Kim, Joon-Young Lee

First: 2026-01-20T18:59:56+00:00 · Latest: 2026-01-20T18:59:56+00:00

Comments: Project page: https://cvlab-kaist.github.io/VideoMaMa/

Abstract

Generalizing video matting models to real-world videos remains a significant challenge due to the scarcity of labeled data. To address this, we present Video Mask-to-Matte Model (VideoMaMa) that converts coarse segmentation masks into pixel accurate alpha mattes, by leveraging pretrained video diffusion models. VideoMaMa demonstrates strong zero-shot generalization to real-world footage, even though it is trained solely on synthetic data. Building on this capability, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which offers high-quality matting annotations for more than 50K real-world videos spanning diverse scenes and motions. To validate the effectiveness of this dataset, we fine-tune the SAM2 model on MA-V to obtain SAM2-Matte, which outperforms the same model trained on existing matting datasets in terms of robustness on in-the-wild videos. These findings emphasize the importance of large-scale pseudo-labeled video matting and showcase how generative priors and accessible segmentation cues can drive scalable progress in video matting research.

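Title / Abstract (translated from Chinese)

Title: VideoMaMa: Mask-Guided Video Matting via Generative Prior

Generalizing video matting models to real-world videos remains a major challenge because labeled data is scarce. To address this, we propose the Video Mask-to-Matte Model (VideoMaMa), which uses a pretrained video diffusion model to convert coarse segmentation masks into pixel-accurate alpha mattes. Although VideoMaMa is trained only on synthetic data, it shows strong zero-shot generalization to real-world footage. Building on this, we develop a scalable pseudo-labeling pipeline for large-scale video matting and construct the Matting Anything in Video (MA-V) dataset, which provides high-quality matting annotations for more than 50,000 real-world videos covering diverse scenes and motions. To validate the dataset, we fine-tune SAM2 on MA-V to obtain SAM2-Matte, which is more robust on in-the-wild videos than the same model trained on existing matting datasets. These findings underline the importance of large-scale pseudo-labeled video matting and show how generative priors and accessible segmentation cues can drive scalable progress in video matting research.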

Summary

VideoMaMa converts coarse segmentation masks into pixel-accurate alpha mattes for real-world videos by leveraging pretrained video diffusion models. Although trained solely on synthetic data, it generalizes zero-shot to real-world footage. Building on it, the authors develop a pseudo-labeling pipeline and construct the MA-V dataset, which offers high-quality matting annotations for more than 50K real-world videos. SAM2 fine-tuned on MA-V (SAM2-Matte) is more robust on in-the-wild videos than the same model trained on existing matting datasets.

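VideoMaMa is a model that uses a pretrained video diffusion model to convert coarse segmentation masks into pixel-accurate alpha mattes. Trained only on synthetic data, it shows strong zero-shot generalization to real-world video. SAM2 fine-tuned on the MA-V dataset, which provides high-quality matting annotations for more than 50K real-world videos, is more robust on in-the-wild videos than the same model trained on existing matting datasets.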

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

Authors: Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin

First: 2026-01-20T18:58:32+00:00 · Latest: 2026-01-20T18:58:32+00:00

Abstract

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.

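Title / Abstract (translated from Chinese)

Title: LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (such as PDFs) into clean, naturally ordered text without relying on brittle OCR pipelines. The model is trained on a large-scale, high-quality distillation mix with broad coverage of scans, French documents, and scientific PDFs. LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and much faster than the previous best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining through a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness through checkpoint averaging and task-arithmetic merging. We release the model checkpoints under the Apache 2.0 license and publicly release the dataset and the LightOnOCR-bbox-bench evaluation under their respective licenses.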
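
The abstract mentions IoU-based rewards for the bounding-box predictions. As a rough illustration of the quantity such a reward would be built on, the sketch below computes intersection-over-union for two axis-aligned boxes. The (x0, y0, x1, y1) box layout with coordinates normalized to [0, 1] and the function name `iou` are assumptions made here; the abstract does not describe the exact reward shaping.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes.

    Each box is (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
    This layout is an assumption for illustration only.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    # Two degenerate (zero-area) boxes are treated as having no overlap.
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and reference boxes in normalized coordinates.
predicted = (0.0, 0.0, 0.5, 1.0)
reference = (0.25, 0.0, 0.75, 1.0)
print(iou(predicted, reference))
```

A verifiable reward could, for example, use this value directly or threshold it; which variant the authors use is not stated in the abstract.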

OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Authors: Pengze Zhang, Yanze Wu, Mengtian Li, Xu Bai, Songtao Zhao, Fulong Ye, Chong Mou, Xinghui Li, Zhuowei Chen, Qian He, Mingyuan Gao

First: 2026-01-20T18:58:11+00:00 · Latest: 2026-01-20T18:58:11+00:00

Comments: Github Page: https://pangzecheung.github.io/OmniTransfer/

Abstract

Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.

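Title / Abstract (translated from Chinese)

Title: OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer

Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors and fail to fully exploit the rich spatio-temporal information inherent in videos, which limits the flexibility and generalization of video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It uses multi-view information across frames to improve appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify the various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias, which adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning, which separates the reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment, which uses multimodal semantic guidance to dynamically distinguish and handle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance transfer (identity and style) and temporal transfer (camera movement and video effects), and matches pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.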

Summary

OmniTransfer is a unified framework for spatio-temporal video transfer that addresses the limitations of existing methods by leveraging multi-view information and temporal cues. It incorporates three key designs: Task-aware Positional Bias for better temporal alignment and appearance consistency, Reference-decoupled Causal Learning for precise reference transfer and efficiency, and Task-adaptive Multimodal Alignment for dynamic task handling. Experimental results demonstrate that OmniTransfer outperforms existing methods in appearance and temporal transfer tasks, and matches pose-guided methods in motion transfer without using pose, setting a new standard for flexible and high-fidelity video generation.

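OmniTransfer is a unified framework for spatio-temporal video transfer that addresses the limitations of existing methods by exploiting multi-view information and temporal cues. It incorporates three key designs: Task-aware Positional Bias, Reference-decoupled Causal Learning, and Task-adaptive Multimodal Alignment. Experiments show that OmniTransfer outperforms existing methods in appearance and temporal transfer and, without using pose, is on par with pose-guided methods in motion transfer, demonstrating a new paradigm for flexible, high-fidelity video generation.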

Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Authors: Yuming Yang, Mingyoung Lai, Wanxu Zhao, Xiaoran Fan, Zhiheng Xi, Mingqi Wu, Chiyue Huang, Jun Zhao, Haijun Lv, Jian Tong, Yunhua Zhou, Yicheng Zou, Qipeng Guo, Tao Gui, Qi Zhang, Xuanjing Huang

First: 2026-01-20T18:58:10+00:00 · Latest: 2026-01-20T18:58:10+00:00

Comments: 26 pages. Project page: https://github.com/UmeanNever/RankSurprisalRatio

Abstract

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily yield better students, highlighting the importance of data-student suitability in distillation. Existing methods assess suitability primarily through student likelihood, favoring trajectories that closely align with the model's current behavior but overlooking more informative ones. Addressing this, we propose Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength and behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and is straightforward to compute and interpret. Across five student models and reasoning trajectories from 11 diverse teachers, RSR strongly correlates with post-training performance (average Spearman 0.86), outperforming existing metrics. We further demonstrate its practical utility in both trajectory selection and teacher selection.

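Title / Abstract (translated from Chinese)

Title: Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment

Long chain-of-thought (CoT) trajectories provide rich supervision signals for distilling reasoning from teacher to student LLMs. However, both prior work and our experiments show that trajectories from stronger teachers do not necessarily produce better students, which highlights the importance of data-student suitability in distillation. Existing methods assess suitability mainly through student likelihood, favoring trajectories that align closely with the model's current behavior while overlooking more informative ones. To address this, we propose the Rank-Surprisal Ratio (RSR), a simple metric that captures both alignment and informativeness when assessing the suitability of a reasoning trajectory. RSR is motivated by the observation that effective trajectories typically combine low absolute probability with relatively high-ranked tokens under the student model, balancing learning signal strength against behavioral alignment. Concretely, RSR is defined as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood, and it is easy to compute and interpret. Across five student models and reasoning trajectories from 11 different teachers, RSR correlates strongly with post-training performance (average Spearman correlation of 0.86), outperforming existing metrics. We further demonstrate its practical value in both trajectory selection and teacher selection.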

Summary

The study aims to evaluate the effectiveness of reasoning trajectories in teaching reasoning skills to student language models (LLMs) by proposing Rank-Surprisal Ratio (RSR), a metric that balances alignment and informativeness. RSR is found to strongly correlate with post-training performance (average Spearman 0.86) across various student models and teacher trajectories, outperforming existing methods. The metric is simple to compute and interpret, and its practical utility is demonstrated in trajectory and teacher selection.

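This paper examines the challenge of selecting effective reasoning trajectories when transferring reasoning ability from teacher to student language models. It introduces the Rank-Surprisal Ratio (RSR), a metric that balances alignment and informativeness and outperforms existing methods at predicting post-training performance. Experiments show that RSR correlates strongly with post-training performance (average Spearman correlation of 0.86).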
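
The abstract defines RSR as the ratio of a trajectory's average token-wise rank to its average negative log-likelihood under the student model. A minimal sketch of that definition follows. It assumes ranks are 1-indexed (rank 1 is the student's most likely token), ties are broken in the token's favor, and plain arithmetic means are used; the function name and input layout are hypothetical, and the paper may apply clipping or normalization not mentioned in the abstract.

```python
import math


def rank_surprisal_ratio(step_probs, token_ids):
    """Average token rank divided by average negative log-likelihood.

    step_probs: one probability distribution (list of floats) per position,
                as produced by the student model for the trajectory.
    token_ids:  the index of the trajectory's actual token at each position.
    """
    if len(step_probs) != len(token_ids) or not token_ids:
        raise ValueError("need one distribution per token, and at least one token")
    ranks = []
    surprisals = []
    for probs, tok in zip(step_probs, token_ids):
        p = probs[tok]
        # Rank = 1 + number of vocabulary entries the student prefers to this token.
        ranks.append(1 + sum(1 for q in probs if q > p))
        surprisals.append(-math.log(p))
    mean_rank = sum(ranks) / len(ranks)
    mean_surprisal = sum(surprisals) / len(surprisals)
    return mean_rank / mean_surprisal


# Toy three-token vocabulary and a two-token trajectory (made-up numbers).
toy_probs = [[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]]
toy_tokens = [1, 2]
print(rank_surprisal_ratio(toy_probs, toy_tokens))
```

The abstract does not say which direction of the score indicates a more suitable trajectory, so any selection rule built on this sketch should be checked against the paper.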

Soft Tail-dropping for Adaptive Visual Tokenization

Authors: Zeyuan Chen, Kai Zhang, Zhuowen Tu, Yuanjun Xiong

First: 2026-01-20T18:57:19+00:00 · Latest: 2026-01-20T18:57:19+00:00

Abstract

We present Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens per image according to its structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with per-token keep probabilities. Beyond standard autoencoder objectives, we regularize these keep probabilities to be monotonically decreasing along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields competitive or superior visual generation quality compared to other probabilistic model families, while also exhibiting favorable scaling behavior that has been elusive in prior vanilla AR visual generation attempts.

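Title / Abstract (translated from Chinese)

Title: Soft Tail-dropping for Adaptive Visual Tokenization

We present the Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adaptively chooses the number of output tokens for each image according to its structural complexity and level of detail. STAT encodes an image into a sequence of discrete codes together with a keep probability for each token. Beyond the standard autoencoder objectives, we regularize these keep probabilities to decrease monotonically along the sequence and explicitly align their distribution with an image-level complexity measure. As a result, STAT produces length-adaptive 1D visual tokens that are naturally compatible with causal 1D autoregressive (AR) visual generative models. On ImageNet-1k, equipping vanilla causal AR models with STAT yields visual generation quality that is competitive with or superior to other probabilistic model families, while also showing favorable scaling behavior that earlier vanilla AR visual generation attempts struggled to achieve.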

Summary

The research introduces the Soft Tail-dropping Adaptive Tokenizer (STAT), a 1D discrete visual tokenizer that adapts the number of tokens per image to the image's complexity. STAT outputs a sequence of discrete codes with per-token keep probabilities, which are regularized to decrease monotonically along the sequence and aligned with an image-level complexity measure. The resulting length-adaptive tokens are naturally compatible with causal 1D autoregressive models; on ImageNet-1k, vanilla causal AR models equipped with STAT achieve competitive or superior generation quality compared to other probabilistic model families and show favorable scaling behavior.

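The research introduces the Soft Tail-dropping Adaptive Tokenizer (STAT), which adaptively tokenizes images according to their complexity. STAT encodes an image into a sequence of discrete codes with monotonically decreasing keep probabilities that are aligned with image complexity. This improves the performance of causal autoregressive models in image generation, achieving competitive or superior results on ImageNet-1k compared with other probabilistic models, along with better scaling behavior.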
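
To make the idea of tail-dropping concrete, the sketch below shows one way monotonically decreasing keep probabilities could translate into a variable-length token prefix. The fixed cutoff `threshold` and the function names are assumptions for illustration; the abstract does not state how STAT turns keep probabilities into a final length.

```python
def adaptive_length(keep_probs, threshold=0.5):
    """Count the leading tokens whose keep probability is at least `threshold`.

    Because the keep probabilities are regularized to decrease along the
    sequence, the kept tokens form a prefix and everything after it is the
    dropped tail. The threshold rule itself is a hypothetical choice.
    """
    length = 0
    for p in keep_probs:
        if p < threshold:
            break
        length += 1
    return length


def tail_drop(codes, keep_probs, threshold=0.5):
    """Return the prefix of `codes` that survives tail-dropping."""
    if len(codes) != len(keep_probs):
        raise ValueError("codes and keep_probs must have the same length")
    return codes[:adaptive_length(keep_probs, threshold)]


# Made-up discrete codes and keep probabilities for two images.
simple_image = tail_drop([7, 3, 9, 1], [0.99, 0.90, 0.40, 0.10])
detailed_image = tail_drop([7, 3, 9, 1], [0.99, 0.97, 0.92, 0.80])
print(simple_image)    # [7, 3]
print(detailed_image)  # [7, 3, 9, 1]
```

Under this reading, an image judged simple keeps a short prefix while a detailed one keeps more tokens, which is what lets a causal autoregressive model consume sequences of different lengths.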

Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Authors: Haocheng Xi, Charlie Ruan, Peiyuan Liao, Yujun Lin, Han Cai, Yilong Zhao, Shuo Yang, Kurt Keutzer, Song Han, Ligeng Zhu

First: 2026-01-20T18:54:31+00:00 · Latest: 2026-01-20T18:54:31+00:00

Comments: 11 pages, 6 figures, 4 tables

Abstract

Reinforcement learning (RL) is essential for enhancing the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for over 70% of total training time. Quantized RL training, particularly using FP8 precision, offers a promising approach to mitigating this bottleneck. A commonly adopted strategy applies FP8 precision during rollout while retaining BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and demonstrate that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, thereby minimizing numerical discrepancies and eliminating the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to 33% speedup in the rollout phase, up to 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings and incurring negligible accuracy degradation.

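Title / Abstract (translated from Chinese)

Title: Jet-RL: Enabling On-Policy FP8 Reinforcement Learning with Unified Training and Rollout Precision Flow

Reinforcement learning (RL) is essential for improving the complex reasoning capabilities of large language models (LLMs). However, existing RL training pipelines are computationally inefficient and resource-intensive, with the rollout phase accounting for more than 70% of total training time. Quantized RL training with FP8 precision offers a promising way to relieve this bottleneck. A common strategy uses FP8 precision for rollout while keeping BF16 precision for training. In this work, we present the first comprehensive study of FP8 RL training and show that the widely used BF16-training + FP8-rollout strategy suffers from severe training instability and catastrophic accuracy collapse under long-horizon rollouts and challenging tasks. Our analysis shows that these failures stem from the off-policy nature of the approach, which introduces a substantial numerical mismatch between training and inference. Motivated by these observations, we propose Jet-RL, an FP8 RL training framework that enables robust and stable RL optimization. The key idea is to adopt a unified FP8 precision flow for both training and rollout, which minimizes numerical discrepancies and removes the need for inefficient inter-step calibration. Extensive experiments validate the effectiveness of Jet-RL: our method achieves up to a 33% speedup in the rollout phase, up to a 41% speedup in the training phase, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence across all settings with negligible accuracy degradation.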

Summary

This paper addresses the inefficiency of existing reinforcement learning (RL) training pipelines by proposing Jet-RL, an FP8 RL training framework. The method unifies precision flow for both training and rollout phases, reducing numerical discrepancies and improving stability. Experiments show that Jet-RL achieves up to 41% speedup in training, 33% speedup in rollout, and a 16% end-to-end speedup over BF16 training, while maintaining stable convergence and negligible accuracy degradation. The study highlights the limitations of the BF16-training + FP8-rollout strategy and demonstrates the benefits of a unified FP8 precision approach.

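The study targets the computational inefficiency of reinforcement learning (RL) training pipelines, in particular the rollout phase, which accounts for more than 70% of total training time. It introduces Jet-RL, a framework that unifies the precision flow of training and rollout, reducing numerical discrepancies and improving stability. Experiments show that Jet-RL achieves up to a 33% speedup in the rollout phase, up to a 41% speedup in the training phase, and a 16% end-to-end speedup, while maintaining stable convergence with negligible accuracy degradation. The key finding is that a unified FP8 precision flow across training and rollout improves both efficiency and stability.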

APEX-Agents

Authors: Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski

First: 2026-01-20T18:53:44+00:00 · Latest: 2026-01-20T18:53:44+00:00

Abstract

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.

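Title / Abstract (translated from Chinese)

Title: APEX-Agents

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments containing files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score, 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open-source the APEX-Agents benchmark, containing 480 test cases, with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.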

Summary

The research introduces APEX-Agents, a benchmark that evaluates whether AI agents can perform long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. Eight agents are assessed using Pass@1; Gemini 3 Flash achieves the highest score of 24.0%, followed by GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. The benchmark comprises 480 tasks with prompts, rubrics, gold outputs, files, and metadata, and is open-sourced along with Archipelago, the infrastructure for agent execution and evaluation.

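The research introduces the APEX-Agents benchmark, which evaluates the ability of AI agents to carry out the long-horizon, cross-application tasks performed by professionals such as investment bankers, management consultants, and corporate lawyers. Agents are evaluated with Pass@1: Gemini 3 Flash achieves the highest score of 24.0%, followed by GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. The benchmark comprises 480 tasks with accompanying prompts, rubrics, gold outputs, and metadata; it has been open-sourced together with Archipelago, the infrastructure used for evaluation.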
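
For readers unfamiliar with the metric, Pass@1 is the probability that a single attempt at a task succeeds, averaged over tasks. The sketch below estimates it from recorded attempts; the input layout and function name are hypothetical, and how many attempts per task the benchmark actually runs is not stated in the abstract.

```python
def pass_at_1(attempts_per_task):
    """Estimate Pass@1 as the mean per-task fraction of passing attempts.

    attempts_per_task: one list of booleans per task, True for a passing run.
    With exactly one attempt per task this is the fraction of tasks passed.
    """
    if not attempts_per_task:
        raise ValueError("need at least one task")
    per_task = [sum(attempts) / len(attempts) for attempts in attempts_per_task]
    return sum(per_task) / len(per_task)


# Made-up outcomes for three tasks (not data from the benchmark).
example = [[True], [False], [True, False]]
print(pass_at_1(example))  # 0.5
```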

Spatiotemporal Wildfire Prediction and Reinforcement Learning for Helitack Suppression

Authors: Shaurya Mathur, Shreyas Bellary Manjunath, Nitin Kulkarni, Alina Vereshchaka

Venue: IEEE ICMLA 2025

First: 2026-01-20T18:50:12+00:00 · Latest: 2026-01-20T18:50:12+00:00

Comments: 6 pages, 5 figures (two of them in tables), Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): https://www.icmla-conference.org/icmla25/

Abstract

Wildfires are growing in frequency and intensity, devastating ecosystems and communities while causing billions of dollars in suppression costs and economic damage annually in the U.S. Traditional wildfire management is mostly reactive, addressing fires only after they are detected. We introduce FireCastRL, a proactive artificial intelligence (AI) framework that combines wildfire forecasting with intelligent suppression strategies. Our framework first uses a deep spatiotemporal model to predict wildfire ignition. For high-risk predictions, we deploy a pre-trained reinforcement learning (RL) agent to execute real-time suppression tactics with helitack units inside a physics-informed 3D simulation. The framework generates a threat assessment report to help emergency responders optimize resource allocation and planning. In addition, we are publicly releasing a large-scale, spatiotemporal dataset containing 9.5 million samples of environmental variables for wildfire prediction. Our work demonstrates how deep learning and RL can be combined to support both forecasting and tactical wildfire response. More details can be found at https://sites.google.com/view/firecastrl.

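Title / Abstract (translated from Chinese)

Title: Spatiotemporal Wildfire Prediction and Reinforcement Learning for Helitack Suppression

Wildfires are increasing in frequency and intensity, devastating ecosystems and communities and causing billions of dollars in suppression costs and economic losses in the U.S. every year. Traditional wildfire management is mostly reactive, responding only after a fire has been detected. We introduce FireCastRL, a proactive artificial intelligence (AI) framework that combines wildfire forecasting with intelligent suppression strategies. The framework first uses a deep spatiotemporal model to predict wildfire ignition. For high-risk predictions, we deploy a pre-trained reinforcement learning (RL) agent that executes real-time suppression tactics with helitack units inside a physics-informed 3D simulation. The framework generates a threat assessment report to help emergency responders optimize resource allocation and planning. In addition, we publicly release a large-scale spatiotemporal dataset containing 9.5 million samples of environmental variables for wildfire prediction. Our work shows how deep learning and RL can be combined to support both forecasting and tactical wildfire response. More details are available at https://sites.google.com/view/firecastrl.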

Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration

Authors: LSST Dark Energy Science Collaboration, Eric Aubourg, Camille Avestruz, Matthew R. Becker, Biswajit Biswas, Rahul Biswas, Boris Bolliet, Adam S. Bolton, Clecio R. Bom, Raphaël Bonnet-Guerrini, Alexandre Boucaud, Jean-Eric Campagne, Chihway Chang, Aleksandra Ćiprijanović, Johann Cohen-Tanugi, Michael W. Coughlin, John Franklin Crenshaw, Juan C. Cuevas-Tello, Juan de Vicente, Seth W. Digel, Steven Dillmann, Mariano Javier de León Dominguez Romero, Alex Drlica-Wagner, Sydney Erickson, Alexander T. Gagliano, Christos Georgiou, Aritra Ghosh, Matthew Grayling, Kirill A. Grishin, Alan Heavens, Lindsay R. House, Mustapha Ishak, Wassim Kabalan, Arun Kannawadi, François Lanusse, C. Danielle Leonard, Pierre-François Léget, Michelle Lochner, Yao-Yuan Mao, Peter Melchior, Grant Merz, Martin Millon, Anais Möller, Gautham Narayan, Yuuki Omori, Hiranya Peiris, Laurence Perreault-Levasseur, Andrés A. Plazas Malagón, Nesar Ramachandra, Benjamin Remy, Cécile Roucelle, Jaime Ruiz-Zapatero, Stefan Schuldt, Ignacio Sevilla-Noarbe, Ved G. Shah, Tjitske Starkenburg, Stephen Thorp, Laura Toribio San Cipriano, Tilman Tröster, Roberto Trotta, Padma Venkatraman, Amanda Wasserman, Tim White, Justine Zeghal, Tianqing Zhang, Yuanyuan Zhang

First: 2026-01-20T18:46:42+00:00 · Latest: 2026-01-20T18:46:42+00:00

Comments: 84 pages. This is v1.0 of the DESC's white paper on AI/ML, a collaboration document that is being made public but which is not planned for submission to a journal

Abstract

The Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST) will produce unprecedented volumes of heterogeneous astronomical data (images, catalogs, and alerts) that challenge traditional analysis pipelines. The LSST Dark Energy Science Collaboration (DESC) aims to derive robust constraints on dark energy and dark matter from these data, requiring methods that are statistically powerful, scalable, and operationally reliable. Artificial intelligence and machine learning (AI/ML) are already embedded across DESC science workflows, from photometric redshifts and transient classification to weak lensing inference and cosmological simulations. Yet their utility for precision cosmology hinges on trustworthy uncertainty quantification, robustness to covariate shift and model misspecification, and reproducible integration within scientific pipelines. This white paper surveys the current landscape of AI/ML across DESC's primary cosmological probes and cross-cutting analyses, revealing that the same core methodologies and fundamental challenges recur across disparate science cases. Since progress on these cross-cutting challenges would benefit multiple probes simultaneously, we identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery. With an eye on emerging techniques, we also explore the potential of the latest foundation model methodologies and LLM-driven agentic AI systems to reshape DESC workflows, provided their deployment is coupled with rigorous evaluation and governance. Finally, we discuss critical software, computing, data infrastructure, and human capital requirements for the successful deployment of these new methodologies, and consider associated risks and opportunities for broader coordination with external actors.

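Title / Abstract (translated from Chinese)

Title: Opportunities in AI/ML for the Rubin LSST Dark Energy Science Collaboration

The Vera C. Rubin Observatory's Legacy Survey of Space and Time (LSST) will produce unprecedented volumes of heterogeneous astronomical data (images, catalogs, and alerts) that challenge traditional analysis pipelines. The LSST Dark Energy Science Collaboration (DESC) aims to derive robust constraints on dark energy and dark matter from these data, which requires methods that are statistically powerful, scalable, and operationally reliable. Artificial intelligence and machine learning (AI/ML) are already embedded across DESC science workflows, from photometric redshifts and transient classification to weak lensing inference and cosmological simulations. However, their usefulness for precision cosmology depends on trustworthy uncertainty quantification, robustness to covariate shift and model misspecification, and reproducible integration within scientific pipelines. This white paper surveys the current state of AI/ML across DESC's primary cosmological probes and cross-cutting analyses, revealing that the same core methodologies and fundamental challenges recur across different science cases. Because progress on these cross-cutting challenges would benefit multiple probes at once, we identify key methodological research priorities, including Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning for discovery. We also explore the potential of the latest foundation model methodologies and LLM-driven agentic AI systems to reshape DESC workflows, provided their deployment is coupled with rigorous evaluation and governance. Finally, we discuss the software, computing, data infrastructure, and human capital required to deploy these new methodologies successfully, and consider the associated risks and the opportunities for broader coordination with external actors.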

Summary

The Rubin LSST Dark Energy Science Collaboration (DESC) aims to derive robust constraints on dark energy and dark matter using AI/ML methods, which are already integrated into various science workflows. The key challenges include trustworthy uncertainty quantification, robustness to covariate shift, and reproducible integration within pipelines. The paper identifies methodological research priorities such as Bayesian inference at scale, physics-informed methods, validation frameworks, and active learning. It also explores the potential of foundation models and LLM-driven AI systems, emphasizing the need for rigorous evaluation and governance.

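The Vera C. Rubin Observatory's LSST will generate large volumes of astronomical data, which call for advanced AI/ML methods in dark energy and dark matter analyses. DESC uses AI/ML across a range of workflows but faces challenges in uncertainty quantification and model robustness. The paper identifies key research priorities, including scalable Bayesian inference and physics-informed methods, and explores the potential of foundation models and LLM-driven AI systems. It also stresses the need for strong software, computing, and data infrastructure to support these methods, and discusses the opportunities and risks of broader coordination with external actors.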

KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

Authors: Egor Cherepanov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov

First: 2026-01-20T18:44:28+00:00 · Latest: 2026-01-20T18:44:28+00:00

Comments: 38 pages, 44 figures, 3 tables

Abstract

Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, providing a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures, with background and photometric shifts often collapsing success, while agent-appearance shifts are comparatively benign. Several shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation enables up to 33M environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors. Code: https://avanturist322.github.io/KAGEBench/.

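Title / Abstract (translated from Chinese)

Title: KAGE-Bench: Fast Known-Axis Visual Generalization Evaluation for Reinforcement Learning

Pixel-based reinforcement learning agents often fail under purely visual distribution shift even when the latent dynamics and rewards are unchanged, but existing benchmarks entangle multiple sources of shift and hinder systematic analysis. We introduce KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes while keeping the underlying control problem fixed. By construction, varying a visual axis affects performance only through the induced state-conditional action distribution of a pixel policy, which provides a clean abstraction for visual generalization. Building on this environment, we define KAGE-Bench, a benchmark of six known-axis suites comprising 34 train-evaluation configuration pairs that isolate individual visual shifts. Using a standard PPO-CNN baseline, we observe strong axis-dependent failures: background and photometric shifts often collapse success, while agent-appearance shifts are comparatively benign. Some shifts preserve forward motion while breaking task completion, showing that return alone can obscure generalization failures. Finally, the fully vectorized JAX implementation reaches up to 33 million environment steps per second on a single GPU, enabling fast and reproducible sweeps over visual factors. Code: https://avanturist322.github.io/KAGEBench/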

Summary

The research aims to evaluate visual generalization in reinforcement learning agents by isolating different visual factors. KAGE-Env, a JAX-native platformer, is introduced to factorize the observation process into controllable visual axes while keeping the control problem constant. KAGE-Bench, a benchmark of six known-axis suites, is defined to isolate individual visual shifts. Experiments with a PPO-CNN baseline reveal strong axis-dependent failures, particularly for background and photometric shifts, while agent-appearance shifts are less problematic. The study also highlights that return alone can mask generalization failures, and the JAX implementation allows for fast and reproducible experiments.

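The research addresses the failure of pixel-based reinforcement learning agents under visual distribution shift when the latent dynamics and rewards remain unchanged. It introduces KAGE-Env, a JAX-native 2D platformer that factorizes the observation process into independently controllable visual axes so that visual generalization can be analyzed systematically. The KAGE-Bench benchmark comprises 34 train-evaluation configuration pairs across six known-axis suites and reveals strong axis-dependent failures, especially under background and photometric shifts, whereas agent-appearance shifts have less impact. The study also notes that return alone can mask generalization failures, and that the JAX implementation allows fast and reproducible experiments.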

MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Authors: Yiyang Wang, Yiqiao Jin, Alex Cabral, Josiah Hester

First: 2026-01-20T18:44:04+00:00 · Latest: 2026-01-20T18:44:04+00:00

Comments: 15 pages, 9 figures

Abstract

Multi-agent systems (MAS) have recently emerged as promising socio-collaborative companions for emotional and cognitive support. However, these systems frequently suffer from persona collapse (where agents revert to generic, homogenized assistant behaviors) and social sycophancy, which produces redundant, non-constructive dialogue. We propose MASCOT, a generalizable framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that finetunes individual agents for strict persona fidelity to prevent identity loss; and 2) Collaborative Dialogue Optimization, a meta-policy guided by group-level rewards to ensure diverse and productive discourse. Extensive evaluations across psychological support and workplace domains demonstrate that MASCOT significantly outperforms state-of-the-art baselines, achieving improvements of up to +14.1 in Persona Consistency and +10.6 in Social Contribution. Our framework provides a practical roadmap for engineering the next generation of socially intelligent multi-agent systems.

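Title / Abstract (translated from Chinese)

Title: MASCOT: Towards Multi-Agent Socio-Collaborative Companion Systems

Multi-agent systems (MAS) have recently shown great promise as socio-collaborative companions for emotional and cognitive support. However, these systems often suffer from persona collapse, in which agents revert to generic, homogenized assistant behaviors, and from social sycophancy, which produces redundant, non-constructive dialogue. We propose MASCOT, a generalizable framework for multi-perspective socio-collaborative companions. MASCOT introduces a novel bi-level optimization strategy to harmonize individual and collective behaviors: 1) Persona-Aware Behavioral Alignment, an RLAIF-driven pipeline that fine-tunes individual agents for strict persona fidelity to prevent identity loss; and 2) Collaborative Dialogue Optimization, a meta-policy guided by group-level rewards to ensure diverse and productive discourse. Extensive evaluations across psychological support and workplace domains show that MASCOT significantly outperforms state-of-the-art baselines, with improvements of up to +14.1 in Persona Consistency and +10.6 in Social Contribution. Our framework provides a practical roadmap for engineering the next generation of socially intelligent multi-agent systems.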

Summary

The research addresses persona collapse and social sycophancy in multi-agent systems (MAS) used as socio-collaborative companions. It proposes MASCOT, a bi-level optimization framework that harmonizes individual and collective behaviors. The framework combines Persona-Aware Behavioral Alignment, which maintains distinct agent personas, with Collaborative Dialogue Optimization, which fosters diverse and productive conversations. Across psychological support and workplace domains, MASCOT outperforms existing methods, with improvements of up to +14.1 in Persona Consistency and +10.6 in Social Contribution.

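MASCOT is a framework designed to address persona collapse and social sycophancy in multi-agent systems. It uses a bi-level optimization strategy: RLAIF-based Persona-Aware Behavioral Alignment preserves the persona of each individual agent, and Collaborative Dialogue Optimization ensures effective group interaction. Evaluations show improvements of up to +14.1 in Persona Consistency and up to +10.6 in Social Contribution.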

Attention-Based Offline Reinforcement Learning and Clustering for Interpretable Sepsis Treatment

Authors: Punit Kumar, Vaibhav Saran, Divyesh Patel, Nitin Kulkarni, Alina Vereshchaka

Venue: IEEE ICMLA 2025

First: 2026-01-20T18:41:44+00:00 · Latest: 2026-01-20T18:41:44+00:00

Comments: 8 pages, 6 figures, Conference: IEEE International Conference on Machine Learning and Applications 2025 (ICMLA 2025): https://www.icmla-conference.org/icmla25/

Abstract

Sepsis remains one of the leading causes of mortality in intensive care units, where timely and accurate treatment decisions can significantly impact patient outcomes. In this work, we propose an interpretable decision support framework. Our system integrates four core components: (1) a clustering-based stratification module that categorizes patients into low, intermediate, and high-risk groups upon ICU admission, using clustering with statistical validation; (2) a synthetic data augmentation pipeline leveraging variational autoencoders (VAE) and diffusion models to enrich underrepresented trajectories such as fluid or vasopressor administration; (3) an offline reinforcement learning (RL) agent trained using Advantage Weighted Regression (AWR) with a lightweight attention encoder and supported by an ensemble of models for conservative, safety-aware treatment recommendations; and (4) a rationale generation module powered by a multi-modal large language model (LLM), which produces natural-language justifications grounded in clinical context and retrieved expert knowledge. Evaluated on the MIMIC-III and eICU datasets, our approach achieves high treatment accuracy while providing clinicians with interpretable and robust policy recommendations.

中文标题/摘要

标题：基于注意力的离线强化学习和聚类以实现可解释的脓毒症治疗

脓毒症仍然是重症监护病房中导致死亡的主要原因之一，及时和准确的治疗决策可以显著影响患者结果。在本文中，我们提出了一种可解释的决策支持框架。我们的系统整合了四个核心组件：(1) 一种基于聚类的分层模块，在ICU入院时将患者分为低、中、高风险组，使用聚类和统计验证；(2) 一种利用变分自编码器（VAE）和扩散模型的合成数据增强管道，以丰富如液体或血管加压素给药等代表性不足的轨迹；(3) 一种使用优势加权回归（AWR）训练的离线强化学习（RL）代理，配备轻量级注意力编码器，并由集成模型支持，以提供保守、安全的治疗建议；(4) 一种由多模态大型语言模型（LLM）驱动的解释生成模块，该模块生成基于临床背景和检索专家知识的自然语言解释。在MIMIC-III和eICU数据集上评估，我们的方法在提供可解释和稳健的治疗建议方面取得了高治疗准确性。

Summary / 总结

This study addresses the challenge of sepsis treatment in intensive care units by proposing an interpretable decision support framework. The framework includes a clustering module for patient stratification, a synthetic data augmentation pipeline, an offline reinforcement learning agent, and a rationale generation module. The offline RL agent uses Advantage Weighted Regression with a lightweight attention encoder and ensemble models to provide conservative and safety-aware treatment recommendations. The approach demonstrates high treatment accuracy and offers interpretable policy recommendations, as evaluated on MIMIC-III and eICU datasets.

研究旨在为重症监护病房中的败血症治疗开发一个可解释的决策支持框架。方法包括将患者按风险分组、增强未充分代表的治疗轨迹、使用AWR和注意力编码器训练离线强化学习代理，并使用多模态大语言模型生成解释。关键发现表明，该方法具有高治疗准确性和可解释的策略建议。
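
The Advantage Weighted Regression step named in the abstract can be illustrated with a minimal sketch. It assumes the common AWR form in which each logged action's log-likelihood is weighted by an exponentiated advantage. The temperature `beta`, the clipping value `max_weight`, and both function names are hypothetical choices for illustration; the paper's attention encoder and ensemble are not modeled here.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponentiated-advantage weights, clipped for numerical stability.

    beta and max_weight are illustrative assumptions, not values from the paper.
    """
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def awr_loss(log_probs, advantages, beta=1.0, max_weight=20.0):
    """Weighted negative log-likelihood of the logged actions (batch mean)."""
    weights = awr_weights(advantages, beta, max_weight)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)
```

Actions with a higher estimated advantage receive a larger weight, so the policy is pulled toward logged clinician actions that did better than average without ever querying actions outside the dataset.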

Zebra-Llama: Towards Extremely Efficient Hybrid Models

Authors: Mingyu Yang, Mehdi Rezagholizadeh, Guihong Li, Vikram Appia, Emad Barsoum

First: 2025-05-22T20:39:57+00:00 · Latest: 2026-01-20T18:39:03+00:00

Abstract

With the growing demand for deploying large language models (LLMs) across diverse applications, improving their inference efficiency is crucial for sustainable and democratized access. However, retraining LLMs to meet new user-specific requirements is prohibitively expensive and environmentally unsustainable. In this work, we propose a practical and scalable alternative: composing efficient hybrid language models from existing pre-trained models. Our approach, Zebra-Llama, introduces a family of 1B, 3B, and 8B hybrid models by combining State Space Models (SSMs) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline to efficiently transfer knowledge from pre-trained Transformers. Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7-11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size, down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively, while preserving 100%, 100%, and >97% of average zero-shot performance on LM Harness tasks. Compared to models like MambaInLlama, X-EcoMLA, Minitron, and Llamba, Zebra-Llama consistently delivers competitive or superior accuracy while using significantly fewer tokens, smaller teachers, and vastly reduced KV cache memory. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% while using 8x fewer training tokens, over 12x smaller KV cache, and a smaller teacher (8B vs. 15B). It also achieves 2.6x-3.8x higher throughput (tokens/s) than MambaInLlama up to a 32k context length. We will release code and model checkpoints upon acceptance.

中文标题/摘要

标题：斑马-羊驼：朝极其高效的混合模型迈进

随着对部署大型语言模型（LLMs）在多种应用中的需求增长，提高其推理效率对于可持续和普及化访问至关重要。然而，重新训练LLMs以满足新的用户特定需求是极其昂贵且环境不可持续的。在本工作中，我们提出了一种实用且可扩展的替代方案：通过现有预训练模型组成高效的混合语言模型。我们的方法Zebra-Llama通过结合状态空间模型（SSMs）和多头潜在注意力（MLA）层，引入了1B、3B和8B的混合模型系列，并通过改进的初始化和后训练管道高效地将知识从预训练变换器中转移出来。Zebra-Llama仅使用7-11B训练令牌（与预训练所需的数万亿令牌相比）和8B教师，实现了变换器级别的准确度，同时接近SSM的效率。此外，Zebra-Llama将KV缓存大小分别减少到1B、3B和8B变体的3.9%、2%和2.73%，同时保留了100%、100%和>97%的平均零样本性能。与MambaInLLaMA、X-EcoMLA、Minitron和Llamba等模型相比，Zebra-Llama在使用更少的令牌、更小的教师和大幅减少的KV缓存内存的情况下，持续提供竞争力或更优的准确度。值得注意的是，Zebra-Llama-8B在少样本准确度上比Minitron-8B高出7%，同时使用8倍少的训练令牌、超过12倍小的KV缓存和更小的教师（8B vs. 15B）。它还实现了2.6-3.8倍更高的吞吐量（每秒令牌数），在32k上下文长度时比MambaInLlama高。我们将在接受后发布代码和模型检查点。

Summary / 总结

Zebra-Llama aims to improve the inference efficiency of large language models (LLMs) for sustainable deployment. It composes hybrid models from existing pre-trained Transformers by combining State Space Model (SSM) and Multi-head Latent Attention (MLA) layers, using a refined initialization and post-training pipeline that needs only 7-11B training tokens. The resulting 1B, 3B, and 8B hybrid models reach Transformer-level accuracy while cutting the KV cache to 3.9%, 2%, and 2.73% of the original size, respectively, and using smaller teachers. Notably, Zebra-Llama-8B surpasses Minitron-8B in few-shot accuracy by 7% and achieves 2.6x-3.8x higher throughput than MambaInLlama up to a 32k context length.

Zebra-Llama旨在提高大型语言模型（LLMs）的效率，以实现可持续部署。它提出了一种通过结合State Space Models和Multi-head Latent Attention层来创建混合模型的方法，并通过减少训练令牌数量进行初始化和微调。最终生成的1B、3B和8B混合模型在准确度上与全规模的Transformer相当或更优，同时使用了更少的训练令牌、更小的教师模型和更少的KV缓存内存。值得注意的是，Zebra-Llama-8B在少量样本准确性上优于Minitron-8B，并且在32k上下文长度下比MambaInLlama的吞吐量高出2.6到3.8倍。
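
To give a feel for what a KV cache at 2.73% of the original size means in practice, the following back-of-the-envelope sketch computes the cache footprint of a standard grouped-query-attention Transformer and scales it by the reported ratio. The configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values, 32k tokens) is an assumption chosen to resemble a Llama-3.1-8B-style model; the paper's exact baseline may differ.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 accounts for storing both K and V in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed baseline configuration (hypothetical, Llama-3.1-8B-like).
baseline_bytes = kv_cache_bytes(
    num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768
)

# Ratio reported in the abstract for the 8B variant.
hybrid_bytes = baseline_bytes * 0.0273

print(f"baseline: {baseline_bytes / 2**30:.2f} GiB")
print(f"hybrid:   {hybrid_bytes / 2**20:.1f} MiB")
```

Under these assumptions the baseline cache is 4 GiB per 32k-token sequence, so the reported ratio would bring it down to roughly 112 MiB.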

AnyTask: an Automated Task and Data Generation Framework for Advancing Sim-to-Real Policy Learning

Authors: Ran Gong, Xiaohan Zhang, Jinghuan Shang, Maria Vittoria Minniti, Jigarkumar Patel, Valerio Pepe, Riedana Yan, Ahmet Gundogdu, Ivan Kapelyukh, Ali Abbas, Xiaoqiang Yan, Harsh Patel, Laura Herlant, Karl Schmeckpeper

First: 2025-12-19T17:55:48+00:00 · Latest: 2026-01-20T18:25:48+00:00

Comments: 28 pages, 25 figures. The first four authors contributed equally

Abstract

Generalist robot learning remains constrained by data: large-scale, diverse, and high-quality interaction data are expensive to collect in the real world. While simulation has become a promising way for scaling up data collection, the related tasks, including simulation task design, task-aware scene generation, expert demonstration synthesis, and sim-to-real transfer, still demand substantial human effort. We present AnyTask, an automated framework that pairs massively parallel GPU simulation with foundation models to design diverse manipulation tasks and synthesize robot data. We introduce three AnyTask agents for generating expert demonstrations aiming to solve as many tasks as possible: 1) ViPR, a novel task and motion planning agent with VLM-in-the-loop Parallel Refinement; 2) ViPR-Eureka, a reinforcement learning agent with generated dense rewards and LLM-guided contact sampling; 3) ViPR-RL, a hybrid planning and learning approach that jointly produces high-quality demonstrations with only sparse rewards. We train behavior cloning policies on generated data, validate them in simulation, and deploy them directly on real robot hardware. The policies generalize to novel object poses, achieving 44% average success across a suite of real-world pick-and-place, drawer opening, contact-rich pushing, and long-horizon manipulation tasks. Our project website is at https://anytask.rai-inst.com.

中文标题/摘要

标题：AnyTask：一种自动化任务和数据生成框架，用于推进从仿真到现实的策略学习

通用型机器人学习仍然受到数据的限制：在现实世界中收集大规模、多样性和高质量的交互数据成本高昂。虽然仿真已成为扩展数据收集规模的一种有前景的方法，但相关的任务，包括仿真任务设计、任务感知场景生成、专家演示合成以及仿真到现实的转移，仍然需要大量的人力投入。我们提出了AnyTask，这是一种自动化框架，将大规模并行GPU仿真与基础模型相结合，设计多样化的操作任务并合成机器人数据。我们介绍了三个AnyTask代理，用于生成尽可能多任务的专家演示：1) ViPR，一种具有VLM在环并行细化的新颖任务和运动规划代理；2) ViPR-Eureka，一种基于生成密集奖励和LLM引导接触采样的强化学习代理；3) ViPR-RL，一种结合规划和学习的混合方法，仅使用稀疏奖励即可生成高质量的演示。我们在生成的数据上训练行为克隆策略，在仿真中验证它们，并直接部署到真实机器人硬件上。策略在新的物体姿态上泛化，实现了在一系列真实世界拾取放置、抽屉打开、接触丰富的推拉和长时操作任务中44%的平均成功率。我们的项目网站为https://anytask.rai-inst.com。

Summary / 总结

AnyTask is an automated framework that uses GPU simulation and foundation models to generate diverse manipulation tasks and robot data. It includes three agents: ViPR for task and motion planning, ViPR-Eureka for reinforcement learning with generated rewards, and ViPR-RL for hybrid planning and learning. The framework trains behavior cloning policies on simulated data and deploys them on real robots, achieving 44% average success across various manipulation tasks in real-world scenarios.

AnyTask 是一个自动化框架，利用 GPU 模拟和基础模型生成多样化的操作任务和机器人数据。它包含三个代理：ViPR 用于任务和运动规划，ViPR-Eureka 用于带有生成奖励的强化学习，ViPR-RL 用于结合规划和学习。该框架在生成的数据上训练行为克隆策略，在模拟中验证，并直接部署到真实机器人硬件上，实现了各种操作任务中 44% 的平均成功率。

HALT: Hallucination Assessment via Latent Testing

Authors: Rohan Bhatnagar, Youran Sun, Chi Andrew Zhang, Yixin Wen, Haizhao Yang

First: 2026-01-20T18:16:10+00:00 · Latest: 2026-01-20T18:16:10+00:00

Abstract

Hallucination in large language models (LLMs) can be understood as a failure of faithful readout: although internal representations may encode uncertainty about a query, decoding pressures still yield a fluent answer. We propose lightweight residual probes that read hallucination risk directly from intermediate hidden states of question tokens, motivated by the hypothesis that these layers retain epistemic signals that are attenuated in the final decoding stage. The probe is a small auxiliary network whose computation is orders of magnitude cheaper than token generation and can be evaluated fully in parallel with inference, enabling near-instantaneous hallucination risk estimation with effectively zero added latency in low-risk cases. We deploy the probe as an agentic critic for fast selective generation and routing, allowing LLMs to immediately answer confident queries while delegating uncertain ones to stronger verification pipelines. Across four QA benchmarks and multiple LLM families, the method achieves strong AUROC and AURAC, generalizes under dataset shift, and reveals interpretable structure in intermediate representations, positioning fast internal uncertainty readout as a principled foundation for reliable agentic AI.

中文标题/摘要

标题：HALT: 通过潜在测试评估幻觉

大型语言模型（LLMs）中的幻觉可以理解为忠实读取的失败：尽管内部表示可能对查询存在不确定性，但解码压力仍然会产生流畅的答案。我们提出了一种轻量级的残差探针，可以直接从问题标记的中间隐藏状态中读取幻觉风险，这基于假设这些层保留了在最终解码阶段被减弱的元信号。探针是一个小型辅助网络，其计算量比标记生成便宜数个数量级，并且可以在推理过程中完全并行评估，从而在低风险情况下实现近乎即时的幻觉风险估计，且几乎不会增加额外的延迟。我们将探针部署为一种代理批评家，用于快速选择性生成和路由，使LLMs能够立即回答自信的查询，同时将不确定的查询委托给更强的验证管道。在四个问答基准和多个LLM家族中，该方法实现了强大的AUROC和AURAC，具有数据集转移下的泛化能力，并揭示了中间表示中的可解释结构，将快速内部不确定性读取定位为可靠代理AI的原理性基础。

Summary / 总结

HALT assesses hallucination risk in large language models by reading directly from intermediate hidden states of question tokens using lightweight residual probes. These probes are inexpensive to compute and can be evaluated in parallel with inference, enabling rapid hallucination risk estimation. The method shows strong performance across various QA benchmarks and LLM families, and reveals interpretable structure in intermediate representations, suggesting a principled approach for reliable AI systems.

HALT通过直接读取问题标记的中间隐藏状态来评估大型语言模型中的幻觉风险，使用轻量级的残差探针。这些探针计算成本低廉，并且可以与推理并行评估，从而实现快速的幻觉风险估计。该方法在多个QA基准和LLM家族中表现出色，并揭示了中间表示中的可解释结构，表明这是一种可靠的AI系统的基本原则方法。
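
The probe-then-route idea can be sketched as follows. This is a simplified stand-in, not the paper's architecture: it uses a single logistic readout over one hidden-state vector, whereas the abstract describes a small auxiliary residual network. The names `probe_risk` and `route`, the `threshold` value, and the two route labels are all hypothetical.

```python
import math


def probe_risk(hidden_state, weights, bias):
    """Logistic readout of hallucination risk from one hidden-state vector.

    A linear probe is assumed here for brevity; the paper's probe is a
    small auxiliary network.
    """
    z = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def route(hidden_state, weights, bias, threshold=0.5):
    """Answer directly when risk is low, otherwise hand off for verification."""
    risk = probe_risk(hidden_state, weights, bias)
    return "answer" if risk < threshold else "verify"
```

Because the readout only needs the hidden states of the question tokens, it can in principle run alongside generation, which is what makes the low-risk path nearly free.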

InT: Self-Proposed Interventions Enable Credit Assignment in LLM Reasoning

Authors: Matthew Y. R. Yang, Hao Bai, Ian Wu, Gene Yang, Amrith Setlur, Aviral Kumar

First: 2026-01-20T18:15:38+00:00 · Latest: 2026-01-20T18:15:38+00:00

Abstract

Outcome-reward reinforcement learning (RL) has proven effective at improving the reasoning capabilities of large language models (LLMs). However, standard RL assigns credit only at the level of the final answer, penalizing entire reasoning traces when the outcome is incorrect and uniformly reinforcing all steps when it is correct. As a result, correct intermediate steps may be discouraged in failed traces, while spurious steps may be reinforced in successful ones. We refer to this failure mode as the problem of credit assignment. While a natural remedy is to train a process reward model, accurately optimizing such models to identify corrective reasoning steps remains challenging. We introduce Intervention Training (InT), a training paradigm in which the model performs fine-grained credit assignment on its own reasoning traces by proposing short, targeted corrections that steer trajectories toward higher reward. Using reference solutions commonly available in mathematical reasoning datasets and exploiting the fact that verifying a model-generated solution is easier than generating a correct one from scratch, the model identifies the first error in its reasoning and proposes a single-step intervention to redirect the trajectory toward the correct solution. We then apply supervised fine-tuning (SFT) to the on-policy rollout up to the point of error concatenated with the intervention, localizing error to the specific step that caused failure. We show that the resulting model serves as a far better initialization for RL training. After running InT and subsequent fine-tuning with RL, we improve accuracy by nearly 14% over a 4B-parameter base model on IMO-AnswerBench, outperforming larger open-source models such as gpt-oss-20b.

中文标题/摘要

标题：InT：自我提议的干预措施在LLM推理中实现信用分配

结果奖励强化学习（RL）已被证明能有效提升大型语言模型（LLMs）的推理能力。然而，标准RL仅在最终答案层面分配信用，当结果错误时惩罚整个推理过程，当结果正确时则均匀强化所有步骤。因此，在失败的推理过程中，正确的中间步骤可能会被抑制，而在成功的推理过程中，虚假步骤可能会被强化。我们将这种失败模式称为信用分配问题。虽然自然的补救措施是训练一个过程奖励模型，但准确优化此类模型以识别纠正的推理步骤仍然具有挑战性。我们引入了干预训练（InT），这是一种训练范式，在这种范式中，模型对其自身的推理过程进行细粒度的信用分配，通过提出简短、有针对性的修正来引导轨迹向更高的奖励方向发展。利用数学推理数据集中通常可用的参考解决方案，并利用验证模型生成的解决方案比从头开始生成正确的解决方案更容易的事实，模型识别其推理中的第一个错误，并提出一个单一步骤的干预措施，将轨迹引导到正确的解决方案。然后，我们对包含错误点及其干预措施的策略进行监督微调（SFT），将错误定位到导致失败的具体步骤。我们展示了这种模型作为RL训练的良好初始化的效果。在对IMO-AnswerBench进行InT和后续的RL微调后，我们比4B参数的基础模型提高了近14%的准确性，超过了更大的开源模型如gpt-oss-20b。

Summary / 总结

The paper addresses the issue of credit assignment in large language models (LLMs) where standard reinforcement learning (RL) methods penalize or reward entire reasoning traces. To solve this, the authors propose Intervention Training (InT), where the model self-corrects its reasoning steps by proposing short interventions to steer towards correct solutions. This method improves accuracy by nearly 14% on the IMO-AnswerBench dataset compared to a 4B-parameter base model, surpassing larger models like gpt-oss-20b.

论文针对大型语言模型（LLMs）中标准强化学习（RL）方法存在的信用分配问题，即整个推理过程要么被惩罚要么被强化，可能导致正确步骤被抑制而错误步骤被强化。为此，作者提出了一种干预训练（InT）方法，该方法让模型自行提出短小、针对性的修正来调整其推理轨迹。通过识别第一个错误并提出单步干预，模型使用监督微调（SFT）来纠正轨迹。这种方法在IMO-AnswerBench数据集上将准确率提高了近14%，超过了4B参数的基础模型，也优于更大的开源模型如gpt-oss-20b。
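
The construction of an InT training example, as described in the abstract, can be sketched in a few lines: keep the on-policy steps up to the first error, then append the proposed single-step intervention. The function name, the dictionary keys, and the newline-joined step format are assumptions for illustration; locating the first error and writing the intervention are done by the model itself in the paper and are taken as given inputs here.

```python
def build_int_example(problem, steps, first_error_index, intervention):
    """Build one SFT example: correct prefix of the rollout plus the intervention.

    steps[first_error_index] is the first erroneous step; it and everything
    after it are dropped and replaced by the single corrective step.
    """
    if not 0 <= first_error_index < len(steps):
        raise ValueError("first_error_index must point at a step in the rollout")
    prefix = steps[:first_error_index]
    return {
        "prompt": problem,
        "completion": "\n".join(prefix + [intervention]),
    }
```

For instance, with steps `["s1", "s2", "s3"]`, an error at index 1, and an intervention `"fix"`, the completion contains `s1` followed by `fix`, so the supervised signal is localized to the step that caused the failure.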

Copy-Trasform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

Authors: Rotem Gatenyo, Ohad Fried

First: 2026-01-20T18:12:55+00:00 · Latest: 2026-01-20T18:12:55+00:00

Abstract

We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation -- an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of the soft Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.

中文标题/摘要

标题：复制-变换-粘贴：由视觉语言和几何约束引导的零样本对象对齐

我们研究了使用描述两个给定网格空间关系的文本提示来进行零样本3D对齐的问题——这是内容创作和场景组装中的一个基本能力。早期的方法主要依赖于几何对齐过程，而最近的工作则利用预训练的2D扩散模型来建模语言条件下的对象间空间关系。相比之下，我们直接在测试时优化相对姿态，通过可微渲染器使用CLIP驱动的梯度更新平移、旋转和各向同性缩放，而无需训练新的模型。我们的框架通过几何感知的目标增强语言监督：一种软ICP项的变体以鼓励表面附着，以及一个穿透损失以防止相互穿插。分阶段的时间表随着时间的推移加强接触约束，而相机控制则将优化集中在交互区域。为了便于评估，我们整理了一个包含多种类别和关系的基准，并与基线进行比较。我们的方法优于所有替代方案，产生了语义上忠实且物理上合理的对齐。

Summary / 总结

The research aims to achieve zero-shot 3D alignment of two given meshes using a text prompt, which is crucial for content creation and scene assembly. The method directly optimizes the relative pose at test time using CLIP-driven gradients via a differentiable renderer, incorporating geometry-aware objectives such as a soft-ICP term and a penetration loss. The approach outperforms existing methods, producing semantically faithful and physically plausible alignments. A benchmark with diverse categories and relations was created to evaluate the performance, showing superior results compared to baselines.

研究旨在通过文本提示实现两个给定网格的零样本3D对齐，这对于内容创作和场景组装至关重要。方法直接在测试时优化相对姿态，使用CLIP驱动的梯度并通过可微渲染器，结合几何感知目标，如软ICP项和穿透损失。该方法优于现有方法，产生语义上忠实且物理上合理的对齐。创建了一个包含多种类别和关系的基准来评估性能，结果显示优于基线方法。
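
How a phased schedule might combine the language term with the geometric terms can be sketched as below. Everything here is an illustrative assumption: the linear ramp, the squared-depth penetration penalty, the use of precomputed signed distances (negative meaning inside the other object), and all function names. The paper's actual loss definitions and schedule are not given in the abstract.

```python
def penetration_loss(signed_distances):
    """Mean squared penetration depth; only points with negative distance count."""
    depths = [max(0.0, -d) for d in signed_distances]
    return sum(x * x for x in depths) / len(depths)


def contact_weight(step, total_steps, start=0.0, end=1.0):
    """Linearly ramp the contact-term weight from start to end over the run."""
    t = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return start + t * (end - start)


def total_loss(clip_loss, attach_loss, signed_distances, step, total_steps):
    """Language term plus geometry terms whose influence grows over time."""
    w = contact_weight(step, total_steps)
    return clip_loss + w * (attach_loss + penetration_loss(signed_distances))
```

Early steps are then driven mostly by the language-conditioned term, while later steps increasingly enforce attachment and non-penetration.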

DiffusionAgent: Navigating Expert Models for Agentic Image Generation

Authors: Jie Qin, Jie Wu, Weifeng Chen, Yueming Lyu

First: 2024-01-18T15:30:58+00:00 · Latest: 2026-01-20T18:02:51+00:00

Abstract

In the accelerating era of human-instructed visual content creation, diffusion models have demonstrated remarkable generative potential. Yet their deployment is constrained by a dual bottleneck: semantic ambiguity in diverse prompts and the narrow specialization of individual models. A single diffusion architecture struggles to maintain optimal performance across heterogeneous prompts, while conventional "parse-then-call" pipelines artificially separate semantic understanding from generative execution. To bridge this gap, we introduce DiffusionAgent, a unified, language-model-driven agent that casts the entire "prompt comprehension-expert routing-image synthesis" loop into an agentic framework. Our contributions are three-fold: (1) a tree-of-thought-powered expert navigator that performs fine-grained semantic parsing and zero-shot matching to the most suitable diffusion model via an extensible prior-knowledge tree; (2) an advantage database updated with human-in-the-loop feedback, continually aligning model-selection policy with human aesthetic and semantic preferences; and (3) a fully decoupled agent architecture that activates the optimal generative path for open-domain prompts without retraining or fine-tuning any expert. Extensive experiments show that DiffusionAgent retains high generation quality while significantly broadening prompt coverage, establishing a new performance and generality benchmark for multi-domain image synthesis. The code is available at https://github.com/DiffusionAgent/DiffusionAgent

中文标题/摘要

标题：DiffusionAgent：导航专家模型进行有代理的图像生成

在加速的人工指令视觉内容创作时代，扩散模型展示了显著的生成潜力。然而，它们的部署受到双重瓶颈的限制：多变提示中的语义模糊性和个体模型的狭窄专业化。单一的扩散架构难以在异构提示下保持最佳性能，而传统的“解析-调用”管道人为地将语义理解与生成执行分离。为弥合这一差距，我们引入了DiffusionAgent，这是一种统一的语言模型驱动代理，将整个“提示理解-专家导航-图像合成”循环转化为一个有代理的框架。我们的贡献包括三个方面：(1) 一种基于思维树的专家导航器，进行精细的语义解析和零样本匹配，通过可扩展的先验知识树选择最合适的扩散模型；(2) 一个不断更新的人类在环反馈优势数据库，持续调整模型选择策略以符合人类的审美和语义偏好；以及(3) 一个完全解耦的代理架构，无需重新训练或微调任何专家即可为开放域提示激活最佳生成路径。大量实验表明，DiffusionAgent 在保持高质量生成的同时显著扩展了提示覆盖范围，为多域图像合成建立了新的性能和通用性基准。代码可在 https://github.com/DiffusionAgent/DiffusionAgent 获取

Summary / 总结

DiffusionAgent addresses the limitations of diffusion models in handling diverse prompts and narrow specialization by introducing a unified language-model-driven agent. This agent combines semantic parsing with an extensible prior-knowledge tree and an advantage database updated with human feedback to route prompts to the most suitable diffusion model. Experimental results demonstrate that DiffusionAgent maintains high generation quality and significantly expands prompt coverage, setting a new benchmark for multi-domain image synthesis.

DiffusionAgent通过引入一个统一的语言模型驱动代理来解决扩散模型在处理多样提示和狭窄专业化方面的局限性。该代理结合了语义解析、零样本匹配和可扩展的先验知识树，以导航到最适合的扩散模型。此外，它使用一个通过人工反馈更新的优势数据库来使模型选择与人类偏好保持一致。实验结果表明，DiffusionAgent保持了高质量的生成效果，并显著扩展了提示覆盖范围，为多领域图像合成设定了新的基准。

KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

Authors: Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott

Venue: NeurIPS 2025

First: 2025-04-21T18:12:46+00:00 · Latest: 2026-01-20T17:55:29+00:00

Comments: 37 pages, 19 figures, NeurIPS 2025

Abstract

We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity with attention scores. These results imply KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04% with an 8K cache budget (~23% KV cache reduction) from the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B. We also observe near-baseline performance for Deepseek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30% compared to other token-eviction methods.

中文标题/摘要

标题：KeyDiff：基于键相似性的KV缓存淘汰方法以应对资源受限环境中的长上下文LLM推理

我们证明，在LLM推理过程中，几何上独特的键往往具有较高的注意力分数。基于这一现象，我们提出了KeyDiff，一种仅基于键相似性的无需训练的KV缓存淘汰方法。与其它KV缓存淘汰方法不同，KeyDiff可以在严格的资源限制内处理任意长的提示，并高效生成响应。我们通过将键多样性与注意力分数联系起来，为KeyDiff提供了理论基础。这些结果表明，KeyDiff可以有效地识别需要保留的最重要令牌。值得注意的是，KeyDiff不依赖于注意力分数，允许使用优化的注意力机制如FlashAttention。在严格的内存限制下，我们通过在LongBench上观察到Llama 3.1-8B和Llama 3.2-3B模型家族中与非淘汰基线相比性能差距小于0.04%，且缓存预算为8K（约23%的KV缓存减少）来证明KeyDiff的有效性。我们还观察到，在Deepseek-R1-Distill-Llama-8B上接近基线性能，并将端到端推理延迟降低了高达30%，相比其他令牌淘汰方法。

Summary / 总结

KeyDiff is a training-free KV cache eviction method based solely on key similarity, designed for long-context LLM inference in resource-constrained environments. Because it does not rely on attention scores, it can process arbitrarily long prompts under strict memory limits and remains compatible with optimized attention mechanisms like FlashAttention. With an 8K cache budget (about a 23% KV cache reduction), the performance gap from the non-evicting baseline on LongBench is less than 0.04% for Llama 3.1-8B and Llama 3.2-3B, and end-to-end inference latency decreases by up to 30% compared to other token-eviction methods.

KeyDiff 是一种基于键相似性的训练-free KV 缓存淘汰方法，适用于资源受限环境下的长上下文 LLM 推断。它能够高效处理长输入并生成响应，无需依赖注意力分数，允许使用优化的注意力机制。KeyDiff 在 LongBench 上对 Llama 3.1-8B 和 Llama 3.2-3B 的非淘汰基线相比，性能差距小于 0.04%，KV 缓存大小减少了约 23%，并且与其它 token 淘汰方法相比，端到端推理延迟最多可减少 30%。
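
Eviction driven only by key similarity can be illustrated with a small sketch. It assumes one plausible scoring rule: compare each key with the mean direction of all cached keys and keep the keys that are least similar to it, on the premise stated in the abstract that geometrically distinctive keys tend to attract high attention. The anchor choice and function names are assumptions; the paper's exact rule may differ.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def select_keys_to_keep(keys, budget):
    """Return sorted indices of the `budget` most distinctive keys.

    Distinctiveness is measured as low cosine similarity to the mean key
    direction (an assumed anchor, used here for illustration).
    """
    unit = [_normalize(k) for k in keys]
    dim = len(unit[0])
    anchor = _normalize([sum(u[d] for u in unit) / len(unit) for d in range(dim)])
    sims = [sum(a * b for a, b in zip(u, anchor)) for u in unit]
    order = sorted(range(len(keys)), key=lambda i: sims[i])
    return sorted(order[:budget])
```

No attention scores are needed at any point, which is why a rule of this kind stays compatible with fused attention kernels that never materialize the attention matrix.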

Semantic Alignment of Multilingual Knowledge Graphs via Contextualized Vector Projections

Authors: Abhishek Kumar

First: 2025-12-22T11:02:30+00:00 · Latest: 2026-01-20T17:52:30+00:00

Abstract

This paper presents a cross-lingual ontology alignment system that uses embedding-based cosine similarity matching. Ontology entities are made contextually richer by creating descriptions using novel techniques. We use a fine-tuned transformer-based multilingual model to generate better embeddings. We use cosine similarity to find positive ontology entity pairs and then apply threshold filtering to retain only highly similar entities. We evaluated our work on the OAEI-2022 MultiFarm track. We achieve a 71% F1 score (78% recall and 65% precision) on the evaluation dataset, a 16% increase over the best baseline score. This suggests that our proposed alignment pipeline is able to capture subtle cross-lingual similarities.

中文标题/摘要

标题：多语言知识图谱语义对齐的上下文向量投影方法

论文介绍了我们基于嵌入的余弦相似度匹配的跨语言本体对齐系统的工作。通过使用新颖的技术创建描述，使本体实体更具上下文丰富性。我们使用微调的多语言变压器模型生成更好的嵌入。我们使用余弦相似度找到正本体实体对，然后应用阈值过滤以保留仅高度相似的实体。我们在OAEI-2022多农场赛道上评估了我们的工作。我们在评估数据集上达到了71%的F1分数（召回率为78%，精确率为65%），比最佳基线分数提高了16%。这表明我们提出的对齐流水线能够捕捉到微妙的跨语言相似性。

Summary / 总结

The research aims to improve cross-lingual ontology alignment by enriching ontology entities with contextually rich descriptions and using a fine-tuned transformer-based multilingual model for better embeddings. The method involves cosine similarity matching and threshold filtering to retain highly similar entity pairs. The system achieved an F1 score of 71% on the OAEI-2022 multifarm track, a 16% improvement over the best baseline score, indicating its effectiveness in capturing subtle cross-lingual similarities.

研究旨在通过使用微调的多语言变压器模型生成富有上下文的实体描述来改进跨语言本体对齐。方法使用余弦相似度匹配本体实体，并应用阈值过滤保留高度相似的配对。系统在OAEI-2022多农场赛道上的F1得分为71%（召回率为78%，精确率为65%），比最佳基线得分提高了16%，表明有效捕捉了跨语言的相似性。
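
The matching step and the reported metric can be sketched as follows. The sketch assumes entity embeddings are already available as plain vectors and picks, for each source entity, its best target match above a threshold. The threshold value, the toy vectors, and the function names are illustrative assumptions. The last function checks that the reported precision and recall are consistent with the reported F1.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(source, target, threshold=0.8):
    """Pair each source entity with its most similar target, if similar enough."""
    pairs = []
    for s_name, s_vec in source.items():
        best_name, best_sim = max(
            ((t_name, cosine(s_vec, t_vec)) for t_name, t_vec in target.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            pairs.append((s_name, best_name))
    return pairs


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

With the figures from the abstract, `f1_score(0.65, 0.78)` evaluates to about 0.709, which rounds to the reported 71%.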

Toward Efficient Agents: Memory, Tool learning, and Planning

Authors: Xiaofang Yang, Lijun Li, Heng Zhou, Tong Zhu, Xiaoye Qu, Yuchen Fan, Qianshan Wei, Rui Ye, Li Kang, Yiran Qin, Zhiqiang Kou, Daizong Liu, Qi Li, Ning Ding, Siheng Chen, Jing Shao

First: 2026-01-20T17:51:56+00:00 · Latest: 2026-01-20T17:51:56+00:00

Comments: 35 pages, 200 references

Abstract

Recent years have witnessed increasing interest in extending large language models into agentic systems. While the effectiveness of agents has continued to improve, efficiency, which is crucial for real-world deployment, has often been overlooked. This paper therefore investigates efficiency across three core components of agents: memory, tool learning, and planning, considering costs such as latency, tokens, and steps. Aiming at a comprehensive treatment of the efficiency of the agentic system itself, we review a broad range of recent approaches that differ in implementation yet frequently converge on shared high-level principles, including but not limited to bounding context via compression and management, designing reinforcement learning rewards to minimize tool invocation, and employing controlled search mechanisms to enhance efficiency, which we discuss in detail. Accordingly, we characterize efficiency in two complementary ways: comparing effectiveness under a fixed cost budget, and comparing cost at a comparable level of effectiveness. This trade-off can also be viewed through the Pareto frontier between effectiveness and cost. From this perspective, we also examine efficiency-oriented benchmarks by summarizing evaluation protocols for these components and consolidating commonly reported efficiency metrics from both benchmark and methodological studies. Moreover, we discuss the key challenges and future directions, with the goal of providing promising insights.

中文标题/摘要

标题：向高效智能体迈进：记忆、工具学习与规划

近年来，人们越来越关注将大型语言模型扩展为智能系统。尽管智能体的有效性不断提高，但对其实际部署至关重要的效率却常常被忽视。因此，本文从智能体的三个核心组件——记忆、工具学习和规划——出发，考虑延迟、令牌、步骤等成本，旨在全面研究智能系统本身的效率。我们回顾了多种不同的近期方法，尽管实现方式不同，但经常遵循一些共同的高层次原则，包括通过压缩和管理限制上下文、设计强化学习奖励以减少工具调用，以及采用受控搜索机制以提高效率，我们对此进行了详细讨论。我们从两种互补的方式定义效率：在固定成本预算下比较有效性，以及在相似有效性水平下比较成本。从这个角度来看，我们还通过效率与成本之间的帕累托前沿来审视效率导向的基准测试，总结了这些组件的评估协议，并汇总了来自基准测试和方法论研究的常用效率指标。此外，我们讨论了关键挑战和未来方向，旨在提供有价值的见解。

Summary / 总结

This paper investigates the efficiency of agentic systems by focusing on memory, tool learning, and planning. It reviews various approaches that aim to reduce costs such as latency and tokens while maintaining effectiveness. The study characterizes efficiency in two ways: effectiveness under a fixed cost budget and cost at a comparable level of effectiveness. It also examines efficiency-oriented benchmarks and discusses key challenges for future research.

本文探讨了通过关注记忆、工具学习和规划来提高代理系统的效率。它回顾了旨在减少延迟和令牌等成本的各种方法，并讨论了如何通过压缩边界上下文、设计强化学习奖励以及采用受控搜索机制来提高效率。该研究从固定成本预算下的有效性以及在相似有效性水平下的成本两个方面来表征效率，并通过效率与成本之间的帕累托前沿来审视这一权衡。此外，还总结了基准和方法论研究中的评估协议和效率指标，并指出了该领域的关键挑战和未来方向。
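
The Pareto view of the effectiveness-cost trade-off can be made concrete with a short sketch. Each agent configuration is a `(name, cost, effectiveness)` tuple, and a configuration is kept only if no other one is at least as cheap and at least as effective while being strictly better on one of the two. The tuple layout and the function name are assumptions for illustration.

```python
def pareto_frontier(points):
    """Return names of configurations not dominated on (low cost, high effectiveness).

    points: iterable of (name, cost, effectiveness) tuples.
    """
    points = list(points)
    frontier = []
    for name, cost, eff in points:
        dominated = any(
            (c <= cost and e >= eff) and (c < cost or e > eff)
            for _, c, e in points
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

Both characterizations in the abstract are slices of this frontier: fixing a cost budget reads off the best achievable effectiveness, and fixing an effectiveness level reads off the lowest cost.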

The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

Authors: Sangmitra Madhusudan, Kaige Chen, Ali Emami

First: 2025-10-23T13:30:40+00:00 · Latest: 2026-01-20T17:46:36+00:00

Comments: 9 pages (excluding references), accepted to EACL 2026 Main Conference

Abstract

When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.

中文标题/摘要

标题：猫追的那只狗难倒了模型：测量语言模型何时放弃结构转而使用捷径

当语言模型正确解析“the cat that the dog chased meowed”时，它们是在分析句法结构还是仅仅熟悉狗追猫的情况？尽管进行了广泛的基准测试，但我们缺乏区分结构理解与语义模式匹配的方法。我们引入了CenterBench数据集，包含9,720个关于中心嵌套句子（如“The cat [that the dog chased] meowed”）的阅读理解问题，其中相对从句递归嵌套，从简单到复杂地创建了处理需求。每个句子都有一个句法上相同但语义上不合理的对应句子（例如，邮差开药，医生送信），并有六个测试表面理解、句法依赖性和因果推理的问题。测试六种模型发现，可能句子与不可能句子之间的性能差距随着复杂性的增加而系统性地扩大，模型的中位数差距高达26.8个百分点，量化了它们何时放弃结构分析转而使用语义关联。值得注意的是，语义合理性在关于结果行为的问题上损害了表现，因为遵循因果关系比语义连贯性更重要。推理模型提高了准确性，但它们的推理过程显示了语义捷径、过度思考和拒绝回答。与随复杂性增加其合理性优势系统性扩大的模型不同，人类在语义效果上表现出变化。CenterBench提供了第一个框架，用于识别模型何时从结构分析转向模式匹配。

Summary / 总结

The study aims to evaluate whether language models understand the structure of sentences or rely on familiar semantic patterns. It introduces CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences, testing six models. The results show that performance gaps between plausible and implausible sentences increase with complexity, indicating that models often rely on semantic shortcuts rather than structural analysis. Notably, semantic plausibility negatively impacts performance on questions about resulting actions, where causal relationships are crucial. The study provides insights into when models shift from structural analysis to pattern matching.

研究旨在评估语言模型是理解句法结构还是依赖于熟悉的语义模式。它引入了包含9,720个理解问题的CenterBench数据集，测试了六个模型。结果显示，随着句子复杂性的增加，合理句子和不合理句子之间的性能差距会增大，表明模型往往使用语义捷径而非结构分析。值得注意的是，语义合理性在关于结果动作的问题上会负面影响表现，因为因果关系更为重要。推理模型提高了准确性，但仍使用语义捷径。人类的表现则显示出语义效果的变异性，不同于模型。CenterBench提供了一个框架，用于识别模型何时从结构分析转向模式匹配。
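
The recursive nesting of center-embedded sentences can be shown with a small generator. It assumes a simple template in which `nouns[0]` is the main subject, `verbs[0]` is the main verb, and each deeper noun `nouns[i]` performs `verbs[i]` on `nouns[i-1]`. The template, the function names, and the gap helper are illustrative and are not taken from the CenterBench construction code.

```python
def center_embed(nouns, verbs):
    """Build a center-embedded sentence of depth len(nouns) - 1.

    Subjects are stacked left to right, then verbs are emitted innermost first,
    so the main verb comes last.
    """
    if not nouns or len(nouns) != len(verbs):
        raise ValueError("nouns and verbs must be non-empty and of equal length")
    subjects = " that ".join(f"the {n}" for n in nouns)
    predicate = " ".join(reversed(verbs))
    sentence = f"{subjects} {predicate}."
    return sentence[0].upper() + sentence[1:]


def plausibility_gap(acc_plausible, acc_implausible):
    """Accuracy gap in percentage points between plausible and implausible items."""
    return (acc_plausible - acc_implausible) * 100.0
```

Calling `center_embed(["cat", "dog"], ["meowed", "chased"])` reproduces the example from the abstract, and adding a third noun and verb yields a doubly nested sentence. Swapping in implausible noun-verb pairings with the same structure gives the matched counterpart.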

IIR-VLM: In-Context Instance-level Recognition for Large Vision-Language Models

Authors: Liang Shi, Wei Li, Kevin M Beussman, Lin Chen, Yun Fu

First: 2026-01-20T17:45:24+00:00 · Latest: 2026-01-20T17:45:24+00:00

Abstract

Instance-level recognition (ILR) concerns distinguishing individual instances from one another, with person re-identification as a prominent example. Despite the impressive visual perception capabilities of modern VLMs, we find their performance on ILR unsatisfactory, often dramatically underperforming domain-specific ILR models. This limitation hinders many practical applications of VLMs, e.g., where recognizing familiar people and objects is crucial for effective visual understanding. Existing solutions typically learn to recognize instances one at a time using instance-specific datasets, which not only incur substantial data collection and training costs but also struggle with fine-grained discrimination. In this work, we propose IIR-VLM, a VLM enhanced for In-context Instance-level Recognition. We integrate pre-trained ILR expert models as auxiliary visual encoders to provide specialized features for learning diverse instances, which enables VLMs to learn new instances in-context in a one-shot manner. Further, IIR-VLM leverages this knowledge for instance-aware visual understanding. We validate IIR-VLM's efficacy on existing instance personalization benchmarks. Finally, we demonstrate its superior ILR performance on a challenging new benchmark, which assesses ILR capabilities across varying difficulty and diverse categories, with person, face, pet and general objects as the target instances.

中文标题/摘要

标题：IIR-VLM：面向大型视觉语言模型的上下文内实例级识别

实例级识别（ILR）关注的是区分单个实例，以人员再识别为例。尽管现代VLMs在视觉感知方面表现出色，但我们发现它们在ILR上的表现令人不满意，经常远逊于专门的ILR模型。这一限制阻碍了许多VLMs的实际应用，例如，在有效视觉理解中识别熟悉的人和物体至关重要。现有解决方案通常使用特定实例的数据集逐个学习实例识别，这不仅增加了数据收集和训练成本，还难以进行细微区分。在本文中，我们提出了IIR-VLM，这是一种增强的VLM，用于上下文感知实例级识别。我们整合了预训练的ILR专家模型作为辅助视觉编码器，以提供专门的特征来学习多样化的实例，从而使VLMs能够以单次学习的方式在上下文中学习新实例。此外，IIR-VLM 利用这些知识进行实例感知的视觉理解。我们通过现有实例个性化基准验证了IIR-VLM 的有效性。最后，我们在一个具有挑战性的新基准上展示了其优越的ILR性能，该基准评估了不同难度和多样类别的ILR能力，其中人员、面部、宠物和通用物体是实例。

Summary / 总结

The research aims to improve the instance-level recognition (ILR) performance of large vision-language models (VLMs), which often fall well short of domain-specific ILR models. The method integrates pre-trained ILR expert models as auxiliary visual encoders so that the VLM can learn new instances in context in a one-shot manner. IIR-VLM is validated on existing instance personalization benchmarks and shows superior ILR performance on a challenging new benchmark that spans varying difficulty and diverse categories (persons, faces, pets, and general objects).

研究旨在通过解决大型视觉语言模型（VLM）在实例级识别（ILR）方面的不足，提升其ILR能力，这些不足使得VLM的表现远逊于专门的ILR模型。方法是将预训练的ILR专家模型作为辅助编码器集成进来，使VLM能够以单次学习的方式学习新的实例。关键发现表明，IIR-VLM在现有基准测试和一个具有挑战性的新基准测试中均表现出色，展示了其在各种类别中的ILR性能提升。

When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

Authors: Xunyi Jiang, Dingyi Chang, Julian McAuley, Xin Xu

First: 2025-10-08T17:06:07+00:00 · Latest: 2026-01-20T17:39:24+00:00

Comments: Accepted to EACL 2026 Main Conference

Abstract

The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While a substantial body of work continues to rely on popular but old benchmarks, their temporal misalignment with real-world facts and modern LLMs, and its effect on LLM factuality evaluation, remain underexplored. Therefore, in this work, we present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years. An up-to-date fact retrieval pipeline and three metrics are tailored to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis illustrate that a considerable portion of samples in the widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed to assess the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue. Code is available at https://github.com/JiangXunyi/BenchAge.

中文标题/摘要

标题：当基准老化时：大型语言模型事实性评估中的时间错位

大型语言模型（LLMs）和现实世界的快速演进已经超出了广泛使用的评估基准的静态特性，这引发了对其评估LLM事实性的可靠性的担忧。尽管大量研究仍然依赖于流行但过时的基准，但这些基准与现实世界事实和现代LLM的时间错位及其对LLM事实性评估的影响尚未得到充分探索。因此，在这项工作中，我们通过检查五个流行的事实性基准和八个发布于不同年份的LLM，系统地研究了这一问题。我们定制了一个最新的事实检索管道和三个指标来量化基准老化及其对LLM事实性评估的影响。实验结果和分析表明，广泛使用的事实性基准中相当一部分样本过时，导致对LLM事实性的评估不可靠。我们希望我们的工作能提供一个测试平台，用于评估基准在LLM事实性评估中的可靠性，并激发更多关于基准老化问题的研究。代码可在 https://github.com/JiangXunyi/BenchAge/ 获取。

Summary / 总结

This work investigates the issue of benchmark aging in evaluating the factuality of large language models (LLMs) by examining five popular factuality benchmarks and eight LLMs released across different years. The authors develop an up-to-date fact retrieval pipeline and three metrics to quantify benchmark aging and its impact on LLM factuality evaluation. The results show that a significant portion of samples in widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. The study aims to provide a testbed for assessing benchmark reliability and inspire more research on the benchmark aging issue.

这项研究通过检查五个流行的事实性基准和八个在不同年份发布的大型语言模型（LLMs），探讨了评估LLMs事实性的基准老化问题。作者开发了一个最新的事实检索管道和三个指标来量化基准老化及其对LLMs事实性评估的影响。结果显示，广泛使用的事实性基准中有相当一部分样本过时，导致对LLMs事实性的评估不可靠。该研究旨在提供一个基准可靠性的测试平台，并激发更多关于基准老化问题的研究。

A model of errors in transformers

Authors: Suvrat Raju, Praneeth Netrapalli

First: 2026-01-20T17:27:03+00:00 · Latest: 2026-01-20T17:27:03+00:00

Comments: 8+17 pages

Abstract

We study the error rate of LLMs on tasks like arithmetic that require a deterministic output, and repetitive processing of tokens drawn from a small set of alternatives. We argue that incorrect predictions arise when small errors in the attention mechanism accumulate to cross a threshold, and use this insight to derive a quantitative two-parameter relationship between the accuracy and the complexity of the task. The two parameters vary with the prompt and the model; they can be interpreted in terms of an elementary noise rate, and the number of plausible erroneous tokens that can be predicted. Our analysis is inspired by an "effective field theory" perspective: the LLM's many raw parameters can be reorganized into just two parameters that govern the error rate. We perform extensive empirical tests, using Gemini 2.5 Flash, Gemini 2.5 Pro and DeepSeek R1, and find excellent agreement between the predicted and observed accuracy for a variety of tasks, although we also identify deviations in some cases. Our model provides an alternative to suggestions that errors made by LLMs on long repetitive tasks indicate the "collapse of reasoning", or an inability to express "compositional" functions. Finally, we show how to construct prompts to reduce the error rate.

中文标题/摘要

标题：Transformer中的错误模型

我们研究了LLMs在需要确定性输出和重复处理来自小集合替代项的标记的任务（如算术）中的错误率。我们认为，当注意力机制中的小错误累积超过阈值时，会导致错误预测。我们利用这一见解推导出准确性和任务复杂性之间的定量双参数关系。这两个参数随提示和模型而变化；它们可以解释为基本噪声率和可预测的错误标记数量。我们的分析受到“有效场论”视角的启发：LLM的许多原始参数可以重新组织为仅两个参数来控制错误率。我们进行了广泛的实证测试，使用Gemini 2.5 Flash、Gemini 2.5 Pro和DeepSeek R1，并发现预测的准确性和观察到的准确率在多种任务中表现出很好的一致性，尽管在某些情况下也发现了偏差。我们的模型提供了一种替代观点，即LLMs在长重复任务中犯的错误并不表明“推理的崩溃”或无法表达“组合”函数。最后，我们展示了如何构建提示以降低错误率。

Summary / 总结

This study investigates the error rates of large language models (LLMs) on tasks that require deterministic outputs and repetitive processing of tokens. The authors argue that incorrect predictions arise when small errors in the attention mechanism accumulate past a threshold, and from this they derive a two-parameter relationship between accuracy and task complexity. Extensive empirical tests with Gemini 2.5 Flash, Gemini 2.5 Pro, and DeepSeek R1 show excellent agreement between predicted and observed accuracy, though some deviations are noted. The model offers an alternative to the view that LLM errors on long repetitive tasks indicate a collapse of reasoning or an inability to express compositional functions, and the authors show how to construct prompts that reduce the error rate.

该研究探讨了大型语言模型（LLMs）在需要确定性输出和重复处理的任务上的错误率。作者提出，错误是由于注意力机制中的小误差累积造成的，导致任务复杂性和准确性的两参数关系。在Gemini 2.5 Flash、Gemini 2.5 Pro和DeepSeek R1上的实证测试显示，预测的准确性和观察到的准确率之间有很好的一致性，尽管在某些情况下也发现了偏差。该模型表明，错误并不意味着推理的崩溃，而是可管理的噪声率和可能的错误预测。作者还展示了如何构建提示以降低错误率。

Human Values in a Single Sentence: Moral Presence, Hierarchies, and Transformer Ensembles on the Schwartz Continuum

Authors: Víctor Yeste, Paolo Rosso

First: 2026-01-20T17:25:33+00:00 · Latest: 2026-01-20T17:25:33+00:00

Comments: Code: https://github.com/VictorMYeste/human-value-detection, 37 pages, 4 figures

Abstract

We study sentence-level identification of the 19 values in the Schwartz motivational continuum as a concrete formulation of human value detection in text. The setting (out-of-context sentences from news and political manifestos) features sparse moral cues and severe class imbalance. This combination makes fine-grained sentence-level value detection intrinsically difficult, even for strong modern neural models. We first operationalize a binary moral presence task ("does any value appear?") and show that it is learnable from single sentences (positive-class F1 ≈ 0.74 with calibrated thresholds). We then compare a presence-gated hierarchy to a direct multi-label classifier under matched compute, both based on DeBERTa-base and augmented with lightweight signals (prior-sentence context, LIWC-22/eMFD/MJD lexica, and topic features). The hierarchy does not outperform direct prediction, indicating that gate recall limits downstream gains. We also benchmark instruction-tuned LLMs (Gemma 2 9B, Llama 3.1 8B, Mistral 8B, and Qwen 2.5 7B) in zero-/few-shot and QLoRA setups and build simple ensembles; a soft-vote supervised ensemble reaches macro-F1 0.332, significantly surpassing the best single supervised model and exceeding prior English-only baselines. Overall, in this scenario, lightweight signals and small ensembles yield the most reliable improvements, while hierarchical gating offers limited benefit. We argue that, under an 8 GB single-GPU constraint and at the 7-9B scale, carefully tuned supervised encoders remain a strong and compute-efficient baseline for structured human value detection, and we outline how richer value structure and sentence-in-document context could further improve performance.
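
To make the soft-vote ensemble and the macro-F1 metric mentioned in this abstract concrete, here is a minimal sketch in plain Python. It assumes each model emits one probability per value label per sentence. The function names (`soft_vote`, `binarize`, `macro_f1`), the toy probabilities, and the 0.5 thresholds are illustrative assumptions, not the authors' code or settings.

```python
from typing import List

Matrix = List[List[float]]


def soft_vote(prob_sets: List[Matrix]) -> Matrix:
    """Average per-label probabilities across models.

    prob_sets[m][i][k] is model m's probability that sentence i carries label k.
    """
    n_models = len(prob_sets)
    n_items = len(prob_sets[0])
    n_labels = len(prob_sets[0][0])
    return [
        [sum(p[i][k] for p in prob_sets) / n_models for k in range(n_labels)]
        for i in range(n_items)
    ]


def binarize(probs: Matrix, thresholds: List[float]) -> List[List[int]]:
    """Turn probabilities into 0/1 decisions with one threshold per label."""
    return [[int(p >= t) for p, t in zip(row, thresholds)] for row in probs]


def macro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    n_labels = len(y_true[0])
    f1_scores = []
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        # A label with no true and no predicted positives scores 0 here (an assumption).
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_labels


# Toy data (hypothetical): two models, three sentences, two value labels.
model_a = [[0.9, 0.2], [0.4, 0.7], [0.1, 0.3]]
model_b = [[0.7, 0.4], [0.2, 0.9], [0.3, 0.1]]
gold = [[1, 0], [0, 1], [0, 1]]

averaged = soft_vote([model_a, model_b])
predictions = binarize(averaged, thresholds=[0.5, 0.5])
print(predictions, round(macro_f1(gold, predictions), 3))
```

Because macro-F1 averages per-label scores without weighting by frequency, a missed positive on a rare label lowers the score as much as one on a common label, which is one reason severe class imbalance makes this task hard.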

中文标题/摘要

标题：一句话中的人类价值观：道德存在、层级结构与在施瓦茨连续统上的变换器集成

我们研究了施瓦茨动机连续统中19种价值观的句子级识别，作为文本中人类价值观检测的具体表述。该设置（来自新闻和政治宣言的脱离上下文的句子）的特征是稀疏的道德线索和严重的类别不平衡。这种组合使得细粒度的句子级价值观检测即使对于强大的现代神经模型也固有地具有挑战性。我们首先将二元道德存在任务（“是否有任何价值观出现？”）具体化，并表明可以从单个句子中学习（正类F1约等于0.74，带有校准的阈值）。然后，在匹配计算资源的情况下，我们将存在门控层次结构与直接多标签分类器进行比较，两者均基于DeBERTa-base，并增强轻量级信号（前一句子上下文、LIWC-22/eMFD/MJD词典和主题特征）。层次结构没有优于直接预测，表明门控召回限制了下游收益。我们还在零/少量样本和QLoRA设置中基准测试了指令调优的大语言模型（Gemma 2 9B、Llama 3.1 8B、Mistral 8B和Qwen 2.5 7B），并构建了简单的集成；软投票监督集成达到宏F1 0.332，显著优于最佳单个监督模型，并超越了之前的仅英语基线。总体而言，在这种情况下，轻量级信号和小型集成提供了最可靠的改进，而层次门控提供的益处有限。我们认为，在8 GB单GPU约束和7-9B规模下，经过仔细调整的监督编码器仍然是结构化人类价值观检测的强大且计算高效的基线，并概述了更丰富价值观结构和句子在文档中的上下文如何进一步提高性能。

Summary / 总结

This study focuses on identifying 19 values from the Schwartz motivational continuum in out-of-context sentences from news and political manifestos, which are challenging due to sparse moral cues and class imbalance. The research first addresses a binary task of detecting any moral presence and then compares a hierarchical model with a direct multi-label classifier. The hierarchical model did not outperform the direct approach, suggesting that the recall of the gate limits its effectiveness. The study also benchmarks instruction-tuned large language models and builds ensembles, finding that a soft-vote supervised ensemble achieved the best macro-F1 score of 0.332, surpassing single models and previous baselines. The findings indicate that lightweight signals and small ensembles are more reliable, while hierarchical gating offers limited benefits under the given constraints.

研究集中在识别来自新闻和政治宣言的单句中施瓦茨动机连续体中的19种价值观，由于稀疏的道德线索和类别不平衡，这极具挑战性。研究首先处理检测任何道德存在的二元任务，然后比较了层次模型与直接多标签分类器，发现层次模型不如直接方法表现好。研究还对指令调优的大语言模型进行了基准测试，并构建了集成模型，结果显示软投票的监督集成模型达到了最佳性能，超越了单模型和先前的基线。研究发现，在此场景中，轻量级信号和小型集成比层次门控更具有效性。

Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance

Authors: Qianli Ma, Chang Guo, Zhiheng Tian, Siyu Wang, Jipeng Xiao, Yuanhao Yue, Zhipeng Zhang

First: 2026-01-20T17:23:51+00:00 · Latest: 2026-01-20T17:23:51+00:00

Abstract

Writing effective rebuttals is a high-stakes task that demands more than linguistic fluency, as it requires precise alignment between reviewer intent and manuscript details. Current solutions typically treat this as a direct-to-text generation problem, suffering from hallucination, overlooked critiques, and a lack of verifiable grounding. To address these limitations, we introduce RebuttalAgent, the first multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. Our system decomposes complex feedback into atomic concerns and dynamically constructs hybrid contexts by synthesizing compressed summaries with high-fidelity text while integrating an autonomous and on-demand external search module to resolve concerns requiring outside literature. By generating an inspectable response plan before drafting, RebuttalAgent ensures that every argument is explicitly anchored in internal or external evidence. We validate our approach on the proposed RebuttalBench and demonstrate that our pipeline outperforms strong baselines in coverage, faithfulness, and strategic coherence, offering a transparent and controllable assistant for the peer review process. Code will be released.

中文标题/摘要

标题：Paper2Rebuttal: 一种透明作者回应辅助的多智能体框架

撰写有效的反驳是一项高风险的任务，不仅需要语言流畅，还需要精确对齐审稿人意图和手稿细节。当前解决方案通常将此视为直接的文本生成问题，容易出现幻觉、遗漏批评和缺乏可验证的依据。为解决这些局限性，我们引入了RebuttalAgent，这是第一个将反驳生成重新定义为以证据为中心的规划任务的多智能体框架。我们的系统将复杂的反馈分解为基本的关注点，并通过合成压缩摘要与高保真文本来动态构建混合上下文，同时整合一个自主的按需外部搜索模块来解决需要外部文献解决的关注点。通过在起草前生成可检查的回应计划，RebuttalAgent确保每个论点都明确地锚定在内部或外部证据中。我们在提出的RebuttalBench上验证了我们的方法，并证明我们的管道在覆盖面、忠实度和战略连贯性方面优于强大的基线，为同行评审过程提供了一个透明和可控的助手。代码将被发布。

Summary / 总结

The paper addresses the challenges of writing effective rebuttals, which require precise alignment between reviewer intent and manuscript details. It introduces RebuttalAgent, a multi-agent framework that reframes rebuttal generation as an evidence-centric planning task. The system decomposes feedback into atomic concerns, synthesizes compressed summaries, and integrates an external search module to ensure every argument is grounded in evidence. Experiments show that RebuttalAgent outperforms strong baselines in coverage, faithfulness, and strategic coherence, providing a transparent and controllable assistant for the peer review process.

论文通过引入RebuttalAgent框架，解决生成有效反驳的挑战，该框架将任务重新定义为以证据为中心的规划问题。它将反馈分解为基本关切，并使用结合压缩摘要和外部搜索的混合上下文，确保每个论点都基于证据。实验表明，RebuttalAgent在覆盖面、忠实性和战略连贯性方面优于强基线，提供了一个透明且可控的同行评审助手。

Generative Language Models on Nucleotide Sequences of Human Genes

Authors: Musa Nuri Ihtiyar, Arzucan Ozgur

Venue: Scientific Reports, 2024, 14.1: 22204

First: 2023-07-20T06:59:02+00:00 · Latest: 2026-01-20T17:19:40+00:00

Abstract

Language models, especially transformer-based ones, have achieved colossal success in NLP; models such as BERT for NLU and GPT-3 for NLG are prominent examples. If we consider DNA sequences as text written with an alphabet of four letters representing the nucleotides, they are similar in structure to natural languages. This similarity has led to the development of discriminative language models such as DNABert in the field of DNA-related bioinformatics. To our knowledge, however, the generative side of the coin is still largely unexplored. Therefore, we have focused on developing an autoregressive generative language model, in the style of GPT-3, for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we decided to conduct our study on a smaller scale and focus on nucleotide sequences of human genes rather than the whole DNA. This decision does not change the structure of the problem, as both DNA and genes can be considered 1D sequences consisting of four different nucleotides without losing much information and without oversimplification. First, we systematically studied an almost entirely unexplored problem and observed that RNNs perform best, while simple techniques such as N-grams are also promising. A further benefit was learning how to work with generative models on a language that, unlike natural languages, we do not understand. We also note the importance of using real-world tasks beyond classical metrics such as perplexity. In addition, we examined whether the data-hungry nature of these models can be altered by choosing a language with a minimal vocabulary size (four, one per nucleotide type), since such a language might make the problem easier. However, we found that this did not substantially change the amount of data required.
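
As a rough illustration of the N-gram baseline and the perplexity metric mentioned in this abstract, the following sketch trains a character-level N-gram model over the four-letter nucleotide alphabet. The add-one smoothing, the `^` start-of-sequence padding, the function names, and the toy sequence are all assumptions made for this example; the paper's actual N-gram configuration is not described here.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"  # the four nucleotides form the entire vocabulary


def train_ngram(sequences, n):
    """Count how often each nucleotide follows each (n-1)-length context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        padded = "^" * (n - 1) + seq  # '^' marks the start of a sequence
        for i in range(n - 1, len(padded)):
            context = padded[i - n + 1:i]
            counts[context][padded[i]] += 1
    return counts


def prob(counts, context, symbol, alpha=1.0):
    """Smoothed probability of `symbol` given `context` (add-alpha smoothing)."""
    ctx = counts.get(context, {})
    total = sum(ctx.values())
    return (ctx.get(symbol, 0) + alpha) / (total + alpha * len(ALPHABET))


def perplexity(counts, sequences, n):
    """exp of the average negative log-likelihood per nucleotide."""
    log_sum, n_tokens = 0.0, 0
    for seq in sequences:
        padded = "^" * (n - 1) + seq
        for i in range(n - 1, len(padded)):
            context = padded[i - n + 1:i]
            log_sum += math.log(prob(counts, context, padded[i]))
            n_tokens += 1
    return math.exp(-log_sum / n_tokens)


# Toy example (hypothetical sequence, not real gene data).
model = train_ngram(["ACGTACGTACGT"], n=2)
print(perplexity(model, ["ACGTACGTACGT"], n=2))
```

With a four-symbol vocabulary, a model that has learned nothing assigns probability 1/4 to every nucleotide, so its perplexity is 4; any useful model should score below that ceiling on held-out sequences.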

中文标题/摘要

标题：生成语言模型在人类基因核苷酸序列上的应用

语言模型，尤其是基于变换器的模型，在自然语言处理（NLP）中取得了巨大的成功。例如，BERT在自然语言理解（NLU）方面和GPT-3在自然语言生成（NLG）方面的研究非常重要。如果我们把DNA序列看作是由四个字母表示的核苷酸组成的文本，它们在结构上类似于自然语言。这种相似性导致了像DNABert这样的判别语言模型在DNA相关的生物信息学领域的发展。然而，据我们所知，生成语言模型这一面仍然很大程度上未被探索。因此，我们专注于开发类似于GPT-3的自回归生成语言模型用于DNA序列。由于处理整个DNA序列需要大量的计算资源，我们决定缩小研究规模，专注于人类基因的核苷酸序列，而不是整个DNA。这一决定并没有改变问题的结构，因为DNA和基因都可以被视为由四种不同核苷酸组成的1D序列，不会丢失太多信息也不会过度简化。首先，我们系统地研究了一个几乎完全未被探索的问题，并观察到循环神经网络（RNNs）表现最佳，而简单的N-gram技术也很有前景。另一个有益之处是学习如何在我们不理解的语言上使用生成模型，而不是自然语言。我们注意到，除了传统的困惑度等经典指标之外，使用真实世界任务的重要性。此外，我们还研究了这些模型的高数据需求是否可以通过选择词汇量最小的语言来改变，由于有四种不同类型的核苷酸，这里选择的是四个。我们审查这一选择的原因是，选择这样的语言可能会使问题更容易。然而，在这项研究中，我们发现这并没有显著改变所需的数据量。

Domain-Adaptation through Synthetic Data: Fine-Tuning Large Language Models for German Law

Authors: Ali Hamza Bashir, Muhammad Rehan Khalid, Kostadin Cvejoski, Jana Birr, Jule Berghaus, Armin Berger, Sandra Halscheidt, Christian Temath, Rafet Sifa, David Berghaus

First: 2026-01-20T17:11:51+00:00 · Latest: 2026-01-20T17:11:51+00:00

Abstract

Large language models (LLMs) often struggle in specialized domains such as legal reasoning due to limited expert knowledge, resulting in factually incorrect outputs or hallucinations. This paper presents an effective method for adapting advanced LLMs to German legal question answering through a novel synthetic data generation approach. In contrast to costly human-annotated resources or unreliable synthetic alternatives, our approach systematically produces high-quality, diverse, and legally accurate question-answer pairs directly from authoritative German statutes. Using rigorous automated filtering methods and parameter-efficient fine-tuning techniques, we demonstrate that LLMs adapted with our synthetic dataset significantly outperform their baseline counterparts on German legal question answering tasks. Our results highlight the feasibility of using carefully designed synthetic data as a robust alternative to manual annotation in high-stakes, knowledge-intensive domains.

中文标题/摘要

标题：通过合成数据进行领域适应：大型语言模型在德国法律中的微调

大型语言模型（LLMs）在诸如法律推理等专门领域中常常由于缺乏专家知识而遇到困难，导致事实错误的输出或幻觉。本文提出了一种通过新颖的合成数据生成方法将先进LLMs适应到德国法律问答的有效方法。与昂贵的人工标注资源或不可靠的合成替代品不同，我们的方法系统地从权威的德国法律条文中生成高质量、多样且法律准确的问题-答案对。通过严格的自动化筛选方法和参数高效的微调技术，我们证明使用我们的合成数据集适应的LLMs在德国法律问答任务中显著优于其基线版本。我们的结果强调了在高风险、知识密集型领域中使用精心设计的合成数据作为手动标注的稳健替代方案的可行性。

Summary / 总结

This paper addresses the challenge of adapting large language models (LLMs) to specialized domains like German legal reasoning, where factual accuracy is crucial. The authors propose a method that generates synthetic question-answer pairs from authoritative German statutes, avoiding the need for costly human annotation. Through parameter-efficient fine-tuning, the adapted models show significant improvement over baseline models on German legal question answering tasks, demonstrating the effectiveness of synthetic data in high-stakes, knowledge-intensive domains.

该论文针对大型语言模型（LLMs）在专业领域如德国法律推理中的挑战，这些模型往往会产生不准确或虚构的输出。作者提出了一种方法，直接从权威的德国法律条文中生成合成的问题-答案对，使用自动过滤和参数高效微调技术。实验结果表明，使用该合成数据集微调的LLMs在德国法律问答任务上的表现优于基线模型，展示了在高风险、知识密集型领域使用精心设计的合成数据的有效性。

ConceptCaps -- a Distilled Concept Dataset for Interpretability in Music Models

Authors: Bruno Sienkiewicz, Łukasz Neumann, Mateusz Modrzejewski

First: 2026-01-20T17:04:08+00:00 · Latest: 2026-01-20T17:04:08+00:00

Abstract

Concept-based interpretability methods like TCAV require clean, well-separated positive and negative examples for each concept. Existing music datasets lack this structure: tags are sparse, noisy, or ill-defined. We introduce ConceptCaps, a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. Our pipeline separates semantic modeling from text generation: a VAE learns plausible attribute co-occurrence patterns, a fine-tuned LLM converts attribute lists into professional descriptions, and MusicGen synthesizes corresponding audio. This separation improves coherence and controllability over end-to-end approaches. We validate the dataset through audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns. Dataset and code are available online.
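
The need for clean positive and negative examples can be seen in a small sketch of the TCAV idea: a concept direction is estimated from layer activations of concept examples versus counterexamples, and the score is the fraction of inputs whose class-logit gradient points along that direction. The sketch below is a simplification under stated assumptions: it uses a mean-difference direction in place of the linear classifier that TCAV trains, and the activations and gradients are hand-written toy vectors rather than values from a music model. All names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos_acts: List[Vector], neg_acts: List[Vector]) -> Vector:
    """Unit vector pointing from negative-example activations toward positive ones.

    Assumption: this mean-difference direction stands in for the normal of the
    linear classifier used in TCAV.
    """
    pos_mean, neg_mean = mean_vector(pos_acts), mean_vector(neg_acts)
    diff = [p - q for p, q in zip(pos_mean, neg_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("positive and negative examples are not separated")
    return [d / norm for d in diff]


def tcav_score(gradients: List[Vector], direction: Vector) -> float:
    """Fraction of inputs whose gradient has a positive component along the direction."""
    positive = sum(
        1 for g in gradients if sum(a * b for a, b in zip(g, direction)) > 0
    )
    return positive / len(gradients)


# Toy 2-D activations (hypothetical): concept examples vs. counterexamples.
pos_acts = [[2.0, 0.0], [3.0, 1.0]]
neg_acts = [[0.0, 0.0], [1.0, 1.0]]
direction = concept_direction(pos_acts, neg_acts)

# Toy gradients of a class logit with respect to the same layer, one per input.
grads = [[0.5, -1.0], [0.2, 3.0], [-0.1, 0.4], [1.0, 0.0]]
print(direction, tcav_score(grads, direction))
```

If the positive and negative sets overlap heavily, the estimated direction becomes unstable or degenerate, which is the practical reason sparse or noisy tags make concept probing unreliable.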

中文标题/摘要

标题：ConceptCaps —— 一种用于音乐模型可解释性的提炼概念数据集

基于概念的可解释性方法如TCAV需要每个概念干净且清晰正负样本。现有音乐数据集缺乏这种结构：标签稀疏、噪音大或定义不明确。我们引入了ConceptCaps，一个包含23000个音乐-描述-音频三元组的数据集，具有来自200个属性分类学的显式标签。我们的管道将语义建模与文本生成分离：VAE学习合理的属性共现模式，微调的LLM将属性列表转换为专业描述，MusicGen合成相应的音频。这种分离在整体方法上提高了连贯性和可控性。我们通过音频-文本对齐（CLAP）、语言质量指标（BERTScore、MAUVE）和TCAV分析验证了数据集，确认概念探针能够恢复音乐上有意义的模式。数据集和代码已在线提供。

Summary / 总结

The research aims to provide a structured dataset for interpretability in music models, addressing the lack of clean, well-separated positive and negative examples in existing music datasets. The authors introduce ConceptCaps, a dataset of 23k music-caption-audio triplets with explicit labels from a 200-attribute taxonomy. The pipeline separates semantic modeling from text generation: a VAE models attribute co-occurrence, a fine-tuned LLM turns attribute lists into descriptions, and MusicGen synthesizes the audio. The authors report that this separation improves coherence and controllability over end-to-end approaches. The dataset is validated with audio-text alignment (CLAP), linguistic quality metrics (BERTScore, MAUVE), and TCAV analysis confirming that concept probes recover musically meaningful patterns.

研究旨在提供一个结构化的数据集以提高音乐模型的可解释性，解决现有音乐数据集中缺乏清晰正负样本的问题。方法是创建包含23k音乐-描述-音频三元组的ConceptCaps数据集，并使用200个属性分类学进行明确标注。该管道包括VAE进行语义建模、微调的LLM进行文本生成以及MusicGen进行音频合成。关键发现表明，ConceptCaps相比端到端方法提高了连贯性和可控性，并且TCAV分析证实概念探针能够恢复音乐上有意义的模式。