arXiv 论文速递

Snapshot: 20260201_0324

RedSage: A Cybersecurity Generalist LLM

Authors: Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani

Venue: ICLR 2026

First: 2026-01-29T18:59:57+00:00 · Latest: 2026-01-29T18:59:57+00:00

Comments: Accepted at ICLR 2026; Project page: https://risys-lab.github.io/RedSage/

Abstract

Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.

中文标题/摘要

标题：RedSage：网络安全通才大语言模型

网络安全操作需要能够支持多样化工作流程而不泄露敏感数据的辅助大语言模型。现有解决方案要么依赖于存在隐私风险的专有API，要么基于缺乏领域适应性的开源模型。为弥合这一差距，我们通过大规模网络过滤和手动收集高质量资源，整理了118亿token的网络安全持续预训练数据，覆盖28600份文档，涉及框架、攻击技术和安全工具。在此基础上，我们设计了一种代理增强流水线，模拟专家工作流程生成266000个网络安全多轮对话样本，用于监督微调。结合通用开源大语言模型数据，这些资源使我们能够训练出RedSage，这是一个开源、本地可部署的网络安全助手，具有领域感知的预训练和后训练。为了严格评估模型，我们引入了RedSage-Bench基准，包含30000个多项选择题和240个开放式问答项，涵盖网络安全知识、技能和工具专长。RedSage还在已有的网络安全基准（如CTI-Bench、CyberMetric、SECURE）和通用大语言模型基准上进行了评估，以评估其更广泛的泛化能力。在80亿参数规模下，RedSage取得了持续更好的结果，在网络安全基准上比基线模型最多高出5.59分，在Open LLM Leaderboard任务上最多高出5.05分。这些发现表明，领域感知的代理增强和预/后训练不仅可以增强网络安全特定的专业知识，还可以帮助提高一般推理和指令遵循能力。所有模型、数据集和代码均已公开。

Summary / 总结

RedSage is an open-source, locally deployable cybersecurity assistant LLM built to support diverse security workflows without exposing sensitive data. It is continually pretrained on 11.8B tokens of cybersecurity-focused data and fine-tuned on 266K multi-turn samples generated by an agentic augmentation pipeline. At the 8B scale, RedSage outperforms baseline models by up to 5.59 points on cybersecurity benchmarks and up to 5.05 points on Open LLM Leaderboard tasks, suggesting that domain-aware pre/post-training and agentic augmentation improve both security expertise and general capability. All models, datasets, and code are publicly available.

RedSage旨在支持多样化的网络安全工作流程同时保持数据隐私。它利用了11.8B个网络安全相关的数据令牌和一个代理增强管道生成了266K个多轮对话样本进行微调。RedSage在网络安全基准测试中的表现比基线模型高出最多5.59分，在通用大模型基准测试中的表现高出5.05分，展示了其在网络安全特定领域的专业知识和一般推理能力的增强。

Discovering Hidden Gems in Model Repositories

Authors: Jonathan Kahana, Eliahu Horwitz, Yedid Hoshen

First: 2026-01-29T18:59:55+00:00 · Latest: 2026-01-29T18:59:55+00:00

Abstract

Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or if superior models are systematically overlooked. Through an extensive evaluation of over 2,000 models, we show the prevalence of "hidden gems", unpopular fine-tunes that significantly outperform their popular counterparts. Notably, within the Llama-3.1-8B family, we find rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. However, discovering these models through exhaustive evaluation of every uploaded model is computationally infeasible. We therefore formulate model discovery as a Multi-Armed Bandit problem and accelerate the Sequential Halving search algorithm by using shared query sets and aggressive elimination schedules. Our method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.
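The search procedure described above can be sketched in a few lines. This is a minimal illustration of Sequential Halving with one shared query batch per round, not the authors' implementation: the `evaluate` callback, the `keep_fraction` parameter, and the toy candidates and thresholds are all assumptions introduced here.

```python
import random


def sequential_halving(candidates, evaluate, queries, queries_per_round,
                       keep_fraction=0.5, seed=0):
    """Return the last surviving candidate.

    candidates: list of model identifiers (hypothetical names).
    evaluate(candidate, query) -> 1 if the answer is correct, else 0.
    keep_fraction: share of survivors kept each round (0 < value < 1);
        smaller values give a more aggressive elimination schedule.
    """
    if not 0 < keep_fraction < 1:
        raise ValueError("keep_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    survivors = list(candidates)
    while len(survivors) > 1:
        # One batch per round, shared by every surviving candidate, so
        # scores are compared on identical queries.
        batch = rng.sample(queries, queries_per_round)
        scores = {
            c: sum(evaluate(c, q) for q in batch) / len(batch)
            for c in survivors
        }
        survivors.sort(key=lambda c: scores[c], reverse=True)
        keep = max(1, int(len(survivors) * keep_fraction))
        survivors = survivors[:keep]
    return survivors[0]


# Toy setup (illustrative only): a candidate answers query q correctly
# if and only if q is below its threshold.
THRESHOLDS = {"popular": 83, "gem": 96, "weak_a": 40, "weak_b": 55}
QUERIES = list(range(100))


def toy_evaluate(candidate, query):
    return 1 if query < THRESHOLDS[candidate] else 0
```

The total number of queries spent on a candidate is `queries_per_round` times the number of rounds it survives, so weak candidates are discarded cheaply while the budget concentrates on the strongest ones.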

中文标题/摘要

标题：在模型仓库中发现隐藏瑰宝

公共仓库托管着数百万个微调模型，但社区使用却不成比例地集中在少数基础检查点上。我们研究这种集中是否反映了有效的市场选择，还是优秀模型被系统性地忽视了。通过评估超过2000个模型，我们展示了“隐藏瑰宝”的普遍存在，这些不受欢迎的微调模型显著优于其受欢迎的同类。值得注意的是，在Llama-3.1-8B家族中，我们发现很少下载的检查点在不增加推理成本的情况下将数学性能从83.2%提高到96.0%。然而，通过彻底评估每个上传的模型来发现这些模型在计算上是不可行的。因此，我们将模型发现形式化为一个多臂老虎机问题，并通过使用共享查询集和激进的淘汰计划来加速逐步缩减搜索算法。我们的方法在每个候选模型上只需50次查询即可检索到顶级模型，将发现速度加快了超过50倍。

Summary / 总结

The study investigates 'hidden gems' in model repositories: unpopular fine-tuned models that significantly outperform their popular counterparts. In an evaluation of over 2,000 models, the researchers found rarely downloaded checkpoints in the Llama-3.1-8B family that raise math performance from 83.2% to 96.0% without increasing inference costs. Because exhaustively evaluating every uploaded model is computationally infeasible, they formulate model discovery as a Multi-Armed Bandit problem and accelerate the Sequential Halving search algorithm with shared query sets and aggressive elimination schedules. The method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.

研究调查了模型仓库中“隐藏瑰宝”的普遍存在，即不受欢迎但性能显著优于流行模型的微调模型。通过对超过2,000个模型的广泛评估，研究人员发现，在Llama-3.1-8B家族中，很少被下载的检查点可以将数学性能从83.2%提高到96.0%，而不增加推理成本。为了高效地发现这些模型，研究人员将模型发现问题表述为一个多臂老虎机问题，并通过使用共享查询集和激进的淘汰计划加速了顺序减半（Sequential Halving）搜索算法，使每个候选模型所需的查询次数低至50次，从而将发现速度提高了50多倍。

Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu

First: 2026-01-29T18:59:53+00:00 · Latest: 2026-01-29T18:59:53+00:00

Comments: 20 pages, 8 figures

Abstract

Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.

中文标题/摘要

标题：正确实现的混合线性注意力：高效蒸馏与适用于极长上下文的有效架构

混合Transformer架构结合了softmax注意力块和递归神经网络（RNN），在长上下文建模中表现出理想的性能-吞吐量权衡，但其采用和研究受到从头开始大规模预训练成本高昂的阻碍。一些研究表明，可以通过参数转移和知识蒸馏将预训练的softmax注意力块转换为RNN块。然而，这些转移方法需要大量的训练数据（超过100亿个标记），并且生成的混合模型在长上下文性能方面表现不佳，这是混合模型相对于基于Transformer的模型在推理速度上显著加速的场景。在本文中，我们提出了HALO（混合注意力通过层优化），这是一种将Transformer模型蒸馏为RNN-注意力混合模型的管道。然后，我们提出了HypeNet，这是一种通过新颖的位置编码方案（名为HyPE）和各种架构修改实现的具有优越长度泛化的混合架构。我们使用HALO将Qwen3系列转换为HypeNet，实现了与原始Transformer模型相当的性能，同时具有更好的长上下文性能和效率。转换只需要23亿个标记，不到其预训练数据的0.01%

Summary / 总结

This paper addresses the prohibitive cost of pre-training hybrid Transformer architectures for long-context modeling by presenting HALO, a pipeline for distilling Transformer models into RNN-attention hybrid models. The authors also introduce HypeNet, a hybrid architecture whose length generalization comes from a novel position encoding scheme (HyPE) and several architectural modifications. Converting the Qwen3 series into HypeNet with HALO requires just 2.3B tokens, compared with the more than 10B tokens needed by earlier transfer methods, while matching the original models' performance and improving long-context performance and efficiency.

本文通过提出HALO管道，将Transformer模型蒸馏为RNN-注意力混合模型，解决了采用混合Transformer架构进行长上下文建模的挑战。作者还引入了HypeNet，该混合架构通过新颖的位置编码方案和架构修改增强了长上下文性能。使用HALO，Qwen3系列成功转换为HypeNet，实现了与原始模型相当的性能，同时在长上下文效率和性能方面表现出色，仅需2.3B个训练token。

UEval: A Benchmark for Unified Multimodal Generation

Authors: Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu

First: 2026-01-29T18:59:52+00:00 · Latest: 2026-01-29T18:59:52+00:00

Abstract

We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to a MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
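One way to picture rubric-based scoring is as a checklist per question. The sketch below assumes equal-weight binary criteria, a per-question score on a 0-100 scale, and a plain average over questions; the abstract does not state the actual aggregation rule, so the `RubricResult` type and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Judge verdicts for one question (hypothetical structure)."""
    question_id: str
    criteria_met: List[bool]  # one flag per validated rubric criterion


def question_score(result: RubricResult) -> float:
    """Percentage of rubric criteria satisfied for a single question."""
    return 100.0 * sum(result.criteria_met) / len(result.criteria_met)


def benchmark_score(results: List[RubricResult]) -> float:
    """Unweighted mean of per-question scores (an assumed aggregation)."""
    return sum(question_score(r) for r in results) / len(results)
```

Splitting the judgment into many small yes/no criteria is what makes the scoring fine-grained: a model output can earn partial credit for a correct image with an incomplete explanation, rather than receiving a single holistic rating.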

中文标题/摘要

标题：UEval：统一多模态生成基准

我们介绍了UEval，一个用于评估统一模型的基准，即能够生成图像和文本的模型。UEval包含1000个专家精选的问题，要求模型输出中包含图像和文本，这些问题来源于8个实际任务。我们精选的问题涵盖了广泛的推理类型，从逐步指南到教科书解释。评估开放式的多模态生成并不简单，因为简单的LLM作为评判者的方法可能会忽略细节。不同于以往依赖多模态大型语言模型（MLLMs）来评估图像质量和文本准确性的工作，我们在UEval中设计了一个基于评分标准的评分系统。对于每个问题，参考图像和文本答案被提供给MLLM生成初始评分标准，其中包括多个评估标准，然后由人类专家进一步完善和验证这些评分标准。总共，UEval包含10,417个验证过的评分标准，使其能够实现可扩展和精细的自动评分。UEval对当前的统一模型具有挑战性：GPT-5-Thinking仅得66.4分，而最好的开源模型仅达到49.1分。我们观察到，推理模型通常优于非推理模型，从推理模型向非推理模型转移推理痕迹显著缩小了差距。这表明，对于需要复杂多模态理解和生成的任务，推理可能是重要的。

Summary / 总结

UEval is a benchmark for evaluating unified models that generate both images and text, comprising 1,000 expert-curated questions from 8 real-world tasks. It uses a rubric-based scoring system: for each question, reference images and text answers are given to an MLLM to draft an initial rubric, which human experts then refine and validate, yielding 10,417 validated criteria in total. UEval is challenging for current unified models, with GPT-5-Thinking scoring 66.4 out of 100 and the best open-source model scoring 49.1. The study finds that reasoning models often outperform non-reasoning ones, and that transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap.

UEval 是一个用于评估生成图像和文本的统一模型的基准，包含来自 8 个真实任务的 1,000 个专家精选问题。它采用由人类专家验证的评分系统来评估图像和文本输出。当前的统一模型表现不佳，GPT-5-Thinking 的得分为 66.4 分，而最好的开源模型得分为 49.1 分。研究发现，推理模型优于非推理模型，将推理能力转移到非推理模型中可以显著提高复杂多模态任务的表现。

Exploring Reasoning Reward Model for Agents

Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue

First: 2026-01-29T18:59:52+00:00 · Latest: 2026-01-29T18:59:52+00:00

Comments: Project page: https://github.com/kxfan2002/Reagent

Abstract

Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based reward for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.

中文标题/摘要

标题：探索代理推理奖励模型

代理强化学习（Agentic RL）在使代理执行复杂推理和工具使用方面取得了显著成功。然而，大多数方法仍然依赖于稀疏的结果奖励进行训练。这种反馈无法区分中间推理的质量，导致训练结果不佳。在本文中，我们引入了代理推理奖励模型（Agent-RRM），这是一种多方面的奖励模型，为代理轨迹提供结构化的反馈，包括（1）明确的推理轨迹，（2）聚焦的批评，通过突出推理缺陷提供改进指导，以及（3）总体评分，评估过程表现。利用这些信号，我们系统地研究了三种整合策略：Reagent-C（文本增强改进），Reagent-R（奖励增强指导）和Reagent-U（统一反馈整合）。在12个不同的基准测试中进行的广泛评估表明，Reagent-U带来了显著的性能提升，在GAIA上达到43.7%，在WebWalkerQA上达到46.2%，验证了我们推理奖励模型和训练方案的有效性。代码、模型和数据集均已发布，以促进未来的研究。

Summary / 总结

This paper addresses the limitations of sparse outcome-based rewards in Agentic Reinforcement Learning by proposing the Agent Reasoning Reward Model (Agent-RRM). Agent-RRM provides structured feedback through an explicit reasoning trace, focused critique, and an overall score. Three integration strategies—Reagent-C, Reagent-R, and Reagent-U—are evaluated, with Reagent-U showing significant performance improvements, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, thereby validating the effectiveness of the reasoning reward model and training schemes.

本文提出了一种名为Agent Reasoning Reward Model (Agent-RRM) 的多维度奖励模型，该模型提供包括推理轨迹、重点批评和总体评分在内的结构化反馈。三种集成策略——Reagent-C、Reagent-R 和 Reagent-U——被评估，其中Reagent-U 显示出显著的性能提升，在GAIA 和 WebWalkerQA 基准测试中分别达到了43.7% 和 46.2% 的成绩。

DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation

Authors: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu

Venue: www

First: 2026-01-29T18:59:51+00:00 · Latest: 2026-01-29T18:59:51+00:00

Comments: Project Page: https://www.infinitescript.com/project/dynamic-vla/ GitHub: https://github.com/hzxie/DynamicVLA

Abstract

Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.

中文标题/摘要

标题：DynamicVLA：一种用于动态物体操作的视觉-语言-行动模型

对于视觉-语言-行动（VLA）模型而言，操作动态物体仍然是一个开放的挑战，尽管它们在静态操作中表现出强大的泛化能力，但在需要快速感知、时间预测和连续控制的动态场景中却难以应对。我们提出了DynamicVLA，这是一种结合了时间推理和闭环适应的动态物体操作框架，通过三个关键设计实现：1）一个紧凑的0.4B VLA，使用卷积视觉编码器进行空间高效、结构忠实的编码，以实现快速多模态推理；2）连续推理，实现重叠的推理和执行，以降低延迟并及时适应物体运动；3）潜在感知动作流，通过强制执行时间对齐的动作执行来弥合感知-执行差距。为了填补动态操作数据的基础，我们引入了Dynamic Object Manipulation（DOM）基准，该基准从头开始构建，使用自动数据收集管道高效地收集了跨越2800个场景和206个物体的20万合成集，并能够快速收集2000个无需远程操作的真实世界集。广泛的评估表明，DynamicVLA在响应速度、感知和泛化方面取得了显著改进，将其定位为一种统一框架，适用于各种体态下的通用动态物体操作。

Summary / 总结

DynamicVLA is a framework for dynamic object manipulation that addresses the challenges of rapid perception, temporal anticipation, and continuous control. It integrates temporal reasoning and closed-loop adaptation through a compact vision-language-action model, continuous inference, and latent-aware action streaming. DynamicVLA demonstrates significant improvements in response speed, perception, and generalization, making it a unified framework for dynamic object manipulation across different embodiments. The framework is supported by the DOM benchmark, which provides a large dataset of synthetic and real-world episodes for training and evaluation.

DynamicVLA旨在通过集成时间推理和闭环适应来解决VLA模型在动态物体操作中的挑战。它使用采用卷积视觉编码器的紧凑0.4B VLA进行高效的多模态推理，使用连续推理以降低延迟，并使用潜在感知动作流来对齐感知和执行。该框架在新引入的DOM基准上进行了评估，该基准包括200K个合成片段和2K个真实世界片段，展示了在不同具身形态上的响应速度、感知和泛化能力的改进。

Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions

Authors: Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

First: 2026-01-29T18:59:24+00:00 · Latest: 2026-01-29T18:59:24+00:00

Comments: 26 pages, 31 figures, 13 tables. Project Page: https://sites.google.com/view/vi-probe/

Abstract

Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.

中文标题/摘要

标题：VLMs 是感知还是回忆？经典视觉错觉探究视觉感知与记忆

大型视觉-语言模型（VLMs）在原始图像上对经典视觉错觉通常会给出正确的回答，但在错觉因素反转后仍坚持相同的回答，尽管人类可以明显察觉视觉变化。这引发了一个基本问题：VLMs 是感知视觉变化还是仅仅回忆已记忆的模式？尽管已有几项研究注意到了这一现象，但其背后的成因仍不清楚。为了从观察转向系统理解，本文引入了VI-Probe，这是一种可控的视觉错觉框架，具有分级扰动和匹配的视觉对照（无错觉诱导器），以解开基于视觉的感知与语言驱动的回忆之间的关系。不同于以往工作主要关注平均准确率，我们使用极性反转一致性、模板固定指数和与匹配对照归一化的错觉乘数来衡量稳定性和敏感性。不同家族的实验表明，反应持久性源自多种原因而非单一机制。例如，GPT-5 表现出记忆覆盖，Claude-Opus-4.1 显示感知与记忆的竞争，而 Qwen 变体则表明视觉处理的限制。我们的发现挑战了单一成因的观点，并促使基于探针的评估，以衡量知识和对受控视觉变化的敏感性。数据和代码可在 https://sites.google.com/view/vi-probe/ 获取。

Summary / 总结

This paper investigates whether large vision-language models (VLMs) perceive visual changes or merely recall memorized patterns by using a controllable visual-illusion framework called VI-Probe. The study measures stability and sensitivity of VLMs using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier. The experiments across different VLMs reveal that response persistence arises from various causes, challenging the single-cause view and emphasizing the need for probing-based evaluation that assesses both knowledge and sensitivity to controlled visual change.

该研究通过使用可控视觉错觉框架VI-Probe，探讨大型视觉-语言模型（VLMs）是感知视觉变化还是仅回忆记忆模式。研究使用极性反转一致性、模板固定指数和与匹配控制相比的错觉乘数来衡量稳定性和敏感性。不同VLMs的实验结果显示，响应持久性源于多种原因，挑战了单一原因的观点，并强调了需要进行探针式评估，以衡量模型对受控视觉变化的知识和敏感性。

DynaWeb: Model-Based Reinforcement Learning of Web Agents

Authors: Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu

First: 2026-01-29T18:59:07+00:00 · Latest: 2026-01-29T18:59:07+00:00

Abstract

The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.
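The training loop described above, imagined rollouts inside a learned world model mixed with real expert trajectories, can be sketched as follows. This is an illustration under stated assumptions rather than DynaWeb's code: the `policy` and `world_model` callables, the fixed `horizon`, and the `expert_prob` mixing rate are hypothetical (the abstract says only that expert trajectories are "randomly interleaved").

```python
import random


def imagine_rollout(policy, world_model, start_state, horizon):
    """Roll the policy forward inside the learned world model.

    policy(state) -> action; world_model(state, action) -> next state.
    No live website is contacted: every transition is predicted.
    """
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy(state)
        state = world_model(state, action)
        trajectory.append((action, state))
    return trajectory


def build_batch(expert_trajectories, rollout_fn, batch_size, expert_prob, rng):
    """Randomly interleave expert trajectories with imagined rollouts.

    expert_prob is an assumed knob: the chance that a batch slot is
    filled with a real expert trajectory instead of a fresh rollout.
    """
    batch = []
    for _ in range(batch_size):
        if expert_trajectories and rng.random() < expert_prob:
            batch.append(("expert", rng.choice(expert_trajectories)))
        else:
            batch.append(("imagined", rollout_fn()))
    return batch
```

Keeping the source tag on each batch item makes it possible to treat the two kinds of data differently downstream, for example by applying an imitation loss to expert items and a reinforcement learning loss to imagined ones; whether DynaWeb does so is not stated in the abstract.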

中文标题/摘要

标题：DynaWeb：基于模型的强化学习网页代理

自主网页代理的发展，由大型语言模型（LLMs）和强化学习（RL）驱动，代表了通用人工智能助手的重要一步。然而，训练这些代理受到与实时互联网交互的挑战的严重阻碍，这些挑战包括效率低下、成本高昂和风险高。基于模型的强化学习（MBRL）提供了一种有希望的解决方案，通过学习环境模型来实现模拟交互。本文介绍了DynaWeb，这是一种新颖的MBRL框架，通过与训练以预测给定代理行为的自然网页表示的网页世界模型交互来训练网页代理。该模型充当合成的网页环境，代理策略可以在其中生成大量回放动作轨迹以实现高效的在线强化学习。除了免费策略回放外，DynaWeb还整合了来自训练数据的真实专家轨迹，这些轨迹在训练过程中随机与策略回放交织，以提高稳定性和样本效率。在具有挑战性的WebArena和WebVoyager基准测试上的实验表明，DynaWeb能够一致且显著地提高最先进的开源网页代理模型的性能。我们的研究结果证明了通过想象训练网页代理的可行性，提供了一种可扩展且高效的方法来扩展在线代理性强化学习。

Summary / 总结

DynaWeb is a model-based reinforcement learning framework designed to train web agents using a simulated web environment. It addresses the challenges of real-world interaction by learning a world model that predicts web page representations based on agent actions. DynaWeb incorporates both free policy rollouts and interleaved real expert trajectories to enhance stability and sample efficiency. Experiments show that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models on the WebArena and WebVoyager benchmarks.

DynaWeb 是一种基于模型的强化学习框架，旨在通过模拟的网络环境来训练网络代理。它通过学习一个能够根据代理行为预测网页表示的世界模型来解决现实世界网络交互的挑战。DynaWeb 结合了自由策略模拟和真实专家轨迹，以提高稳定性和效率。实验结果表明，DynaWeb 显著提升了最先进的开源网络代理模型的性能。

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch

First: 2026-01-29T18:58:47+00:00 · Latest: 2026-01-29T18:58:47+00:00

Abstract

Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions .

中文标题/摘要

标题：FineInstructions：将合成指令扩展到预训练规模

由于监督训练数据有限，大型语言模型（LLMs）通常通过在大量无结构文本数据上使用自我监督的“预测下一个词”目标进行预训练。为了使最终模型对用户有用，它还会在少量“指令调优”数据上进行进一步训练，这些数据由指令和响应的监督训练示例组成。为了克服有限的监督数据，我们提出了一种程序，可以将互联网规模预训练文档中的知识转化为数十亿组合成指令和答案训练对。由此产生的数据集称为FineInstructions，使用来自真实用户查询和提示的约1800万指令模板。这些指令模板与并实例化自无结构预训练语料库中的人类撰写的源文档。通过在如此大规模的“监督”合成训练数据上进行预训练，LLM可以从头开始仅使用指令调优目标进行预训练，这与LLM预期下游使用情况（响应用户提示）更为一致。我们进行了受控的逐词训练实验，并发现使用FineInstructions进行预训练在衡量自由形式响应质量的标准基准上优于标准预训练和其他提出的合成预训练技术。我们的资源可以在https://huggingface.co/fineinstructions 获取。

Summary / 总结

The research addresses the limited supply of supervised training data for large language models (LLMs) by proposing a method that turns internet-scale pre-training documents into billions of synthetic instruction and answer pairs. The method uses roughly 18 million instruction templates derived from real user-written queries and prompts, which are matched to and instantiated with human-written source documents. The resulting FineInstructions dataset enables pre-training LLMs from scratch solely with the instruction-tuning objective; in controlled token-for-token experiments it outperforms standard pre-training and other synthetic pre-training techniques on benchmarks measuring free-form response quality.

研究旨在通过从互联网规模的预训练文档中生成合成指令和答案对来解决大型语言模型（LLMs）监督训练数据有限的问题。方法包括使用来自真实用户查询和提示的1800万个指令模板，并将其与人类撰写的源文档匹配和实例化。生成的FineInstructions数据集使LLMs能够仅通过指令调优目标进行从头预训练，从而在标准基准测试中表现出更好的自由形式响应质量，优于标准预训练和其他合成预训练技术。

Think Twice: Branch-and-Rethink Reasoning Reward Model

Authors: Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau

First: 2025-10-27T17:58:07+00:00 · Latest: 2026-01-29T18:57:46+00:00

Comments: Source Code: https://github.com/yzjiao/BR-RM. Model Checkpoints: https://huggingface.co/nvidia/Qwen3-Nemotron-14B-BRRM and https://huggingface.co/nvidia/Qwen3-Nemotron-8B-BRRM

Abstract

Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains.
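The "simple binary outcome reward with strict format checks" mentioned above might look like the sketch below. The concrete format is invented for illustration: the `Dimensions:` prefix for Turn 1, the `Preferred: A|B` verdict line for Turn 2, and the choice to give malformed traces a reward of 0 are all assumptions, not details taken from the paper.

```python
import re

# Hypothetical verdict format: the second turn must end with this line.
VERDICT = re.compile(r"Preferred:\s*([AB])\s*$")


def binary_reward(turn1: str, turn2: str, gold: str) -> float:
    """Return 1.0 only for a well-formed trace with the correct verdict.

    turn1: adaptive-branching output, expected to list chosen dimensions.
    turn2: branch-conditioned rethinking, expected to end with a verdict.
    gold:  the reference preference, "A" or "B".
    """
    # Format check for Turn 1 (assumed convention).
    if not turn1.strip().startswith("Dimensions:"):
        return 0.0
    # Format check for Turn 2: the verdict must be the final line.
    match = VERDICT.search(turn2.strip())
    if match is None:
        return 0.0
    # Outcome check: binary agreement with the reference preference.
    return 1.0 if match.group(1) == gold else 0.0
```

Because the reward depends only on format validity and the final verdict, it can be computed without a learned critic, which is consistent with the abstract's claim of compatibility with standard RLHF pipelines.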

中文标题/摘要

标题：三思而后行：分支与重思推理奖励模型

大型语言模型（LLMs）越来越多地依赖于将中间步骤外部化并分配额外的测试时计算量的思考模型，其中思两次策略表明，经过深思熟虑的第二次思考可以激发更强的推理能力。相比之下，大多数奖励模型（RMs）仍然一次性压缩多个质量维度为单一标量，这种设计会导致判断分散：注意力在评估标准之间分散，导致分散的焦点和浅薄的分析。我们引入了分支与重思（BR-RM），这是一种两轮的奖励模型，将思两次的原则转移到奖励建模中。第一轮执行自适应分支，选择一组关键实例维度（如事实性和安全性），并勾勒出简明、证据导向的假设。第二轮执行条件分支重思，这是一种有针对性的重读，测试这些假设并仅审查最重要的内容。我们使用基于GRPO风格的强化学习训练结构化的两轮轨迹，使用简单的二元结果奖励并进行严格的格式检查，使该方法与标准的RLHF流水线兼容。通过将一次性评分转换为集中、二次审视的推理，BR-RM减少了判断分散，提高了对微妙但重要的错误的敏感性，同时保持了实用性和可扩展性。实验结果表明，我们的模型在三个跨领域挑战性的奖励建模基准上达到了最先进的性能。

Summary / 总结

The paper introduces branch-and-rethink (BR-RM), a two-turn reward model that applies the think-twice principle to reward modeling. It performs adaptive branching in the first turn to select critical dimensions and formulate concise hypotheses, followed by a targeted rethinking in the second turn to test these hypotheses. The model is trained using reinforcement learning over structured two-turn traces with a simple binary outcome reward. Experimental results show that BR-RM outperforms existing models on three challenging reward modeling benchmarks, improving sensitivity to subtle errors while maintaining practicality and scalability.

研究旨在通过引入两轮奖励模型分支-重思（BR-RM）来解决奖励模型中的判断分散问题，该模型先进行自适应分支，再进行以分支为条件的针对性重思。实验结果表明，BR-RM 在跨多个领域的三个具有挑战性的奖励建模基准测试中达到了最先进的性能，同时保持了实用性和可扩展性。

MORPH: PDE Foundation Models with Arbitrary Data Modality

Authors: Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Alexander Scheinker, Diane Oyen, Nathan Debardeleben, Earl Lawrence, Ayan Biswas

First: 2025-09-25T22:38:36+00:00 · Latest: 2026-01-29T18:57:23+00:00

Abstract

We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D--3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, (iii) axial attentions, which factorize full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters, MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.
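To see why factorizing attention along axes reduces the computational burden, it helps to count query-key pairs. The sketch below is a back-of-the-envelope cost model, not MORPH's code: it assumes one attention pass per axis and ignores heads, channels, and constant factors, and the grid size in the note that follows is an arbitrary example.

```python
from math import prod


def full_attention_pairs(axes):
    """Query-key pairs when every token attends to every other token."""
    n = prod(axes)  # total number of spatiotemporal tokens
    return n * n


def axial_attention_pairs(axes):
    """Query-key pairs when each token attends only along one axis at a time.

    For each axis of length L, every one of the n tokens attends to the
    L tokens sharing its other coordinates, giving n * L pairs per axis.
    """
    n = prod(axes)
    return n * sum(axes)
```

For a grid of 8 time steps by 32 by 32 spatial points, `full_attention_pairs((8, 32, 32))` is 67,108,864 while `axial_attention_pairs((8, 32, 32))` is 589,824, a reduction of roughly 114x under this simplified count.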

中文标题/摘要

标题：MORPH：任意数据模态的偏微分方程基础模型

我们介绍了MORPH，一种模态无关的自回归偏微分方程（PDE）基础模型。MORPH基于卷积视觉变换器骨干网络，能够无缝处理不同数据模态（1D-3D）和不同分辨率的异质时空数据集，以及具有混合标量和矢量分量的多个字段。该架构结合了(i)分量卷积，联合处理标量和矢量通道以捕捉局部交互，(ii)跨字段交叉注意力，建模并选择性地传播不同物理场之间的信息，(iii)轴向注意力，沿个体空间和时间轴分解全时空自注意力，以减少计算负担同时保留表达能力。我们使用多样化的异质PDE数据集对多个模型变体进行预训练，并评估其在一系列下游预测任务中的迁移性能。使用全模型微调和参数高效低秩适配器，MORPH优于从头训练的模型。在广泛的评估中，MORPH匹配或超越了强大的基线和最近的先进模型。这些能力共同展示了学习科学观测的异构和多模态性质的灵活而强大的基础架构，为可扩展和数据高效的科学机器学习铺平了道路。源代码、数据集和模型可在https://github.com/lanl/MORPH/公开获取。
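
The abstract's claim that axial attention reduces computational burden can be checked with simple counting. The sketch below is only illustrative arithmetic, not MORPH's implementation: the function names and the (time, height, width) token grid are assumptions made for this example.

```python
from math import prod


def full_attention_pairs(shape):
    """Token pairs scored by full self-attention over a token grid."""
    n = prod(shape)
    return n * n


def axial_attention_pairs(shape):
    """Token pairs scored when attention runs along one axis at a time."""
    n = prod(shape)
    # Along an axis of length L there are n // L independent sequences,
    # each costing L * L pairs, so every axis contributes n * L pairs.
    return sum(n * length for length in shape)


shape = (8, 32, 32)  # hypothetical (time, height, width) token grid
print(full_attention_pairs(shape))   # 67108864
print(axial_attention_pairs(shape))  # 589824
```

For this hypothetical grid the axial factorization scores roughly 114 times fewer token pairs, and the gap widens as resolution grows.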

Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data

Authors: Grzegorz Stefanski, Alberto Presta, Michal Byra

First: 2026-01-29T18:56:41+00:00 · Latest: 2026-01-29T18:56:41+00:00

Abs · PDF · Code1 · Code2

Abstract

In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantic alignment. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.

中文标题/摘要

标题：路由彩票：适应性子网络用于异构数据

在剪枝中，彩票假说认为大型网络包含稀疏的子网络，或胜出的彩票，这些子网络可以在隔离状态下训练以匹配其密集对应物的性能。然而，大多数现有方法假设所有输入共享一个通用的胜出彩票，忽略了真实数据的固有异质性。在本文中，我们提出了路由彩票（RTL），一种适应性剪枝框架，发现多个专门化的子网络，称为适应性彩票，每个子网络都针对一个类别、语义簇或环境条件进行定制。在多种数据集和任务上，RTL在平衡准确率和召回率方面始终优于单模型和多模型基线，同时使用的参数最多比独立模型少10倍，并且具有语义对齐。此外，我们识别了子网络崩溃，即在激进剪枝下性能下降，并引入了子网络相似度评分，以实现无标签诊断过度稀疏化。总体而言，我们的结果将剪枝重新定义为使模型结构与数据异质性对齐的机制，为更模块化和上下文感知的深度学习铺平了道路。
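
The idea of routing each input to its own sparse subnetwork over shared weights can be sketched in a few lines. Everything here is an assumption for illustration: the weights, the two cluster masks, and the function names are hypothetical, and the Jaccard overlap stands in for the paper's subnetwork similarity score, which the abstract does not define.

```python
def apply_ticket(weights, mask):
    """Zero out every weight that the ticket's binary mask prunes."""
    return [w * m for w, m in zip(weights, mask)]


def mask_jaccard(mask_a, mask_b):
    """Jaccard overlap of kept weights (a stand-in similarity score)."""
    kept_a = {i for i, m in enumerate(mask_a) if m}
    kept_b = {i for i, m in enumerate(mask_b) if m}
    union = kept_a | kept_b
    return len(kept_a & kept_b) / len(union) if union else 1.0


# Hypothetical shared dense weights and one binary mask per semantic cluster.
shared_weights = [0.5, -1.2, 0.8, 0.1, -0.3, 0.9]
tickets = {
    "cluster_a": [1, 1, 0, 0, 1, 0],
    "cluster_b": [0, 1, 1, 0, 0, 1],
}


def route(cluster):
    """Select the subnetwork for the cluster an input was assigned to."""
    return apply_ticket(shared_weights, tickets[cluster])


print(route("cluster_a"))
print(mask_jaccard(tickets["cluster_a"], tickets["cluster_b"]))  # 0.2
```

Because all tickets index into one weight vector, storage grows with the number of masks rather than the number of full models. Under this stand-in score, overlap near 1.0 would mean the tickets have stopped specializing.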

Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Authors: Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang

First: 2026-01-29T18:56:12+00:00 · Latest: 2026-01-29T18:56:12+00:00

Comments: The manuscript is under review

Abs · PDF · Code1 · Code2 · Code3

Abstract

Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1

中文标题/摘要

标题：边问边思：将大型语言模型从被动求解者转变为积极问询者

以推理为导向的大规模语言模型（LLMs）通过链式思考（CoT）提示取得了显著进展，但它们仍然受限于一种“盲目的自我思考”范式：即使关键信息缺失或模糊不清，它们也会进行大量内部推理。我们提出了积极互动推理（PIR），这是一种新的推理范式，将LLMs从被动求解者转变为能够边推理边求证的积极问询者。PIR 不像现有的基于搜索或工具的框架主要通过查询外部环境来解决知识不确定性，而是通过直接与用户互动来解决前提和意图层面的不确定性。PIR 通过两个核心组件实现：（1）一种基于不确定性的监督微调程序，使模型具备互动推理能力；（2）一种基于用户模拟器的策略优化框架，由复合奖励驱动，使模型行为与用户意图保持一致。在数学推理、代码生成和文档编辑方面的广泛实验表明，PIR 一贯优于强大的基线模型，准确率最高提升32.70%，通过率最高提升22.90%，BLEU最高提升41.36，同时减少了近一半的推理计算和不必要的互动轮次。进一步的可靠性评估表明，PIR 在事实知识、问答和缺失前提场景中具有强大的泛化能力和鲁棒性。模型和代码已公开：https://github.com/SUAT-AIRI/Proactive-Interactive-R1

Summary / 总结

The paper introduces Proactive Interactive Reasoning (PIR), a new paradigm for Large Language Models (LLMs) that transforms them from passive solvers to proactive inquirers. PIR interleaves reasoning with user interaction to address premise- and intent-level uncertainty. Experiments show PIR outperforms strong baselines in mathematical reasoning, code generation, and document editing, with higher accuracy, pass rates, and BLEU scores, while reducing unnecessary interactions and reasoning computation. Further evaluations confirm PIR's robustness and generalization across different scenarios.

论文提出了主动互动推理（PIR）的新范式，将推理导向的大语言模型（LLMs）从被动求解者转变为积极的问询者。PIR 将推理与用户交互相结合，以解决前提和意图层面的不确定性。实验表明，PIR 在数学推理、代码生成和文档编辑中优于强基线，具有更高的准确率和通过率，并提高了效率。进一步的评估证实了PIR 的稳健性和泛化能力。

"Not in My Backyard": LLMs Uncover Online and Offline Social Biases Against Homelessness

Authors: Jonathan A. Karr, Benjamin F. Herbst, Matthew L. Sisk, Xueyun Li, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla

First: 2025-08-14T17:58:34+00:00 · Latest: 2026-01-29T18:55:57+00:00

Abs · PDF · Code1 · Code2

Abstract

Homelessness is a persistent social challenge, impacting millions worldwide. Over 876,000 people experienced homelessness (PEH) in the U.S. in 2025. Social bias is a significant barrier to alleviation, shaping public perception and influencing policymaking. Given that online textual media and offline city council discourse reflect and influence part of public opinion, it provides valuable insights to identify and track social biases against PEH. We present a new, manually-annotated multi-domain dataset compiled from Reddit, X (formerly Twitter), news articles, and city council meeting minutes across ten U.S. cities. Our 16-category multi-label taxonomy creates a challenging long-tail classification problem: some categories appear in less than 1% of samples, while others exceed 70%. We find that small human-annotated datasets (1,702 samples) are insufficient for training effective classifiers, whether used to fine-tune encoder models or as few-shot examples for LLMs. To address this, we use GPT-4.1 to generate pseudo-labels on a larger unlabeled corpus. Training on this expanded dataset enables even small encoder models (ModernBERT, 150M parameters) to achieve 35.23 macro-F1, approaching GPT-4.1's 41.57. This demonstrates that data quantity matters more than model size, enabling low-cost, privacy-preserving deployment without relying on commercial APIs. Our results reveal that negative bias against PEH is prevalent both offline and online (especially on Reddit), with "not in my backyard" narratives showing the highest engagement. These findings uncover a type of ostracism that directly impacts poverty-reduction policymaking and provide actionable insights for practitioners addressing homelessness.

中文标题/摘要

标题："不在我的后院": 大型语言模型揭示对无家可归者的线上线下社会偏见

无家可归是一个持续的社会挑战，影响着全世界数百万人。2025年，美国有超过876,000人经历无家可归（PEH）。社会偏见是缓解这一问题的重要障碍，影响公众认知并影响政策制定。鉴于在线文本媒体和线下城市议会讨论反映了部分公众意见并对其产生影响，它们提供了识别和追踪对PEH的社会偏见的重要见解。我们提出了一项新的、人工标注的多领域数据集，该数据集来自美国十个城市的Reddit、X（原Twitter）、新闻文章和城市议会会议记录。我们的16类多标签分类体系创建了一个具有挑战性的长尾分类问题：一些类别在样本中出现的比例不到1%，而其他类别则超过70%。我们发现，小规模的人工标注数据集（1,702个样本）不足以训练有效的分类器，无论是用于微调编码器模型还是作为LLM的少量示例。为了解决这个问题，我们使用GPT-4.1在更大规模的未标注语料库上生成伪标签。在扩展数据集上进行训练使即使是小型编码器模型（ModernBERT，1.5亿参数）也能达到35.23的宏F1值，接近GPT-4.1的41.57。这表明数据量比模型规模更重要，能够实现低成本、隐私保护的部署，无需依赖商业API。我们的研究结果揭示了针对无家可归者的负面偏见在线下和线上（尤其是Reddit）都普遍存在，"不在我的后院"的叙事具有最高的参与度。这些发现揭示了一种直接对减贫政策产生影响的排斥行为，并为应对无家可归问题的从业者提供了可操作的见解。

Summary / 总结

This study investigates social biases against homelessness by analyzing online and offline data. A new dataset was created from Reddit, X, news articles, and city council meeting minutes. The dataset includes 16 categories and is challenging due to its long-tail distribution. The research found that small annotated datasets are insufficient for training effective classifiers. Instead, GPT-4.1 was used to generate pseudo-labels, which improved the performance of small models. The study revealed that negative bias against homelessness is prevalent both online and offline, with 'not in my backyard' narratives being the most engaging. This highlights the need for addressing these biases in poverty-reduction policymaking.

研究通过分析线上和线下数据，调查了对无家可归者的社会偏见。创建了一个新数据集，包含来自Reddit、X、新闻文章和城市议会会议记录的数据，共有16个类别，分布呈长尾状。研究发现，小规模标注数据集不足以训练有效的分类器。相反，使用GPT-4.1生成伪标签，提高了小型模型的性能。研究揭示了对无家可归者的负面偏见在线上线下都很普遍，特别是“不在我的后院”叙事最具吸引力。这表明需要在减少贫困的政策制定中解决这些偏见。
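
Macro-F1, the metric reported above, averages per-label F1 without weighting by frequency, which is why rare categories in a long-tail taxonomy pull it down. The sketch below is a minimal standard-library illustration, not the authors' evaluation code. The toy labels are invented, and scoring a label with no true or predicted instances as 0.0 is a convention chosen for this example.

```python
def macro_f1(y_true, y_pred, num_labels):
    """Unweighted mean of per-label F1 over multi-label predictions."""
    scores = []
    for label in range(num_labels):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # Convention for this sketch: an absent label scores 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_labels


# Label 0 is common, label 1 is rare; the classifier never predicts label 1.
y_true = [{0}, {0}, {0}, {0, 1}]
y_pred = [{0}, {0}, {0}, {0}]
print(macro_f1(y_true, y_pred, num_labels=2))  # 0.5
```

The toy classifier gets every common label right, yet one missed rare label halves the score.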

StepShield: When, Not Whether to Intervene on Rogue Agents

Authors: Gloria Felicia, Michael Eniolade, Jinfeng He, Zitha Sasindran, Hemant Kumar, Milan Hussain Angati, Sandeep Bandarupalli

First: 2026-01-29T18:55:46+00:00 · Latest: 2026-01-29T18:55:46+00:00

Comments: 16 pages, 2 figures, 14 tables

Abs · PDF · Code1 · Code2

Abstract

Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.

中文标题/摘要

标题：StepShield：何时干预，而非是否干预违规代理

现有的代理安全性基准报告二元准确性，混淆了早期干预与事后分析。在第8步检测到违规行为时可以进行干预，而在第48步报告则仅具有取证价值。这种区别至关重要，但当前的基准无法衡量这一点。我们引入了StepShield，这是第一个评估检测到违规行为的时间点，而不仅仅是是否检测到的基准。StepShield包含9,213个代码代理轨迹，包括1,278个精心标注的训练对和一个包含7,935个轨迹的测试集，其违规率为贴近现实的8.1%。违规行为基于六类真实世界的安全事件。我们提出了三个新的时间度量标准：早期干预率（EIR）、干预差距和节省的令牌数。令人惊讶的是，我们的评估表明，基于LLM的评判器实现了59%的EIR，而静态分析器仅实现了26%，标准准确性指标完全无法看到这种2.3倍的性能差距。我们进一步表明，早期检测具有直接的经济效益：我们的级联HybridGuard检测器将监控成本降低了75%，并在企业规模上预计在未来五年内累计节省1.08亿美元。通过将评估的重点从是否转移到何时，StepShield为构建更安全且更具经济效益的AI代理提供了新的基础。代码和数据在Apache 2.0许可证下发布。

Summary / 总结

StepShield evaluates the timing of violation detection in code agents, addressing the limitations of existing benchmarks that only report binary accuracy. It introduces three novel metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. The study finds that an LLM-based judge detects violations 2.3x more effectively than a static analyzer, with a 59% EIR, while also demonstrating significant economic benefits by reducing monitoring costs by 75% at scale. By focusing on when violations are detected, StepShield offers a new framework for improving agent safety and economic efficiency.

StepShield 评估代理安全中的违规检测时机，区分早期干预和事后分析。它引入了三个新型指标：早期干预率（EIR）、干预间隔和节省的标记数。研究发现，基于语言模型的裁判比静态分析器表现更好，EIR 达到 59%，而静态分析器仅为 26%，传统准确率指标未能捕捉到这一显著性能差距。早期检测还带来了显著的成本节约，级联的 HybridGuard 检测器将监控成本降低了 75%，并预计在未来五年的企业规模中可节省 1.08 亿美元。
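
The distinction between whether and when a violation is detected can be illustrated with two detectors that look identical under a binary metric. This sketch does not reproduce StepShield's metric definitions, which the abstract does not spell out. The `early_rate` and `mean_gap` fields, the `max_gap` threshold, and the step numbers are all hypothetical stand-ins.

```python
def timing_metrics(onsets, detections, max_gap):
    """Summarize detection timing over rogue trajectories.

    onsets: step at which each violation begins.
    detections: step at which the detector flags it, or None if missed.
    max_gap: assumed threshold below which a flag still allows intervention.
    """
    gaps = [d - o for o, d in zip(onsets, detections) if d is not None]
    detected = len(gaps)
    early = sum(1 for g in gaps if g <= max_gap)
    return {
        "detection_rate": detected / len(onsets),
        "early_rate": early / len(onsets),
        "mean_gap": sum(gaps) / detected if detected else None,
    }


onsets = [5, 6, 4, 7]
fast_detector = [8, 7, 6, 9]      # flags within a few steps
slow_detector = [48, 40, 35, 50]  # flags long after the fact

print(timing_metrics(onsets, fast_detector, max_gap=5))
print(timing_metrics(onsets, slow_detector, max_gap=5))
```

Both detectors reach a detection rate of 1.0, so a binary score cannot separate them. Only the timing fields show that the second detector flags violations too late to intervene.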

Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference

Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu

First: 2026-01-29T18:52:54+00:00 · Latest: 2026-01-29T18:52:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On the widely-used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.

中文标题/摘要

标题：付费获取提示，而非答案：LLM 管理以实现成本效益推理

大型语言模型（LLMs）在复杂推理任务上提供最先进的性能，但其推理成本限制了大规模部署。小型语言模型（SLMs）提供显著的成本节省，但在准确性上落后很多。现有方法——路由和级联——将LLM视为全有或全无的资源：查询要么完全绕过LLM，要么LLM以全额成本生成完整响应。我们引入了LLM 管理框架，该框架仅请求LLM提供一个短前缀（提示），并将其提供给SLM。这种简单的机制在数学和编程任务中表现出惊人的效果：即使提示仅占完整LLM响应的10-30%，也能显著提高SLM的准确性。LLM 管理既涵盖了路由和级联，又在最优决策下实现了更低的成本。我们开发了一种两阶段预测器，以共同确定是否需要提示以及请求多少令牌。在广泛使用的数学推理（GSM8K，CNK12）和代码生成（HumanEval，MBPP）基准测试中，LLM 管理将成本降低了42-94%。与最先进的路由和级联基线相比，LLM 管理在成本上最多可减少2.8倍，同时保持相同的准确性。据我们所知，这是首次利用令牌级预算控制实现SLM-LLM协作的工作。

Summary / 总结

The research aims to reduce the cost of using Large Language Models (LLMs) for complex reasoning tasks by introducing LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to Small Language Models (SLMs). This approach improves SLM accuracy significantly, even with hints comprising 10-30% of the full LLM response. On benchmarks for mathematical reasoning and code generation, Shepherding reduces costs by 42-94% compared to using LLMs alone, and it achieves up to 2.8x cost reduction while maintaining accuracy, outperforming existing routing and cascading methods.

本文提出了LLM牧羊人方法，该方法仅从大型语言模型（LLM）请求一个短前缀（提示），并将其提供给小型语言模型（SLM），以提高SLM的准确性并降低42-94%的成本，特别是在数学和编程任务上。两阶段预测器确定何时以及请求多少令牌，从而在不牺牲准确性的前提下实现最高2.8倍的成本降低。
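
The cost argument behind requesting only a hint can be made concrete with a toy model. This is a sketch under stated assumptions rather than the paper's cost model: per-token prices, token counts, and the function name are hypothetical, and the SLM is assumed to always generate its own completion after receiving the hint.

```python
def shepherd_cost(llm_response_tokens, hint_fraction, slm_tokens,
                  llm_price, slm_price):
    """Cost of one query when the LLM writes only a prefix of its answer."""
    if not 0.0 <= hint_fraction <= 1.0:
        raise ValueError("hint_fraction must be in [0, 1]")
    # Only the hint prefix is billed at the LLM's per-token price.
    hint_tokens = round(llm_response_tokens * hint_fraction)
    return hint_tokens * llm_price + slm_tokens * slm_price


# Hypothetical prices in arbitrary integer units per token.
llm_only = shepherd_cost(400, 1.0, 0, llm_price=10, slm_price=1)
slm_only = shepherd_cost(400, 0.0, 400, llm_price=10, slm_price=1)
with_hint = shepherd_cost(400, 0.2, 400, llm_price=10, slm_price=1)

print(llm_only, slm_only, with_hint)  # 4000 400 1200
```

In this toy setting a 20% hint costs 70% less than the full LLM response. A hint fraction of 0 corresponds to never calling the LLM and a fraction of 1 to paying for its complete answer, so the hint length acts as a continuous budget between those extremes.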

World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems

Authors: Lakshya Gupta, Litao Li, Yizhe Liu, Sriram Ganapathi Subramanian, Kaheer Suleman, Zichen Zhang, Haoye Lu, Sumit Pasupalak

First: 2026-01-29T18:51:54+00:00 · Latest: 2026-01-29T18:51:54+00:00

Abs · PDF · Code1 · Code2

Abstract

Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub for setting up and evaluating WoW.

中文标题/摘要

标题：工作流世界：将世界模型引入企业系统的基准

前沿大型语言模型（LLMs）在许多领域作为自主代理表现出色，但在复杂的企业系统中仍未经测试，这些系统中的隐藏工作流会在相互连接的数据库之间产生级联效应。现有的企业基准测试评估表面级别的代理任务完成，类似于通用消费者基准测试，忽略了企业中的真正挑战，如有限的可观测性、庞大的数据库状态以及隐藏的工作流及其级联副作用。我们引入了工作流世界（WoW），这是一个基于ServiceNow的现实环境，包含4000多个业务规则和55个嵌入系统中的活跃工作流，以及WoW-bench，这是一个包含234个任务的基准，评估受限的代理任务完成能力和企业动态建模能力。我们揭示了两个主要发现：（1）前沿LLMs存在动态盲点，一致地未能预测其行为的隐形级联副作用，导致无声的约束违规；（2）在不透明系统中，可靠性需要有依据的世界建模，即代理必须在高保真反馈不可用时在心中模拟隐藏状态转换以弥合可观测性差距。为了实现可靠且有用的企业代理，WoW推动了一种新的范式，即明确学习系统动力学。我们发布了用于设置和评估WoW的GitHub仓库。

Summary / 总结

The research introduces World of Workflows (WoW), a benchmark for testing large language models (LLMs) in complex enterprise systems, which are characterized by hidden workflows and cascading side effects. WoW uses a ServiceNow-based environment with 4,000+ business rules and 55 active workflows. The study reveals that frontier LLMs struggle with predicting cascading side effects, leading to silent constraint violations, and emphasizes the need for grounded world modeling to bridge the observability gap in opaque systems. The WoW-bench evaluates 234 tasks focusing on constrained agentic task completion and enterprise dynamics modeling capabilities.

研究引入了World of Workflows (WoW)，一个用于测试大型语言模型（LLMs）在复杂企业系统中的基准，这些系统具有隐藏的工作流和级联副作用。WoW 使用了一个基于 ServiceNow 的环境，包含 4,000 多个业务规则和 55 个活跃工作流。研究发现前沿的 LLM 在预测级联副作用方面存在困难，导致无声的约束违规，并强调在不透明系统中需要有依据的世界建模以弥补可观测性差距。WoW-bench 评估了 234 个任务，重点是受限的代理任务完成能力和企业动态建模能力。

SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Authors: Yifeng Ding, Lingming Zhang

First: 2026-01-29T18:50:29+00:00 · Latest: 2026-01-29T18:50:29+00:00

Abs · PDF · Code1 · Code2

Abstract

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.

中文标题/摘要

标题：SWE-Replay: 软件工程代理的高效测试时扩展

测试时扩展已被广泛采用以增强大型语言模型（LLM）代理在软件工程（SWE）任务中的能力。然而，标准方法从头开始重复采样轨迹是计算上昂贵的。虽然最近的方法试图通过使用专门的价值代理来减轻成本，但它们可能会遭受模型校准不当的问题，并且无法泛化到现代代理，这些代理会合成自定义的bash脚本作为工具。在本文中，我们介绍了SWE-Replay，这是第一个无需依赖可能噪声的价值估计即可实现现代代理的高效且可泛化的测试时扩展技术。SWE-Replay通过回收先前试验中的轨迹来优化扩展过程，动态选择从头探索或利用存档经验在关键中间步骤进行分支。这种中间步骤的选择是基于仓库探索的潜力和推理意义，而不是外部LLM的质量估计。我们的评估表明，在SWE-Bench Verified上，SWE-Replay始终优于简单的扩展，成本最多可降低17.4%，同时保持或甚至提高性能最多3.8%。进一步在SWE-Bench Pro和多语言上的评估验证了SWE-Replay的泛化能力，确立了其作为软件工程代理高效测试时扩展稳健基础的地位。

Summary / 总结

The paper introduces SWE-Replay, an efficient and generalizable test-time scaling technique for modern software engineering agents. Unlike previous methods that rely on specialized value agents, SWE-Replay recycles trajectories from prior trials and dynamically decides whether to explore from scratch or exploit archived experience at critical steps. This approach, driven by the potential and reasoning significance of repository exploration, reduces costs by up to 17.4% on SWE-Bench Verified while maintaining or improving performance by up to 3.8%. The method also generalizes well to other benchmarks, validating its robustness for efficient test-time scaling.

SWE-Replay 是一种高效的软件工程代理测试时缩放技术，避免了从头开始反复采样轨迹的高计算成本。它动态地从之前的试验中回收轨迹，在关键步骤选择探索或利用存档的经验，基于仓库探索的潜在价值和推理意义。SWE-Replay 在 SWE-Bench Verified 上优于朴素缩放，最多可减少 17.4% 的成本，同时保持或提高性能最多 3.8%。它还很好地适用于 SWE-Bench Pro 和多语言，展示了其在软件工程代理测试时缩放中的稳健性。

The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR

Authors: Irsyad Adam, Zekai Chen, David Laprade, Shaun Porwal, David Laub, Erik Reinertsen, Arda Pekis, Kevin Brown

First: 2026-01-29T18:49:37+00:00 · Latest: 2026-01-29T18:49:37+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large language models (LLMs) trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient's trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure.

中文标题/摘要

标题：患者不是移动的文档：一种用于纵向EHR的世界模型训练范式

大型语言模型（LLMs）通过下一个词预测训练，在临床基础模型方面取得了成功。这些语言骨干的表示在生物医学任务中表现出强大的线性探针性能，表明患者语义是从大规模的下一个词预测中涌现出来的。然而，这种范式将患者视为需要总结的文档，而不是需要模拟的动力系统；患者的轨迹是从其状态在干预和时间的作用下演变而来的，需要能够模拟动力学而不是预测词的模型。为了解决这个问题，我们引入了SMB-Structure，这是一种结构化EHR的世界模型，将联合嵌入预测架构（JEPA）与下一个词预测（SFT）相结合。SFT使我们的模型能够重建患者状态在词空间中的未来状态，而JEPA则仅从初始患者表示中预测这些未来状态在潜在空间中，迫使轨迹动力学在观察到下一个状态之前被编码。我们在两个大规模队列中进行了验证：纪念斯隆凯特林（23,319名肿瘤患者；323,000多个患者年）和INSPECT（19,402名肺栓塞患者）。使用沿疾病轨迹多个点评估的线性探针，我们证明我们的训练范式学习到的嵌入捕捉到了自回归基线无法恢复的疾病动力学，使SMB-Structure能够在高患者异质性特征的复杂任务上实现竞争力的性能。模型权重可在https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure 获取。

Summary / 总结

This study addresses the limitation of treating patients as static documents by proposing SMB-Structure, a world model for structured Electronic Health Records (EHR). It combines next-token prediction (SFT) and a joint-embedding prediction architecture (JEPA) to simulate patient dynamics. The model is validated on two large cohorts, showing that it learns embeddings capturing disease dynamics better than autoregressive models, especially for tasks with high patient heterogeneity.

该研究针对将患者视为静态文档的局限性，提出了SMB-Structure，一种结构化电子健康记录的世界模型。该模型结合了下一词元预测（SFT）和联合嵌入预测架构（JEPA），以模拟患者动态。该模型在两个大型队列中得到了验证，表明其学习到的嵌入比自回归模型更好地捕捉疾病动态，尤其是在高患者异质性任务中表现出色。

EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers

Authors: John Flynn, Wolfgang Paier, Dimitar Dinev, Sam Nhut Nguyen, Hayk Poghosyan, Manuel Toribio, Sandipan Banerjee, Guy Gafni

First: 2026-01-29T18:49:27+00:00 · Latest: 2026-01-29T18:49:27+00:00

Comments: Project page: https://edit-yourself.github.io/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.

中文标题/摘要

标题：EditYourself：基于扩散Transformer的音频驱动说话人头像视频生成与操控

当前的生成视频模型在从文本和图像提示生成新颖内容方面表现出色，但在编辑现有的预录制视频方面存在关键缺口，其中对讲话脚本的微小修改需要保留动作、时间连贯性、说话人身份和准确的唇部同步。我们介绍了EditYourself，一种基于DiT的音频驱动视频到视频（V2V）编辑框架，该框架能够基于脚本修改对话头视频，包括无缝添加、删除和调整视觉讲话内容的时间。基于通用视频扩散模型，EditYourself通过音频条件和区域感知、编辑导向的训练扩展增强了其V2V能力。这使得通过时空修补技术精确地实现唇部同步和时间连贯的重新结构化现有表演成为可能，包括在新添加的段落中合成现实的人体运动，同时保持视觉保真度和身份一致性。这项工作代表了生成视频模型作为专业视频后期制作实用工具的基础性一步。

Summary / 总结

EditYourself is a DiT-based framework designed for audio-driven editing of talking head videos, enabling modifications such as adding, removing, or retiming spoken content while preserving lip synchronization and temporal coherence. The system uses a video diffusion model with audio conditioning and edit-focused training to maintain visual fidelity and speaker identity over long durations, representing a step towards practical generative video tools for professional video post-production.

EditYourself 是一个基于 DiT 的框架，用于音频驱动的对话头视频编辑，支持添加、删除或调整口语内容，同时保持唇部同步和时间连贯性。该系统利用带有音频条件和编辑重点训练的视频扩散模型，以保持长时间内的视觉保真度和说话人身份一致性，代表了向实用生成视频工具迈进的一步，适用于专业视频后期制作。

A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine

Authors: Anran Li, Yuanyuan Chen, Wenjun Long, Yu Yin, Yan Hu, Hyunjae Kim, Weipeng Zhou, Yujia Zhou, Hongyi Peng, Yang Ren, Xuguang Ai, Zhenyue Qin, Ming Hu, Xiaoxiao Li, Han Yu, Yih-Chung Tham, Lucila Ohno-Machado, Hua Xu, Qingyu Chen

First: 2026-01-29T18:48:21+00:00 · Latest: 2026-01-29T18:48:21+00:00

Comments: 38 pages, 9 tables, 3 figures

Abs · PDF · Code1 · Code2

Abstract

Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT models and with LLaMA-3, DeepSeek-R1, and GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.

中文标题/摘要

标题：一种用于医学领域大型语言模型训练的联邦和参数高效框架

大型语言模型（LLMs）在医学基准测试中表现出强大的性能，包括问答和诊断。为了在临床环境中使用这些模型，LLMs通常通过继续预训练或使用临床数据进行后训练进一步适应。然而，大多数医学LLMs仅在单一机构的数据上进行训练，这在通用性和异构系统中的安全性方面存在局限性。联邦学习（FL）是一种跨医疗机构进行协作模型开发的有前途的解决方案。然而，将FL应用于医学中的LLMs仍然存在根本限制。首先，传统的FL要求在每次通信轮次中传输完整的模型，对于多十亿参数的LLMs来说，这在计算资源有限的情况下变得不切实际。其次，许多FL算法隐含地假设数据同质性，而现实世界的临床数据在患者、疾病和机构实践方面高度异质。我们提出了一个适用于医学应用的模型通用且参数高效的联邦学习框架。Fed-MedLoRA仅传输低秩适配器参数，减少通信和计算开销，而Fed-MedLoRA+进一步结合了适应性和数据感知聚合，以在跨站点异质性下提高收敛性。我们应用该框架到临床信息提取（IE），将患者叙述转换为结构化的医学实体和关系。通过与BERT模型、LLaMA-3和DeepSeek-R1以及GPT-4o模型的比较，评估了准确性，并在五个患者队列中进行了评估。评估设置包括（1）领域内训练和测试，（2）独立队列的外部验证，以及（3）使用耶鲁纽黑文健康系统的真实世界临床笔记进行的低资源新站点适应场景。

Summary / 总结

This paper addresses the challenge of adapting large language models (LLMs) to medical applications with federated learning (FL), where full-model transmission is impractical and clinical data are heterogeneous across sites. The authors introduce Fed-MedLoRA, which transmits only low-rank adapter parameters to reduce communication and computation overhead, and Fed-MedLoRA+, which adds adaptive, data-aware aggregation. The framework is applied to clinical information extraction, with accuracy assessed across five patient cohorts against BERT models, LLaMA-3, DeepSeek-R1, and GPT-4o.

该论文旨在通过联邦学习（FL）使大型语言模型（LLMs）适应医疗应用，解决全模型传输不切实际以及跨站点临床数据异质性的问题。作者提出了Fed-MedLoRA，仅传输低秩适配器参数以减少通信和计算开销；Fed-MedLoRA+进一步引入自适应、数据感知的聚合。该框架应用于临床信息提取任务，并在五个患者队列上与BERT模型、LLaMA-3、DeepSeek-R1和GPT-4o进行了准确性比较，评估设置包括低资源新站点适应场景。
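
The communication saving from sending only low-rank adapters follows from a parameter count: a d_out x d_in weight update is replaced by two factors of rank r. The sketch below shows that arithmetic plus a plain size-weighted average in the style of FedAvg. It is an illustration under assumptions, not the Fed-MedLoRA or Fed-MedLoRA+ aggregation rule; the layer size, rank, and client values are hypothetical.

```python
def full_matrix_params(d_out, d_in):
    """Parameters sent per layer when the full weight matrix is transmitted."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Parameters sent per layer for factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def weighted_average(client_params, client_weights):
    """Size-weighted mean of flat parameter lists, one list per client."""
    total = sum(client_weights)
    length = len(client_params[0])
    return [
        sum(w * p[i] for p, w in zip(client_params, client_weights)) / total
        for i in range(length)
    ]


# Hypothetical 4096 x 4096 projection with rank-16 adapters.
print(full_matrix_params(4096, 4096))  # 16777216
print(lora_params(4096, 4096, 16))     # 131072

# Two hypothetical sites holding 1 and 3 units of data.
print(weighted_average([[1.0, 2.0], [3.0, 6.0]], [1, 3]))  # [2.5, 5.0]
```

For this layer the adapter payload is 128 times smaller than the full matrix. Averaging the two factors separately is a simplification, since the mean of the factors is generally not the factorization of the mean update.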

SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence

Authors: Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Soroosh Noorzad, Morteza Fayazi

First: 2026-01-29T18:41:52+00:00 · Latest: 2026-01-29T18:41:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.

中文标题/摘要

标题：SINA：使用人工智能的电路原理图图像到网表生成器

当前将电路原理图图像转换为机器可读网表的方法在组件识别和连接推理方面存在困难。在本文中，我们介绍了SINA，这是一个开源的全自动电路原理图图像到网表生成器。SINA结合了深度学习进行准确的组件检测、连通组件标记（CCL）进行精确的连接提取以及光学字符识别（OCR）进行组件参考标识符检索，同时采用视觉语言模型（VLM）进行可靠的参考标识符分配。在我们的实验中，SINA的整体网表生成准确率为96.47%，比最先进的方法高2.72倍。

Summary / 总结

SINA is an open-source tool that uses artificial intelligence to convert circuit schematic images into machine-readable netlists. It combines deep learning for component detection, CCL for connectivity extraction, OCR for reference designator retrieval, and a VLM for reference designator assignments. The experiments show that SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72 times higher than existing methods.

SINA 是一种自动化的电路原理图图像到网表生成器，它利用深度学习进行组件检测、CCL 进行连接提取、OCR 进行参考标识符检索，以及 VLM 进行参考标识符分配。它实现了 96.47% 的整体网表生成准确率，比现有方法高 2.72 倍。
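
The connectivity-extraction step can be illustrated with a minimal sketch of Connected-Component Labeling on a binary wire mask. This is not SINA's implementation: the 4-connectivity choice, the `label_components` helper, and the tiny `wires` grid are assumptions made for illustration. Pixels that share a label would belong to the same electrical net.

```python
from collections import deque

def label_components(grid):
    """Label 4-connected regions of 1-pixels; returns (labels, count)."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels that already have a label.
            if grid[r][c] != 1 or labels[r][c] != 0:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Breadth-first flood fill from the seed pixel.
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

# Hypothetical wire mask: two separate wire segments.
wires = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
]
labels, count = label_components(wires)
print(count)   # number of distinct nets found
print(labels)
```

In a full pipeline, detected component bounding boxes would typically be masked out before labeling so that only wires remain; that step is omitted here.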

Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator

Authors: Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song

Venue: ICLR 2026

First: 2025-05-25T17:25:23+00:00 · Latest: 2026-01-29T18:38:43+00:00

Comments: Accepted by ICLR 2026

Abs · PDF · Code1 · Code2

Abstract

Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human and synthetic data to train highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs.

中文标题/摘要

标题：跨领域评估文本创造力：一个数据集和大型语言模型评估器

大型语言模型（LLMs）的创造力评估仍然是一个具有挑战性的前沿领域。当前的评估主要依赖于低效且昂贵的人类判断，阻碍了增强机器创造力的进步。虽然存在自动方法，包括心理测试到启发式或提示基的方法，但它们往往缺乏普适性或与人类判断的对齐。为了解决这些问题，我们提出了一种新的成对比较框架，用于评估文本创造力，该框架利用共享的上下文指令以提高评估一致性。我们引入了CreataSet，这是一个大规模数据集，包含10万多个与人类水平相当和100万多个合成的创造性指令-响应对，覆盖了多种开放域任务。通过在CreataSet上训练，我们开发了一个基于LLM的评估器CrEval。CrEval在与人类判断的对齐方面表现出显著的优势。实验结果强调了整合人类和合成数据以训练高度稳健的评估器的不可或缺的重要性，并展示了CrEval在提升LLM创造力方面的实际用途。

Summary / 总结

The paper addresses the challenge of evaluating text creativity for large language models (LLMs) by proposing a novel pairwise-comparison framework using a large-scale dataset called CreataSet. This dataset includes 100K+ human-level and 1M+ synthetic creative instruction-response pairs across various domains. The authors develop an LLM-based evaluator named CrEval, which shows superior alignment with human judgments compared to existing methods. The experiments highlight the importance of using both human and synthetic data for training robust evaluators and demonstrate CrEval's practical utility in enhancing LLM creativity.

论文提出了一种新的成对比较框架，利用大规模数据集CreataSet评估文本创造力。基于该数据集训练的LLM评估器CrEval在与人类判断的对齐方面优于现有方法。主要发现是，结合人类和合成数据对于训练高度稳健的评估器至关重要，并且CrEval可用于提升LLM的创造力。

Value-Based Pre-Training with Downstream Feedback

Authors: Shuqi Ke, Giulia Fanti

First: 2026-01-29T18:38:09+00:00 · Latest: 2026-01-29T18:38:09+00:00

Abs · PDF · Code1 · Code2

Abstract

Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B--7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, we improve the state-of-the-art results on ADE20K by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.

中文标题/摘要

标题：基于价值的预训练与下游反馈

少量经过验证的目标信息能否引导基础模型昂贵的自监督预训练？标准预训练优化固定代理目标（例如，下一个词预测），这可能导致计算资源偏离我们所关注的下游能力。我们提出了V-预训练：一种基于价值、与模态无关的受控持续预训练方法，由轻量级任务设计师重塑预训练任务，以最大化每个梯度步的价值。例如，考虑样本增强的自监督学习（SSL）。V-预训练任务设计师选择预训练任务（例如，增强方法），使得预训练损失梯度与下游任务（例如，图像分割）的梯度对齐。这有助于引导预训练朝着相关的下游能力发展。值得注意的是，预训练模型从未在下游任务标签上更新；它们仅用于塑造预训练任务。在匹配学习者更新预算下，使用仅12%的GSM8K训练示例作为反馈，V-预训练0.5B-7B语言模型在GSM8K测试Pass@1上相对改进了高达18%，优于标准的下一个词预测。在视觉SSL中，我们将ADE20K的最新结果提升了最多1.07 mIoU，同时减少NYUv2 RMSE并提高ImageNet线性精度，我们还提供了持续预训练中标记效率得到改进的初步证据。

Summary / 总结

The research addresses the misallocation of compute during self-supervised pretraining of foundation models. It introduces V-Pretraining, in which a lightweight task designer reshapes the pretraining task so that each gradient step is aligned with a downstream objective, steering pretraining toward relevant capabilities without ever updating the model on downstream labels. Using only 12% of GSM8K training examples as feedback, V-Pretraining improves GSM8K test Pass@1 by up to 18% relative for 0.5B-7B language models. In vision SSL, it improves ADE20K by up to 1.07 mIoU, reduces NYUv2 RMSE, and improves ImageNet linear accuracy; the authors also report pilot evidence of better token efficiency in continued pretraining.

研究旨在解决自监督预训练过程中计算资源分配不当的问题。V-预训练方法通过轻量级任务设计师重新塑造预训练任务，使其更符合下游任务的价值，从而引导预训练向相关能力靠拢。实验表明，使用仅12%的GSM8K训练示例作为反馈，V-预训练可将0.5B-7B语言模型的GSM8K测试Pass@1相对提高最多18%，并在视觉自监督学习中提升ADE20K和NYUv2的结果，同时提高ImageNet线性准确率；作者还给出了持续预训练中标记效率提升的初步证据。
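
The gradient-alignment idea in the abstract can be sketched as follows. This is a simplified illustration, not the paper's task designer: cosine similarity as the alignment score, the `select_task` helper, and the three-dimensional toy gradients are all assumptions. The sketch picks the candidate pretraining task whose loss gradient points most nearly in the same direction as a gradient computed on a downstream task.

```python
import math

def cosine(u, v):
    """Cosine similarity between two gradient vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def select_task(candidate_grads, downstream_grad):
    """Return the candidate whose gradient best aligns with the downstream gradient."""
    scores = {name: cosine(g, downstream_grad) for name, g in candidate_grads.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical flattened gradients of the pretraining loss under each augmentation.
candidate_grads = {
    "crop": [1.0, 0.0, 0.5],
    "color_jitter": [-1.0, 0.2, 0.0],
    "blur": [0.2, 1.0, 0.1],
}
# Hypothetical gradient of a downstream loss (e.g. segmentation) on a small labeled batch.
downstream_grad = [0.9, 0.1, 0.4]

best, scores = select_task(candidate_grads, downstream_grad)
print(best, scores)
```

Note that the downstream gradient is used only to score candidate tasks; the model parameters in this sketch would still be updated with the pretraining loss alone, which matches the abstract's statement that downstream labels only shape the pretraining task.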

ECO: Quantized Training without Full-Precision Master Weights

Authors: Mahdi Nikdan, Amir Zandieh, Dan Alistarh, Vahab Mirrokni

First: 2026-01-29T18:35:01+00:00 · Latest: 2026-01-29T18:35:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as $\textit{master weights}$. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.

中文标题/摘要

标题：ECO：无需全精度主权重的量化训练

量化显著提高了大型语言模型（LLM）训练的计算和内存效率。然而，现有方法仍然依赖于在高精度下累积更新：具体来说，梯度更新必须应用于高精度权重缓冲区，称为“主权重”。该缓冲区引入了大量内存开销，特别是在稀疏混合专家（SMoE）模型中，模型参数和优化器状态占内存使用的主导地位。为了解决这个问题，我们引入了误差补偿优化器（ECO），它通过直接对量化参数应用更新来消除主权重。ECO 在每一步后量化权重，并仔细将由此产生的量化误差注入优化器动量中，形成一个无需额外内存的误差反馈循环。我们证明，在标准假设和衰减的学习率下，ECO 收敛到最优解的恒定半径邻域，而简单的主权重移除可能会导致与学习率成反比的误差。我们展示了针对小型Transformer（30-800M）、Gemma-3 1B 模型和使用FP8量化的2.1B参数稀疏MoE模型的预训练结果，以及在INT4精度下对DeepSeek-MoE-16B的微调结果。在所有实验中，ECO 与使用主权重的基线相当，精度几乎无损，显著地改变了静态内存与验证损失的帕累托前沿。

Summary / 总结

The paper addresses the memory overhead of high-precision master weights in quantized training of Large Language Models (LLMs). It introduces the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. Under standard assumptions and a decaying learning rate, ECO provably converges to a constant-radius neighborhood of the optimum. Across FP8 pretraining and INT4 fine-tuning experiments, ECO matches master-weight baselines up to near-lossless accuracy while reducing static memory.

研究旨在通过消除高精度主权重来提高大型语言模型训练的内存效率。误差补偿优化器（ECO）直接对量化参数进行更新，并将量化误差注入优化器动量中。实验表明，ECO与使用主权重的基线相比精度几乎无损，同时显著减少了静态内存使用量。
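
To give a rough sense of the overhead that master weights add, the following back-of-the-envelope sketch estimates weight memory for the 2.1B-parameter model mentioned in the abstract. The byte widths are assumptions: FP32 (4 bytes) for the master copy and FP8 (1 byte) for the working weights. Optimizer states and activations are excluded, and the `weight_memory_gb` helper is hypothetical.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to store one copy of the weights, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

num_params = 2.1e9  # 2.1B-parameter model, as in the abstract

fp8_weights_gb = weight_memory_gb(num_params, 1)     # assumed FP8 working copy
master_weights_gb = weight_memory_gb(num_params, 4)  # assumed FP32 master copy

print(f"FP8 weights:    {fp8_weights_gb:.1f} GB")
print(f"Master weights: {master_weights_gb:.1f} GB")
print(f"Master copy is {master_weights_gb / fp8_weights_gb:.0f}x the size of the FP8 weights")
```

Under these assumptions the master copy alone is several times larger than the quantized weights it shadows, which is the static-memory cost that removing master weights avoids.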

RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation

Authors: Hanzhuo Huang, Qingyang Bao, Zekai Gu, Zhongshuo Du, Cheng Lin, Yuan Liu, Sibei Yang

Venue: ICLR 2026

First: 2026-01-29T18:30:10+00:00 · Latest: 2026-01-29T18:30:10+00:00

Comments: ICLR 2026. Project page: https://judgementh.github.io/RefAny3D Codes: https://github.com/JudgementH/RefAny3D

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.

中文标题/摘要

标题：RefAny3D：基于3D资产的扩散模型图像生成

在本文中，我们提出了一种基于3D资产的扩散模型，探索如何将3D资产整合到图像扩散模型中。现有的基于参考的图像生成方法利用大规模预训练的扩散模型，并展示了在单张参考图像条件下生成多样化图像的强大能力。然而，这些方法仅限于单张图像参考，无法利用3D资产，限制了其实用的灵活性。为了解决这一差距，我们提出了一种跨域扩散模型，具有双分支感知机制，利用多视角RGB图像和3D资产的点图来联合建模它们的颜色和标准空间坐标，实现了生成图像与3D参考之间的精确一致性。我们的空间对齐双分支生成架构和域解耦生成机制确保了同时生成两个空间对齐但内容分离的输出，RGB图像和点图，将2D图像属性与3D资产属性联系起来。实验表明，我们的方法有效地利用3D资产作为参考，生成与给定资产一致的图像，为将扩散模型与3D内容创作相结合开辟了新的可能性。

Summary / 总结

This paper introduces RefAny3D, a 3D asset-referenced diffusion model for image generation. It addresses the limitation of existing methods that can only use single-image references by integrating multi-view RGB images and point maps of 3D assets. The model achieves precise consistency between generated images and 3D references through a spatially aligned dual-branch generation architecture and a domain-decoupled generation mechanism. Experiments demonstrate that RefAny3D effectively utilizes 3D assets as references to produce images consistent with the given assets, enhancing the practical versatility of image generation models.

本文提出了RefAny3D，一种基于3D资产的扩散模型，旨在通过整合3D资产来提升图像生成能力。该模型解决了现有方法依赖单一图像参考且无法利用3D资产的局限性。模型采用双分支感知机制，结合多视角RGB图像和3D资产的点图，生成与3D参考精确一致的图像。实验表明，RefAny3D能够有效利用3D资产作为参考，拓展了扩散模型在3D内容创作中的应用潜力。

NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment

Authors: Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot

First: 2025-06-10T22:30:53+00:00 · Latest: 2026-01-29T18:24:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) can generate plausible free text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, pinpointing a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free text self-explanation by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, we develop a linear faithfulness probe based on NeuroFaith to detect unfaithful self-explanations from representation space and improve faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free text self-explanations, addressing critical needs for trustworthy AI systems.

中文标题/摘要

标题：NeuroFaith：通过内部表示一致性评估LLM自我解释的忠实性

大型语言模型（LLMs）可以生成可信的自然语言自我解释来证明其答案。然而，这些自然语言解释可能并不准确反映模型的实际推理过程，揭示了忠实性不足的问题。现有的忠实性评估方法主要依赖于行为测试或计算块分析，而不检查内部神经表示的语义内容。本文提出了一种名为NeuroFaith的灵活框架，通过识别解释中的关键概念并机械性地测试这些概念是否实际上影响模型的预测来衡量LLM自由文本自我解释的忠实性。我们展示了NeuroFaith在2跳推理和分类任务中的灵活性。此外，我们基于NeuroFaith开发了一种线性忠实性探针，用于从表示空间检测不忠实的自我解释，并通过引导提高忠实性。NeuroFaith提供了一种评估和增强LLM自由文本自我解释忠实性的原则性方法，满足了对可信赖AI系统的关键需求。

Summary / 总结

The research aims to evaluate the faithfulness of LLM self-explanations by aligning internal representations. The method involves identifying key concepts in the explanations and testing whether these concepts influence the model's predictions. Key findings include the framework's versatility across different tasks and the development of a linear faithfulness probe to detect and improve unfaithful self-explanations.

研究旨在通过内部表示的对齐来评估LLM自我解释的忠实性。方法是识别解释中的关键概念，并测试这些概念是否影响模型的预测。主要发现包括该框架在不同任务中的灵活性以及开发了一种线性忠实性探针来检测和改进不忠实的自我解释。

Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation

Authors: Saurabh Jha, Rohan Arora, Bhavya, Noah Zheutlin, Paulina Toro Isaza, Laura Shwartz, Yu Deng, Daby Sow, Ruchi Mahindru, Ruchir Puri

First: 2026-01-25T17:27:19+00:00 · Latest: 2026-01-29T18:18:39+00:00

Abs · PDF · Code1 · Code2

Abstract

LLM agents excel when environments are mostly static and the needed information fits in a model's context window, but they often fail in open-ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co-vary, and the importance of a fact may only become clear after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct-style agents are especially brittle in this regime. Their retrieve-summarize-reason loop makes conclusions sensitive to exploration order and introduces run-to-run non-determinism, producing a reliability gap where Pass-at-k may be high but Majority-at-k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses cannot be autonomously checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context. We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including a 7x average gain in Majority-at-k entity F1.

中文标题/摘要

标题：局部思考，全局解释：通过局部推理和信念传播的图引导LLM调查

当环境基本静态且所需信息可以容纳在模型的上下文窗口中时，LLM代理表现出色，但在需要从大量异构操作数据中迭代挖掘证据来构建解释的开放式调查中，它们经常失败。这些调查表现出隐藏的依赖结构：实体相互作用，信号共变，一个事实的重要性可能在发现其他证据后才变得清晰。由于上下文窗口是有限的，代理必须在已知其重要性之前总结中间发现，增加了丢弃关键证据的风险。ReAct风格的代理在这种情况下尤其脆弱。它们的检索-总结-推理循环使得结论对探索顺序敏感，并引入了运行间非确定性，导致可靠性差距，即Pass-at-k可能很高，但Majority-at-k仍然较低。简单地增加采样次数或生成更长的推理轨迹并不能可靠地稳定结果，因为随着新证据的到来，假设无法自主验证，也没有明确的机制来记录和修订信念。此外，ReAct将语义推理与控制器职责（如工具编排和状态跟踪）纠缠在一起，执行错误和计划漂移会降低推理质量，同时消耗稀缺的上下文。我们通过将调查形式化为依赖图上的溯因推理，并提出 EoG（图上的解释）这一分解式框架来解决这些问题。在该框架中，LLM执行有界的局部证据挖掘和标记（原因 vs 症状），而确定性控制器管理遍历、状态和信念传播以计算最小的解释前沿。在代表性的ITBench诊断任务中，EoG在准确性和运行间一致性方面都优于ReAct基线，其中Majority-at-k实体F1平均提升7倍。

Summary / 总结

The paper addresses the limitations of LLM agents in open-ended investigations by proposing EoG (Explanations over Graphs), which disaggregates the reasoning process. EoG uses a dependency graph to guide local reasoning and belief propagation, allowing LLMs to mine and label evidence while a deterministic controller manages traversal and state updates. On an ITBench diagnostics task, EoG outperforms ReAct baselines, achieving a 7x improvement in Majority-at-k entity F1.

该论文针对大型语言模型（LLMs）在需要迭代证据挖掘的开放性调查中的局限性，提出了EoG（基于图的解释）框架，该框架将语义推理与控制器职责分离，使用局部推理和信念传播。实验结果显示，EoG在ITBench诊断任务上的准确性和运行一致性都优于ReAct基线，平均提高了7倍的Majority-at-k实体F1分数。
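
The reliability gap between Pass-at-k and Majority-at-k can be made concrete with a small sketch. The abstract does not define the two metrics, so the definitions below are assumptions: a task passes if any of its k rollouts is correct, and it earns a majority only if more than half of them are. The paper reports Majority-at-k over entity F1; this sketch uses a simplified boolean correct/incorrect outcome, and the incident names and outcomes are invented.

```python
def pass_at_k(runs):
    """True if at least one of the k rollouts is correct."""
    return any(runs)

def majority_at_k(runs):
    """True if strictly more than half of the k rollouts are correct."""
    return sum(runs) > len(runs) / 2

# Hypothetical outcomes of k = 5 rollouts on three diagnostic tasks.
tasks = {
    "incident_a": [True, False, False, False, True],
    "incident_b": [True, True, True, False, True],
    "incident_c": [False, False, False, False, False],
}

pass_rate = sum(pass_at_k(r) for r in tasks.values()) / len(tasks)
majority_rate = sum(majority_at_k(r) for r in tasks.values()) / len(tasks)

print(f"Pass-at-5:     {pass_rate:.2f}")
print(f"Majority-at-5: {majority_rate:.2f}")
```

Here `incident_a` counts toward Pass-at-k because two rollouts happened to succeed, but not toward Majority-at-k, which is the kind of run-to-run inconsistency the abstract attributes to ReAct-style agents.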

Where Do the Joules Go? Diagnosing Inference Energy Consumption

Authors: Jae-Won Chung, Ruofan Wu, Jeff J. Ma, Mosharaf Chowdhury

First: 2026-01-29T18:16:45+00:00 · Latest: 2026-01-29T18:16:45+00:00

Comments: The ML.ENERGY Leaderboard v3.0 is open https://ml.energy/leaderboard

Abs · PDF · Code1 · Code2

Abstract

Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25$\times$ energy differences, video generation sometimes consumes more than 100$\times$ the energy of images, and GPU utilization differences can result in 3--5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.

中文标题/摘要

标题：焦耳去向何方？诊断推理能耗

能源现在是关键的ML计算资源。虽然测量能耗和观察趋势是一个有价值的初步步骤，但准确地理解并诊断这些差异发生的原因对于优化至关重要。为此，我们首先通过在NVIDIA H100和B200 GPU上对46个模型、7个任务和1,858种不同配置进行大规模测量研究，展示了生成AI领域的推理时间和能耗情况。我们的实证发现涵盖了数量级的差异：LLM任务类型可能导致25倍的能耗差异，视频生成有时消耗超过100倍的图像能耗，而GPU利用率差异可能导致3到5倍的能耗差异。基于我们的观察，我们提出了一种关于时间与能耗消耗背后机制的推理框架。核心在于时间与能耗由潜在指标如内存和利用率决定，而这些指标又受到算法、软件和硬件各层因素的影响。我们的框架还直接扩展到每瓦吞吐量，这是受电力限制的数据中心的关键指标。

Summary / 总结

This study investigates inference time and energy consumption across the generative AI landscape, measuring 46 models, 7 tasks, and 1,858 configurations on NVIDIA H100 and B200 GPUs. Key findings include order-of-magnitude variations: LLM task type can lead to 25x energy differences, video generation sometimes consumes more than 100x the energy of image generation, and GPU utilization differences can result in 3-5x energy differences. The research proposes a framework for reasoning about the mechanisms that govern time and energy consumption, which also extends to throughput per watt, a critical metric for power-constrained datacenters.

该研究旨在理解影响生成AI模型推理能耗的因素。通过在NVIDIA H100和B200 GPU上对46个模型、7个任务和1,858种配置进行测量，研究人员发现了数量级的能耗差异：LLM任务类型可导致25倍的能耗差异，视频生成的能耗有时超过图像生成的100倍，GPU利用率差异可导致3到5倍的能耗差异。研究提出了一个框架，将时间和能耗与内存和利用率等底层指标联系起来，这些指标受算法、软件和硬件层面上多种因素的影响，并将此框架扩展到每瓦吞吐量，这是对功率受限数据中心至关重要的一个指标。
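
The relationship between energy per token and throughput per watt can be shown with a short worked example. The numbers below (600 W average power, a 10-second run, 12,000 generated tokens) are hypothetical and not taken from the paper, and both helper functions are illustrative.

```python
def energy_per_token_j(avg_power_w, duration_s, tokens):
    """Joules consumed per generated token: (power * time) / tokens."""
    return avg_power_w * duration_s / tokens

def throughput_per_watt(tokens, duration_s, avg_power_w):
    """Tokens per second per watt, which is the same unit as tokens per joule."""
    return (tokens / duration_s) / avg_power_w

avg_power_w = 600.0  # hypothetical average GPU power draw
duration_s = 10.0    # hypothetical wall-clock time of the run
tokens = 12_000      # hypothetical number of generated tokens

joules_per_token = energy_per_token_j(avg_power_w, duration_s, tokens)
tokens_per_joule = throughput_per_watt(tokens, duration_s, avg_power_w)

print(f"Energy per token:    {joules_per_token} J")
print(f"Throughput per watt: {tokens_per_joule} tokens/s/W")
```

The two quantities are reciprocals of each other, so any factor that lowers energy per token (for example, higher GPU utilization) raises throughput per watt by the same ratio.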

PRISM: A Framework Harnessing Unsupervised Visual Representations and Textual Prompts for Explainable MACE Survival Prediction from Cardiac Cine MRI

Authors: Haoyang Su, Jin-Yi Xiang, Shaohao Rui, Yifan Gao, Xingyu Chen, Tingxuan Yin, Shaoting Zhang, Xiaosong Wang, Lian-Ming Wu

First: 2025-08-26T17:23:43+00:00 · Latest: 2026-01-29T18:11:17+00:00

Abs · PDF · Code1 · Code2

Abstract

Accurate prediction of major adverse cardiac events (MACE) remains a central challenge in cardiovascular prognosis. We present PRISM (Prompt-guided Representation Integration for Survival Modeling), a self-supervised framework that integrates visual representations from non-contrast cardiac cine magnetic resonance imaging with structured electronic health records (EHRs) for survival analysis. PRISM extracts temporally synchronized imaging features through motion-aware multi-view distillation and modulates them using medically informed textual prompts to enable fine-grained risk prediction. Across four independent clinical cohorts, PRISM consistently surpasses classical survival prediction models and state-of-the-art (SOTA) deep learning baselines under internal and external validation. Further clinical findings demonstrate that the combined imaging and EHR representations derived from PRISM provide valuable insights into cardiac risk across diverse cohorts. Three distinct imaging signatures associated with elevated MACE risk are uncovered, including lateral wall dyssynchrony, inferior wall hypersensitivity, and anterior elevated focus during diastole. Prompt-guided attribution further identifies hypertension, diabetes, and smoking as dominant contributors among clinical and physiological EHR factors.

中文标题/摘要

标题：PRISM：利用无监督视觉表示和文本提示、基于心脏电影MRI进行可解释MACE生存预测的框架

准确预测主要不良心脏事件（MACE）仍然是心血管预后中的一个核心挑战。我们提出了PRISM（提示引导的表示集成生存建模），这是一种自监督框架，将非对比心脏cine磁共振成像的视觉表示与结构化的电子健康记录（EHRs）结合用于生存分析。PRISM通过运动感知的多视图蒸馏提取时间同步的成像特征，并使用医学上知情的文本提示对其进行调节，以实现细粒度的风险预测。在四个独立的临床队列中，PRISM在内部和外部验证中始终优于经典的生存预测模型和最先进的（SOTA）深度学习基线。进一步的临床发现表明，PRISM提取的成像和EHR表示为不同队列中的心脏风险提供了有价值的见解。发现了三种与MACE风险增加相关的不同成像特征，包括侧壁异步、下壁高敏感性和舒张期前壁升高焦点。提示引导的归因进一步确定了高血压、糖尿病和吸烟是临床和生理EHR因素中的主要贡献者。

Summary / 总结

PRISM is a self-supervised framework that integrates visual representations from cardiac cine MRI with EHRs for MACE prediction. It uses motion-aware multi-view distillation and textual prompts to extract synchronized imaging features and modulate them for fine-grained risk prediction. PRISM outperforms classical and SOTA models in four clinical cohorts, providing valuable insights into cardiac risk with three imaging signatures and identifying key EHR factors such as hypertension, diabetes, and smoking.

PRISM 是一个自监督框架，结合心脏 cine MRI 的视觉特征和 EHRs 来预测 MACE 生存率。它使用运动感知的多视图蒸馏和文本提示来提取和调节影像特征，性能优于传统和深度学习模型。关键发现包括与 MACE 风险相关的三种影像特征以及高血压、糖尿病和吸烟作为主要临床因素对心脏风险的贡献。
