RedSage: A Cybersecurity Generalist LLM
Authors: Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani
Venue: ICLR 2026
First: 2026-01-29T18:59:57+00:00 · Latest: 2026-01-29T18:59:57+00:00
Comments: Accepted at ICLR 2026; Project page: https://risys-lab.github.io/RedSage/
Abstract
Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.
中文标题/摘要
标题:RedSage:网络安全通才大语言模型
网络安全操作需要能够支持多样化工作流程而不泄露敏感数据的辅助大语言模型。现有解决方案要么依赖于存在隐私风险的专有API,要么是缺乏领域适应性的开源模型。为弥合这一差距,我们通过大规模网络过滤和手动收集高质量资源,整理了118亿个与网络安全相关的持续预训练数据,覆盖了28600份文档,涉及框架、攻击技术和安全工具。在此基础上,我们设计了一种代理增强管道,模拟专家工作流程以生成266000个网络安全多轮对话样本,用于监督微调。结合通用开源大语言模型数据,这些资源使RedSage这一开源、本地可部署的网络安全助手得以训练,具备领域感知的预训练和后训练。为了严格评估模型,我们引入了RedSage-Bench基准,包含30000个多项选择题和240个开放式问答项,涵盖网络安全知识、技能和工具专长。RedSage还在建立的网络安全基准(如CTI-Bench、CyberMetric、SECURE)和通用大语言模型基准上进行了评估,以评估其更广泛的泛化能力。在80亿规模下,RedSage在网络安全基准上取得了持续更好的结果,比基线模型在网络安全基准上高出最多+5.59分,在Open LLM Leaderboard任务上高出+5.05分。这些发现表明,领域感知的代理增强和预/后训练不仅能增强网络安全特定的专业知识,还能帮助提高一般推理和指令遵循能力。所有模型、数据集和代码均已公开。
Summary / 总结
RedSage is designed to support diverse cybersecurity workflows while maintaining data privacy. It leverages 11.8B tokens of curated cybersecurity data and an agentic augmentation pipeline to generate 266K multi-turn samples for fine-tuning. RedSage outperforms baseline models by up to 5.59 points on cybersecurity benchmarks and 5.05 points on general LLM tasks, demonstrating the effectiveness of domain-aware pre/post-training and agentic augmentation. The model is open-source and publicly available.
RedSage旨在支持多样化的网络安全工作流程同时保护敏感数据。它利用了11.8B个经过筛选的网络安全数据令牌和一个代理增强管道来生成266K样本进行微调。RedSage在网络安全基准测试中的表现比基线模型高出最多5.59分,在通用大模型基准测试中的表现高出5.05分,显示出在领域特定和通用推理能力方面的改进。
Discovering Hidden Gems in Model Repositories
Authors: Jonathan Kahana, Eliahu Horwitz, Yedid Hoshen
First: 2026-01-29T18:59:55+00:00 · Latest: 2026-01-29T18:59:55+00:00
Abstract
Public repositories host millions of fine-tuned models, yet community usage remains disproportionately concentrated on a small number of foundation checkpoints. We investigate whether this concentration reflects efficient market selection or if superior models are systematically overlooked. Through an extensive evaluation of over 2,000 models, we show the prevalence of "hidden gems", unpopular fine-tunes that significantly outperform their popular counterparts. Notably, within the Llama-3.1-8B family, we find rarely downloaded checkpoints that improve math performance from 83.2% to 96.0% without increasing inference costs. However, discovering these models through exhaustive evaluation of every uploaded model is computationally infeasible. We therefore formulate model discovery as a Multi-Armed Bandit problem and accelerate the Sequential Halving search algorithm by using shared query sets and aggressive elimination schedules. Our method retrieves top models with as few as 50 queries per candidate, accelerating discovery by over 50x.
中文标题/摘要
标题:在模型仓库中发现隐藏瑰宝
公共仓库托管着数百万个微调模型,但社区使用却不成比例地集中在少数基础检查点上。我们研究这种集中是否反映了有效的市场选择,还是优秀模型被系统性地忽视了。通过评估超过2,000个模型,我们展示了“隐藏瑰宝”的普遍存在,这些不受欢迎的微调模型显著优于其受欢迎的同类。值得注意的是,在Llama-3.1-8B家族中,我们发现很少下载的检查点在不增加推理成本的情况下将数学性能从83.2%提高到96.0%。然而,通过彻底评估每个上传的模型来发现这些模型在计算上是不可行的。因此,我们将模型发现形式化为一个多臂老虎机问题,并通过使用共享查询集和激进的淘汰计划来加速逐步削减搜索算法。我们的方法在每个候选模型上只需50次查询即可检索到顶级模型,将发现速度加快了50多倍。
Summary / 总结
The research investigates whether the concentration of usage on a few popular models in public repositories reflects efficient market selection or the systematic overlooking of superior models. By evaluating over 2,000 models, the study identifies "hidden gems": unpopular fine-tuned models that significantly outperform popular ones. Within the Llama-3.1-8B family, rarely downloaded checkpoints improve math performance from 83.2% to 96.0% without increasing inference costs. Because exhaustively evaluating every uploaded model is computationally infeasible, the study formulates model discovery as a Multi-Armed Bandit problem and accelerates the Sequential Halving search algorithm with shared query sets and aggressive elimination schedules, retrieving top models with as few as 50 queries per candidate and accelerating discovery by over 50x.
研究旨在识别公共仓库中被忽视但性能优异的模型,这些模型虽然表现更好但使用率较低。研究评估了超过2,000个模型,并发现许多不受欢迎的微调模型优于流行模型。具体来说,在Llama-3.1-8B家族中,很少下载的检查点显著提高了数学性能且无需额外成本。为了高效地发现这些隐藏的瑰宝,研究人员将模型发现问题表述为一个多臂老虎机问题,并优化了顺序削减搜索算法,将每个候选模型所需的查询次数减少到50次,从而将发现过程加速了超过50倍。
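The Sequential Halving search with a shared query set, as described in the Hidden Gems abstract, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name, the `keep_fraction` parameter, the cumulative averaging of scores, and the `score` callback are all assumptions made for the example.

```python
import math
import random


def sequential_halving(models, queries, score, keep_fraction=0.5, seed=0):
    """Return the best surviving candidate after successive elimination rounds.

    models: candidate identifiers (hypothetical names).
    queries: benchmark queries; every surviving candidate sees the same batch.
    score(model, query): float in [0, 1], e.g. 1.0 for a correct answer.
    keep_fraction: share of candidates kept per round (smaller is more aggressive).
    """
    rng = random.Random(seed)
    survivors = list(models)
    rounds = max(1, math.ceil(math.log(len(survivors), 1 / keep_fraction)))
    per_round = max(1, len(queries) // rounds)
    totals = {m: 0.0 for m in survivors}
    counts = {m: 0 for m in survivors}
    pool = list(queries)
    rng.shuffle(pool)
    for r in range(rounds):
        if len(survivors) == 1:
            break
        # Shared query set: one batch is reused for all survivors this round.
        batch = pool[r * per_round:(r + 1) * per_round]
        if not batch:
            break
        for m in survivors:
            for q in batch:
                totals[m] += score(m, q)
                counts[m] += 1
        # Rank by running mean accuracy and drop the weakest candidates.
        survivors.sort(key=lambda m: totals[m] / counts[m], reverse=True)
        keep = max(1, math.ceil(len(survivors) * keep_fraction))
        survivors = survivors[:keep]
    return survivors[0]
```

Because weak candidates are eliminated early, most of the query budget is spent on the few candidates that remain plausible winners, which is where the savings over exhaustive evaluation come from.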
Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts
Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu
First: 2026-01-29T18:59:53+00:00 · Latest: 2026-01-29T18:59:53+00:00
Comments: 20 pages, 8 figures
Abstract
Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and studies are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
中文标题/摘要
标题:正确实现的混合线性注意力:高效蒸馏与适用于极长上下文的有效架构
混合Transformer架构结合了softmax注意力块和递归神经网络(RNN),在长上下文建模中表现出理想的性能-吞吐量权衡,但其采用和研究受到大规模从头预训练成本高昂的阻碍。一些最近的研究表明,可以通过参数转移和知识蒸馏将预训练的softmax注意力块转换为RNN块。然而,这些转移方法需要大量的训练数据(超过100亿个标记),并且生成的混合模型在长上下文性能方面表现不佳,这是混合模型相对于基于Transformer的模型在推理速度上有显著优势的场景。在本文中,我们提出了HALO(混合注意力通过层优化),一种将Transformer模型蒸馏为RNN-注意力混合模型的管道。然后,我们提出了HypeNet,一种通过新颖的位置编码方案(名为HyPE)和各种架构修改实现优越长度泛化的混合架构。我们使用HALO将Qwen3系列转换为HypeNet,实现了与原始Transformer模型相当的性能,同时具有优越的长上下文性能和效率。转换只需要23亿个标记,不到其预训练数据的0.01%
Summary / 总结
This paper addresses the prohibitive cost of pre-training hybrid Transformer architectures for long-context modeling by presenting HALO, a pipeline for distilling Transformer models into RNN-attention hybrid models. The authors also introduce HypeNet, a hybrid architecture with superior length generalization, achieved through a novel position encoding scheme (HyPE) and architectural modifications. Converting the Qwen3 series into HypeNet using HALO yields performance comparable to the original Transformer models with superior long-context performance and efficiency, and the conversion requires only 2.3B tokens, less than 0.01% of their pre-training data.
本文通过提出HALO管道,将预训练的softmax注意力块精简为RNN-注意力混合模型,解决了使用混合Transformer架构进行长上下文建模的挑战。作者还引入了HypeNet,通过新颖的位置编码方案和架构修改增强了长上下文性能。使用HALO将Qwen3系列转换为HypeNet,实现了与原始Transformer模型相当的性能,同时具有更好的长上下文性能和效率,仅需2.3B个令牌进行转换,远少于其预训练数据量。
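The two figures in the HALO abstract (2.3B conversion tokens, less than 0.01% of the pre-training data) imply a lower bound on the size of the pre-training corpus. The short derivation below uses only those two numbers; the symbol $N_{\text{pretrain}}$ is introduced here for illustration.

```latex
\frac{2.3 \times 10^{9}}{N_{\text{pretrain}}} < 0.01\% = 10^{-4}
\quad\Longrightarrow\quad
N_{\text{pretrain}} > \frac{2.3 \times 10^{9}}{10^{-4}} = 2.3 \times 10^{13}\ \text{tokens}
```

In other words, the claim is consistent with a pre-training corpus of more than 23 trillion tokens.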
UEval: A Benchmark for Unified Multimodal Generation
Authors: Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu
First: 2026-01-29T18:59:52+00:00 · Latest: 2026-01-29T18:59:52+00:00
Abstract
We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Different from previous works that rely on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric, consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417 validated rubric criteria, enabling scalable and fine-grained automatic scoring. UEval is challenging for current unified models: GPT-5-Thinking scores only 66.4 out of 100, while the best open-source model reaches merely 49.1. We observe that reasoning models often outperform non-reasoning ones, and transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap. This suggests that reasoning may be important for tasks requiring complex multimodal understanding and generation.
中文标题/摘要
标题:UEval:统一多模态生成基准
我们介绍了UEval,一个用于评估统一模型的基准,即能够生成图像和文本的模型。UEval包含1000个专家精选的问题,要求模型输出中包含图像和文本,这些问题来源于8个实际任务。我们精选的问题涵盖了广泛的推理类型,从逐步指南到教科书解释。对开放式的多模态生成进行评估并不简单,简单的LLM作为评判者的方法可能会忽略细节。不同于以往依赖多模态大型语言模型(MLLMs)来评估图像质量和文本准确性的工作,我们在UEval中设计了一种基于评分标准的评分系统。对于每个问题,提供参考图像和文本答案给MLLM生成初始评分标准,由人类专家进一步完善和验证这些评分标准。UEval包含10,417个验证过的评分标准,使得自动评分更加高效和精细。UEval对当前的统一模型具有挑战性:GPT-5-Thinking仅得66.4分,而最好的开源模型仅得49.1分。我们观察到,推理模型通常优于非推理模型,从推理模型转移推理痕迹到非推理模型可以显著缩小差距。这表明,对于需要复杂多模态理解和生成的任务,推理可能是重要的。
Summary / 总结
UEval is a benchmark for evaluating unified models that generate both images and text, using 1,000 expert-curated questions from 8 real-world tasks. It employs a rubric-based scoring system: for each question, reference images and text answers are provided to an MLLM, which drafts an initial rubric of evaluation criteria that human experts then refine and validate, for a total of 10,417 validated criteria. UEval is challenging for current unified models, with GPT-5-Thinking scoring 66.4 out of 100 and the best open-source model scoring 49.1. The study finds that reasoning models often outperform non-reasoning ones, and that transferring reasoning traces from a reasoning model to a non-reasoning model significantly narrows the gap.
UEval 是一个用于评估能够生成图像和文本的统一模型的基准,包含来自 8 个真实任务的 1,000 个专家精选问题。它采用基于评分表的评分系统,其中参考图像和文本由 MLLM 生成初始评分标准,然后由人类专家进行细化。UEval 对当前统一模型构成挑战,GPT-5-Thinking 的得分为 66.4,而最佳开源模型的得分为 49.1。该基准显示,推理模型优于非推理模型,从推理模型向非推理模型转移推理痕迹可以显著提高性能。
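To make the rubric-based scoring in the UEval entry concrete, here is a minimal sketch of how per-criterion verdicts could be aggregated into a score out of 100. The abstract does not state the aggregation rule, so the equal weighting below, the `Criterion` class, and its field names are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str   # one validated rubric criterion
    satisfied: bool    # verdict from a judge model (hypothetical field)


def rubric_score(criteria):
    """Score a response as the percentage of rubric criteria it satisfies.

    Assumes every criterion carries equal weight.
    """
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    met = sum(1 for c in criteria if c.satisfied)
    return 100.0 * met / len(criteria)
```

Scoring against many small, explicit criteria is what makes the evaluation fine-grained: a response that gets the text right but omits the required image loses only the criteria tied to the image.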
Exploring Reasoning Reward Model for Agents
Authors: Kaixuan Fan, Kaituo Feng, Manyuan Zhang, Tianshuo Peng, Zhixun Li, Yilei Jiang, Shuang Chen, Peng Pei, Xunliang Cai, Xiangyu Yue
First: 2026-01-29T18:59:52+00:00 · Latest: 2026-01-29T18:59:52+00:00
Comments: Project page: https://github.com/kxfan2002/Reagent
Abstract
Agentic Reinforcement Learning (Agentic RL) has achieved notable success in enabling agents to perform complex reasoning and tool use. However, most methods still rely on sparse outcome-based rewards for training. Such feedback fails to differentiate intermediate reasoning quality, leading to suboptimal training results. In this paper, we introduce Agent Reasoning Reward Model (Agent-RRM), a multi-faceted reward model that produces structured feedback for agentic trajectories, including (1) an explicit reasoning trace, (2) a focused critique that provides refinement guidance by highlighting reasoning flaws, and (3) an overall score that evaluates process performance. Leveraging these signals, we systematically investigate three integration strategies: Reagent-C (text-augmented refinement), Reagent-R (reward-augmented guidance), and Reagent-U (unified feedback integration). Extensive evaluations across 12 diverse benchmarks demonstrate that Reagent-U yields substantial performance leaps, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of our reasoning reward model and training schemes. Code, models, and datasets are all released to facilitate future research.
中文标题/摘要
标题:探索代理推理奖励模型
代理强化学习(Agentic RL)在使代理执行复杂推理和工具使用方面取得了显著成功。然而,大多数方法仍然依赖于稀疏的结果奖励进行训练。这种反馈无法区分中间推理的质量,导致训练结果不佳。在本文中,我们引入了代理推理奖励模型(Agent-RRM),这是一种多方面的奖励模型,为代理轨迹提供结构化的反馈,包括(1)明确的推理轨迹,(2)聚焦的批评,通过突出推理缺陷提供改进指导,以及(3)总体评分,评估过程表现。利用这些信号,我们系统地研究了三种集成策略:Reagent-C(文本增强改进),Reagent-R(奖励增强指导)和Reagent-U(统一反馈集成)。在12个不同基准上的广泛评估表明,Reagent-U带来了显著的性能提升,在GAIA上达到43.7%,在WebWalkerQA上达到46.2%,验证了我们推理奖励模型和训练方案的有效性。代码、模型和数据集均已发布,以促进未来的研究。
Summary / 总结
This paper addresses the limitation of sparse outcome-based rewards in Agentic Reinforcement Learning by proposing Agent Reasoning Reward Model (Agent-RRM), which provides structured feedback including reasoning traces, focused critiques, and overall scores. Three integration strategies—Reagent-C, Reagent-R, and Reagent-U—are evaluated, with Reagent-U showing significant performance improvements, achieving 43.7% on GAIA and 46.2% on WebWalkerQA, validating the effectiveness of the reasoning reward model and training schemes.
本文针对代理强化学习中稀疏结果奖励的局限性,提出了代理推理奖励模型(Agent-RRM),该模型提供结构化反馈,包括推理轨迹、重点批评和整体评分。三种集成策略——Reagent-C、Reagent-R和Reagent-U——被评估,其中Reagent-U显示出显著的性能提升,分别在GAIA和WebWalkerQA上达到43.7%和46.2%,验证了推理奖励模型和训练方案的有效性。
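The three feedback components of Agent-RRM can be pictured as a small record, and reward-augmented guidance as a blend of the sparse outcome reward with the process score. The sketch below is only one plausible reading: the class name, the assumption that the score lies in [0, 1], and the additive `weight` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RRMFeedback:
    reasoning_trace: str  # (1) explicit reasoning about the trajectory
    critique: str         # (2) focused critique highlighting reasoning flaws
    score: float          # (3) overall process score, assumed to lie in [0, 1]


def combined_reward(outcome_correct, feedback, weight=0.5):
    """Blend the sparse outcome reward with the dense process score.

    The additive form and the default weight are illustrative assumptions.
    """
    outcome = 1.0 if outcome_correct else 0.0
    return outcome + weight * feedback.score
```

With an outcome-only reward, two failed trajectories are indistinguishable; once the process score is added, the one with sounder intermediate reasoning receives the higher reward.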
DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation
Authors: Haozhe Xie, Beichen Wen, Jiarui Zheng, Zhaoxi Chen, Fangzhou Hong, Haiwen Diao, Ziwei Liu
Venue: www
First: 2026-01-29T18:59:51+00:00 · Latest: 2026-01-29T18:59:51+00:00
Comments: Project Page: https://www.infinitescript.com/project/dynamic-vla/ GitHub: https://github.com/hzxie/DynamicVLA
Abstract
Manipulating dynamic objects remains an open challenge for Vision-Language-Action (VLA) models, which, despite strong generalization in static manipulation, struggle in dynamic scenarios requiring rapid perception, temporal anticipation, and continuous control. We present DynamicVLA, a framework for dynamic object manipulation that integrates temporal reasoning and closed-loop adaptation through three key designs: 1) a compact 0.4B VLA using a convolutional vision encoder for spatially efficient, structurally faithful encoding, enabling fast multimodal inference; 2) Continuous Inference, enabling overlapping reasoning and execution for lower latency and timely adaptation to object motion; and 3) Latent-aware Action Streaming, which bridges the perception-execution gap by enforcing temporally aligned action execution. To fill the missing foundation of dynamic manipulation data, we introduce the Dynamic Object Manipulation (DOM) benchmark, built from scratch with an auto data collection pipeline that efficiently gathers 200K synthetic episodes across 2.8K scenes and 206 objects, and enables fast collection of 2K real-world episodes without teleoperation. Extensive evaluations demonstrate remarkable improvements in response speed, perception, and generalization, positioning DynamicVLA as a unified framework for general dynamic object manipulation across embodiments.
中文标题/摘要
标题:DynamicVLA:一种用于动态物体操作的视觉-语言-行动模型
对于动态物体的操作,视觉-语言-行动(VLA)模型仍然面临挑战,尽管在静态操作上表现出强大的泛化能力,但在需要快速感知、时间预测和连续控制的动态场景中却难以应对。我们提出了DynamicVLA,一种通过三个关键设计结合时间推理和闭环适应的框架:1)一个紧凑的0.4B VLA,使用卷积视觉编码器进行空间高效、结构忠实的编码,实现快速多模态推理;2)连续推理,实现重叠的推理和执行,以降低延迟并及时适应物体运动;3)潜在感知动作流式传输,通过强制执行时间对齐的动作执行来弥合感知-执行差距。为了填补动态操作数据的空白,我们引入了Dynamic Object Manipulation(DOM)基准,该基准从头开始构建,使用自动数据收集管道高效地收集了跨越2800个场景和206个物体的20万合成集,并能够快速收集2000个无需远程操作的真实世界集。广泛的评估表明,在响应速度、感知和泛化方面取得了显著改进,将DynamicVLA定位为一种统一框架,适用于各种体态下的通用动态物体操作。
Summary / 总结
DynamicVLA is a framework designed to address the challenges of dynamic object manipulation, integrating temporal reasoning and closed-loop adaptation. It uses a compact vision-language-action model with a convolutional vision encoder for efficient multimodal inference, Continuous Inference for lower latency, and Latent-aware Action Streaming to align perception and execution. The framework demonstrates significant improvements in response speed, perception, and generalization, as evaluated through the DOM benchmark, which includes both synthetic and real-world data collected via an auto data collection pipeline.
DynamicVLA 是一个旨在解决动态物体操作挑战的框架,结合了时间推理和闭环适应。它使用了一个紧凑的视觉-语言-行动模型,带有卷积视觉编码器以实现高效的多模态推理,Continuous Inference 以降低延迟,并通过 Latent-aware Action Streaming 来对齐感知和执行。该框架通过 DOM 基准测试展示了显著的响应速度、感知能力和泛化能力的提升,该基准测试包括通过自动数据收集管道收集的合成和真实世界数据。
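The interaction between overlapping inference and temporally aligned execution in the DynamicVLA entry can be illustrated with a toy scheduler. This is a sketch under stated assumptions rather than the paper's algorithm: each action chunk is assumed to carry the step it was planned from (`start`) and the step at which inference finished (`ready`), and the dictionary layout is hypothetical.

```python
def stream_actions(chunks, horizon):
    """Pick, at every control step, the action from the newest usable chunk.

    chunks: list of dicts with keys
        "start":   step index that actions[0] was planned for,
        "ready":   step at which the chunk becomes available (after latency),
        "actions": planned actions for steps start, start + 1, ...
    Actions planned for steps that have already passed are skipped, so what
    is executed at step t is always an action that was planned for step t.
    """
    executed = []
    for t in range(horizon):
        usable = [
            c for c in chunks
            if c["ready"] <= t and 0 <= t - c["start"] < len(c["actions"])
        ]
        if not usable:
            executed.append(None)  # nothing planned for this step yet
            continue
        newest = max(usable, key=lambda c: c["start"])
        executed.append(newest["actions"][t - newest["start"]])
    return executed
```

In this toy version a chunk that arrives two steps late simply loses its first two actions, and the controller switches to it as soon as it is ready instead of replaying stale motions.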
Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions
Authors: Xiaoxiao Sun, Mingyang Li, Kun Yuan, Min Woo Sun, Mark Endo, Shengguang Wu, Changlin Li, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
First: 2026-01-29T18:59:24+00:00 · Latest: 2026-01-29T18:59:24+00:00
Comments: 26 pages, 31 figures, 13 tables. Project Page: https://sites.google.com/view/vi-probe/
Abstract
Large Vision-Language Models (VLMs) often answer classic visual illusions "correctly" on original images, yet persist with the same responses when illusion factors are inverted, even though the visual change is obvious to humans. This raises a fundamental question: do VLMs perceive visual changes or merely recall memorized patterns? While several studies have noted this phenomenon, the underlying causes remain unclear. To move from observations to systematic understanding, this paper introduces VI-Probe, a controllable visual-illusion framework with graded perturbations and matched visual controls (without illusion inducer) that disentangles visually grounded perception from language-driven recall. Unlike prior work that focuses on averaged accuracy, we measure stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. Experiments across different families reveal that response persistence arises from heterogeneous causes rather than a single mechanism. For instance, GPT-5 exhibits memory override, Claude-Opus-4.1 shows perception-memory competition, while Qwen variants suggest visual-processing limits. Our findings challenge single-cause views and motivate probing-based evaluation that measures both knowledge and sensitivity to controlled visual change. Data and code are available at https://sites.google.com/view/vi-probe/.
中文标题/摘要
标题:VLMs 是感知还是回忆?经典视觉错觉探究视觉感知与记忆
大型视觉-语言模型(VLMs)在原始图像上对经典视觉错觉通常会给出正确的回答,但在错觉因素反转后仍坚持相同的回答,尽管人类可以明显察觉视觉变化。这引发了一个基本问题:VLMs 是感知视觉变化还是仅仅回忆已记忆的模式?尽管已有几项研究注意到了这一现象,但其背后的成因仍不清楚。为了从观察转向系统理解,本文引入了VI-Probe,这是一种可控的视觉错觉框架,具有分级扰动和匹配的视觉对照(无错觉诱导器),以解开基于视觉的感知与语言驱动的回忆之间的关系。不同于以往工作主要关注平均准确率,我们使用极性反转一致性、模板固定指数和与匹配对照归一化的错觉乘数来衡量稳定性和敏感性。不同家族的实验表明,反应持久性源自多种原因而非单一机制。例如,GPT-5 表现出记忆覆盖,Claude-Opus-4.1 显示感知与记忆的竞争,而 Qwen 变体则表明视觉处理的限制。我们的发现挑战了单一成因的观点,并促使基于探针的评估,以衡量知识和对受控视觉变化的敏感性。数据和代码可在 https://sites.google.com/view/vi-probe/ 获取。
Summary / 总结
This paper investigates whether large vision-language models (VLMs) perceive visual changes or merely recall memorized patterns by using a controllable visual-illusion framework called VI-Probe. The study measures stability and sensitivity using Polarity-Flip Consistency, Template Fixation Index, and an illusion multiplier normalized against matched controls. The experiments across different VLMs reveal that response persistence arises from various causes, challenging the notion of a single mechanism. The findings suggest that probing-based evaluation is necessary to measure both knowledge and sensitivity to controlled visual change.
该研究通过使用可控视觉错觉框架VI-Probe,探讨大型视觉语言模型(VLMs)是感知视觉变化还是仅回忆记忆模式。研究使用极性反转一致性、模板固定指数和与匹配控制相比归一化的错觉乘数来衡量稳定性和敏感性。不同VLMs的实验结果显示,响应持久性来源于多种原因,挑战了单一原因的观点,并强调了需要进行基于探针的评估,以衡量模型对受控视觉变化的知识和敏感性。
DynaWeb: Model-Based Reinforcement Learning of Web Agents
Authors: Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu
First: 2026-01-29T18:59:07+00:00 · Latest: 2026-01-29T18:59:07+00:00
Abstract
The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.
中文标题/摘要
标题:DynaWeb:基于模型的强化学习网页代理
自主网页代理的发展,由大型语言模型(LLMs)和强化学习(RL)驱动,代表了通用人工智能助手的重要一步。然而,训练这些代理受到与实时互联网交互的挑战限制,这既低效又昂贵且充满风险。基于模型的强化学习(MBRL)通过学习环境模型以实现模拟交互提供了有希望的解决方案。本文介绍了DynaWeb,这是一种新颖的MBRL框架,通过与训练以预测给定代理行为的自然网页表示的网页世界模型交互来训练网页代理。该模型充当合成的网页环境,代理策略可以在其中生成大量回放动作轨迹以实现高效的在线强化学习。除了免费策略回放外,DynaWeb还整合了来自训练数据的真实专家轨迹,这些轨迹在训练期间随机与策略回放交织,以提高稳定性和样本效率。在具有挑战性的WebArena和WebVoyager基准测试中进行的实验表明,DynaWeb能够持续且显著地提高最先进的开源网页代理模型的性能。我们的研究结果证明了通过想象训练网页代理的可行性,提供了扩展在线代理性RL的一种可扩展且高效的方法。
Summary / 总结
DynaWeb is a model-based reinforcement learning framework designed to train web agents using a web world model that predicts naturalistic web page representations. This approach enables efficient simulation and interaction, reducing the need for costly and risky live internet interactions. Experiments show that DynaWeb improves the performance of state-of-the-art web agent models on challenging benchmarks, demonstrating the potential of using imagination for scalable and efficient online agentic reinforcement learning.
DynaWeb 是一种基于模型的强化学习框架,通过与预测自然网页表示的网络世界模型交互来训练网络代理。这种方法通过模拟交互实现高效的在线强化学习,减少了对昂贵且有风险的实时互联网交互的需求。在 WebArena 和 WebVoyager 基准测试中的实验表明,DynaWeb 显著提高了最先进的网络代理模型的性能,展示了通过想象训练网络代理的潜力,为在线强化学习的可扩展性和高效性提供了途径。
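The random interleaving of expert trajectories with on-policy rollouts mentioned in the DynaWeb abstract might look like the following. The function name, the `expert_prob` mixing rate, and the tagged-tuple output are assumptions chosen for the example; the abstract does not specify the actual mixing schedule.

```python
import random


def interleave_trajectories(rollouts, expert, expert_prob=0.25, seed=0):
    """Mix real expert trajectories into a stream of world-model rollouts.

    Before each rollout, an expert trajectory is inserted with probability
    expert_prob (a hypothetical rate) until the expert data runs out.
    """
    rng = random.Random(seed)
    expert_iter = iter(expert)
    mixed = []
    for rollout in rollouts:
        if rng.random() < expert_prob:
            demo = next(expert_iter, None)
            if demo is not None:
                mixed.append(("expert", demo))
        mixed.append(("rollout", rollout))
    return mixed
```

The intuition is that real trajectories anchor the policy to states the live web actually produces, which limits drift when most of the training signal comes from imagined rollouts.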
FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale
Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch
First: 2026-01-29T18:58:47+00:00 · Latest: 2026-01-29T18:58:47+00:00
Abstract
Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data comprised of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at https://huggingface.co/fineinstructions.
中文标题/摘要
标题:FineInstructions: 将预训练规模扩展到合成指令
由于监督训练数据有限,大型语言模型(LLMs)通常通过在大量无结构文本数据上使用自我监督的“预测下一个词”目标进行预训练。为了使最终模型对用户有用,它还会在少量指令调优数据上进行进一步训练,这些数据由指令和响应的监督训练示例组成。为了克服有限的监督数据,我们提出了一种程序,可以将互联网规模预训练文档中的知识转化为数十亿组合成指令和答案训练对。由此产生的数据集称为FineInstructions,使用来自真实用户查询和提示的约1800万指令模板。这些指令模板与并实例化自无结构预训练语料库中的人类撰写的源文档。通过在如此大规模的“监督”合成训练数据上进行预训练,LLM可以从头开始仅使用指令调优目标进行预训练,这与LLM预期下游使用情况(响应用户提示)更为一致。我们进行了受控的逐词训练实验,并发现使用FineInstructions进行预训练在衡量自由形式响应质量的标准基准上优于标准预训练和其他提出的合成预训练技术。我们的资源可以在https://huggingface.co/fineinstructions 获取。
Summary / 总结
The research aims to address the limited supply of supervised training data for large language models by proposing a method that generates synthetic instruction and answer pairs at pre-training scale. The method uses approximately 18 million instruction templates derived from real user-written queries and prompts, which are matched to and instantiated with human-written source documents from pre-training corpora. The resulting FineInstructions dataset enables pre-training LLMs from scratch solely on the instruction-tuning objective, and in controlled token-for-token experiments it outperforms standard pre-training and other synthetic pre-training techniques on benchmarks measuring free-form response quality.
研究旨在通过提出一种生成大规模合成指令和答案对的方法来解决大型语言模型(LLMs)监督训练数据不足的问题。FineInstructions数据集基于1800万个来自真实用户查询的指令模板,并与人类撰写的源文档实例化,使LLMs能够仅通过指令调优目标进行从零开始的预训练,这与预期的下游使用场景更加一致。实验表明,使用FineInstructions预训练在衡量自由形式响应质量的标准基准上优于标准预训练和其他合成预训练技术。
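The match-then-instantiate step described in the FineInstructions entry can be illustrated with a toy example. Everything here is hypothetical: the two templates, the slot names, and the rule that a template matches a document when all of its slots can be filled are stand-ins for whatever matching procedure the authors actually use.

```python
from string import Template

# Hypothetical templates; each lists the slots a document must supply.
TEMPLATES = [
    {"text": "What is the capital of $country?", "slots": ["country"]},
    {"text": "Explain how $tool is used for $task.", "slots": ["tool", "task"]},
]


def match_templates(templates, doc_slots):
    """Return the templates whose required slots the document can fill."""
    return [t for t in templates if all(s in doc_slots for s in t["slots"])]


def instantiate(template, doc_slots, answer):
    """Fill a template from the document and pair it with an answer span."""
    instruction = Template(template["text"]).substitute(doc_slots)
    return {"instruction": instruction, "answer": answer}
```

Since the answer text is taken from a human-written document and not generated freely, each synthetic pair stays grounded in the source corpus while the instruction side resembles real user queries.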
Think Twice: Branch-and-Rethink Reasoning Reward Model
Authors: Yizhu Jiao, Jiaqi Zeng, Julien Veron Vialard, Oleksii Kuchaiev, Jiawei Han, Olivier Delalleau
First: 2025-10-27T17:58:07+00:00 · Latest: 2026-01-29T18:57:46+00:00
Comments: Source Code: https://github.com/yzjiao/BR-RM. Model Checkpoints: https://huggingface.co/nvidia/Qwen3-Nemotron-14B-BRRM and https://huggingface.co/nvidia/Qwen3-Nemotron-8B-BRRM
Abstract
Large language models (LLMs) increasingly rely on thinking models that externalize intermediate steps and allocate extra test-time compute, with think-twice strategies showing that a deliberate second pass can elicit stronger reasoning. In contrast, most reward models (RMs) still compress many quality dimensions into a single scalar in one shot, a design that induces judgment diffusion: attention spreads across evaluation criteria, yielding diluted focus and shallow analysis. We introduce branch-and-rethink (BR-RM), a two-turn RM that transfers the think-twice principle to reward modeling. Turn 1 performs adaptive branching, selecting a small set of instance-critical dimensions (such as factuality and safety) and sketching concise, evidence-seeking hypotheses. Turn 2 executes branch-conditioned rethinking, a targeted reread that tests those hypotheses and scrutinizes only what matters most. We train with GRPO-style reinforcement learning over structured two-turn traces using a simple binary outcome reward with strict format checks, making the approach compatible with standard RLHF pipelines. By converting all-at-once scoring into focused, second-look reasoning, BR-RM reduces judgment diffusion and improves sensitivity to subtle yet consequential errors while remaining practical and scalable. Experimental results demonstrate that our model achieves state-of-the-art performance on three challenging reward modeling benchmarks across diverse domains.
中文标题/摘要
标题:思两次:分支与重思奖励推理模型
大型语言模型(LLMs)越来越多地依赖于将中间步骤外部化并分配额外的测试时计算量的思考模型,其中思两次策略表明,经过深思熟虑的第二次思考可以激发更强的推理能力。相比之下,大多数奖励模型(RMs)仍然一次性压缩多个质量维度为单一标量,这种设计会导致判断分散:注意力在评估标准之间分散,导致分散的焦点和浅薄的分析。我们引入了分支与重思(BR-RM),这是一种两轮的奖励模型,将思两次原则应用于奖励建模。第一轮执行自适应分支,选择一组关键实例维度(如事实性和安全性),并勾勒出简明、证据导向的假设。第二轮执行条件分支重思,这是一种有针对性的重读,测试这些假设并仅审查最重要的内容。我们使用基于GRPO风格的强化学习训练结构化的两轮轨迹,使用简单的二元结果奖励并进行严格的格式检查,使该方法与标准的RLHF流水线兼容。通过将一次性评分转换为集中、二次审视的推理,BR-RM减少了判断分散,提高了对微妙但重要的错误的敏感性,同时保持了实用性和可扩展性。实验结果表明,我们的模型在三个跨领域挑战性的奖励建模基准上达到了最先进的性能。
Summary / 总结
The research aims to address the issue of judgment diffusion in reward models by introducing a two-turn reasoning model called branch-and-rethink (BR-RM). This model uses adaptive branching in the first turn to select critical dimensions and formulate concise hypotheses, followed by a targeted rethinking in the second turn to test these hypotheses. The model is trained using reinforcement learning with a binary outcome reward and strict format checks. Experimental results show that BR-RM outperforms existing models on three challenging reward modeling benchmarks, demonstrating improved sensitivity to subtle errors while maintaining practicality and scalability.
研究旨在通过引入两轮推理模型分支-重思(BR-RM)来解决奖励模型中的判断扩散问题。该模型在第一轮中通过自适应分支选择关键维度并提出简洁的假设,第二轮则进行针对性的重思来测试这些假设。实验结果表明,BR-RM 在三个具有挑战性的奖励建模基准测试中表现优于现有模型,展示了对细微但重要的错误更高的敏感性,同时保持了实用性和可扩展性。
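The training signal described in the BR-RM abstract (a binary outcome reward gated by strict format checks, optimized with GRPO-style reinforcement learning) can be sketched as below. The tag layout in `TRACE_PATTERN` and the choice to give malformed traces a reward of 0.0 are assumptions made for illustration; the group normalization by mean and standard deviation follows the usual GRPO recipe.

```python
import re
import statistics

# Hypothetical trace format: two tagged turns followed by a verdict.
TRACE_PATTERN = re.compile(
    r"^<turn1>.+</turn1>\s*<turn2>.+</turn2>\s*<verdict>([AB])</verdict>$",
    re.DOTALL,
)


def binary_reward(trace, preferred):
    """Return 1.0 only for a well-formed trace that picks the preferred response."""
    match = TRACE_PATTERN.match(trace.strip())
    if match is None:
        return 0.0  # strict format check: malformed traces earn nothing
    return 1.0 if match.group(1) == preferred else 0.0


def group_advantages(rewards):
    """Normalize rewards within one sampled group, GRPO-style."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # no signal when all samples agree
    return [(r - mean) / std for r in rewards]
```

A group in which every sampled trace earns the same reward contributes no gradient, so the learning signal comes from prompts where some traces judge correctly and others do not.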
MORPH: PDE Foundation Models with Arbitrary Data Modality
Authors: Mahindra Singh Rautela, Alexander Most, Siddharth Mansingh, Bradley C. Love, Alexander Scheinker, Diane Oyen, Nathan Debardeleben, Earl Lawrence, Ayan Biswas
First: 2025-09-25T22:38:36+00:00 · Latest: 2026-01-29T18:57:23+00:00
Abstract
We introduce MORPH, a modality-agnostic, autoregressive foundation model for partial differential equations (PDEs). MORPH is built on a convolutional vision transformer backbone that seamlessly handles heterogeneous spatiotemporal datasets of varying data modality (1D to 3D) at different resolutions, and multiple fields with mixed scalar and vector components. The architecture combines (i) component-wise convolution, which jointly processes scalar and vector channels to capture local interactions, (ii) inter-field cross-attention, which models and selectively propagates information between different physical fields, and (iii) axial attention, which factorizes full spatiotemporal self-attention along individual spatial and temporal axes to reduce computational burden while retaining expressivity. We pretrain multiple model variants on a diverse collection of heterogeneous PDE datasets and evaluate transfer to a range of downstream prediction tasks. Using both full-model fine-tuning and parameter-efficient low-rank adapters, MORPH outperforms models trained from scratch. Across extensive evaluations, MORPH matches or surpasses strong baselines and recent state-of-the-art models. Collectively, these capabilities present a flexible and powerful backbone for learning from the heterogeneous and multimodal nature of scientific observations, charting a path toward scalable and data-efficient scientific machine learning. The source code, datasets, and models are publicly available at https://github.com/lanl/MORPH.
中文标题/摘要
标题:MORPH:任意数据模态的偏微分方程基础模型
我们介绍了MORPH,一种模态无关的自回归偏微分方程(PDE)基础模型。MORPH基于卷积视觉变换器骨干网络,能够无缝处理不同数据模态(1D-3D)和不同分辨率的异质时空数据集,以及具有混合标量和矢量分量的多个字段。该架构结合了(i)分量卷积,联合处理标量和矢量通道以捕捉局部交互,(ii)跨字段交叉注意力,建模并选择性地传播不同物理场之间的信息,(iii)轴向注意力,沿个体空间和时间轴分解全时空自注意力,以减少计算负担同时保留表达能力。我们使用多样化的异质PDE数据集对多个模型变体进行预训练,并评估其在一系列下游预测任务中的迁移性能。使用全模型微调和参数高效低秩适配器,MORPH优于从头训练的模型。在广泛的评估中,MORPH匹配或超越了强大的基线和最近的先进模型。这些能力共同展示了学习科学观测的异构和多模态性质的灵活而强大的基础架构,为可扩展和数据高效的科学机器学习铺平了道路。源代码、数据集和模型可在https://github.com/lanl/MORPH/公开获取。
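The computational saving from the axial attention factorization in the MORPH abstract can be estimated by counting query-key pairs. The sketch below assumes a single time axis and two spatial axes and ignores constant factors, attention heads, and the cost of the other layers; the function names and grid sizes are illustrative.

```python
def full_attention_pairs(t, h, w):
    """Query-key pairs when every token attends to every token."""
    n = t * h * w
    return n * n


def axial_attention_pairs(t, h, w):
    """Query-key pairs when each token attends along one axis at a time.

    Each of the n tokens sees t keys along time, h along height,
    and w along width, so the total is n * (t + h + w).
    """
    n = t * h * w
    return n * (t + h + w)
```

For a grid of 8 time steps at 32 x 32 resolution, this gives 67,108,864 pairs for full attention against 589,824 for axial attention, a reduction of roughly two orders of magnitude.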
Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data
Authors: Grzegorz Stefanski, Alberto Presta, Michal Byra
First: 2026-01-29T18:56:41+00:00 · Latest: 2026-01-29T18:56:41+00:00
Abstract
In pruning, the Lottery Ticket Hypothesis posits that large networks contain sparse subnetworks, or winning tickets, that can be trained in isolation to match the performance of their dense counterparts. However, most existing approaches assume a single universal winning ticket shared across all inputs, ignoring the inherent heterogeneity of real-world data. In this work, we propose Routing the Lottery (RTL), an adaptive pruning framework that discovers multiple specialized subnetworks, called adaptive tickets, each tailored to a class, semantic cluster, or environmental condition. Across diverse datasets and tasks, RTL consistently outperforms single- and multi-model baselines in balanced accuracy and recall, while using up to 10 times fewer parameters than independent models and exhibiting semantic alignment. Furthermore, we identify subnetwork collapse, a performance drop under aggressive pruning, and introduce a subnetwork similarity score that enables label-free diagnosis of oversparsification. Overall, our results recast pruning as a mechanism for aligning model structure with data heterogeneity, paving the way toward more modular and context-aware deep learning.
中文标题/摘要
标题:路由彩票:异质数据的自适应子网络
在剪枝中,彩票票假说认为大型网络包含稀疏的子网络,或胜出的彩票,这些子网络可以在隔离状态下训练以匹配其密集对应物的性能。然而,大多数现有方法假设所有输入共享一个通用的胜出彩票,忽略了真实世界数据的固有异质性。在本文中,我们提出了路由彩票(RTL),一种自适应剪枝框架,发现多个专门化的子网络,称为自适应彩票,每个子网络都针对一个类别、语义簇或环境条件进行定制。在多种数据集和任务中,RTL在平衡准确率和召回率方面始终优于单模型和多模型基线,同时使用比独立模型少10倍的参数,并表现出语义对齐。此外,我们识别了子网络崩溃,即在激进剪枝下性能下降,并引入了子网络相似度评分,以实现无标签诊断过度稀疏化。总体而言,我们的结果重新定义了剪枝作为机制,使其与数据异质性对齐,为更模块化和上下文感知的深度学习铺平了道路。
Summary / 总结
The research aims to address the limitations of the Lottery Ticket Hypothesis by proposing an adaptive pruning framework called Routing the Lottery (RTL), which discovers multiple specialized subnetworks, or adaptive tickets, each tailored to specific classes or conditions. The method consistently outperforms single- and multi-model baselines in balanced accuracy and recall, using significantly fewer parameters and showing semantically aligned performance. Key findings include the identification of subnetwork collapse and the introduction of a subnetwork similarity score for diagnosing oversparsification, suggesting that pruning can align model structure with data heterogeneity more effectively.
研究旨在通过提出一种自适应剪枝框架Routing the Lottery (RTL),发现多个专门针对特定类或条件的子网络,或称为适应性票。该方法在平衡准确率和召回率上始终优于单模型和多模型基线,使用参数显著更少且表现出语义对齐的性能。关键发现包括子网络崩溃现象的识别以及引入子网络相似度评分以诊断过度稀疏化,表明剪枝可以更有效地使模型结构与数据异质性对齐。
Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers
Authors: Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang
First: 2026-01-29T18:56:12+00:00 · Latest: 2026-01-29T18:56:12+00:00
Comments: The manuscript is under review
Abstract
Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while reducing nearly half of the reasoning computation and unnecessary interaction turns. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1
中文标题/摘要
标题:边问边思:将大型语言模型从被动求解者转变为积极问询者
以推理为导向的大规模语言模型(LLMs)通过链式思考(CoT)提示取得了显著进展,但它们仍然受限于一种“盲目的自我思考”范式:即使关键信息缺失或模糊不清,它们也会进行大量内部推理。我们提出了积极互动推理(PIR),这是一种新的推理范式,将LLMs从被动求解者转变为能够将推理与澄清交织在一起的积极问询者。PIR不同于现有的主要通过查询外部环境来解决知识不确定性的搜索或工具框架,而是直接与用户互动来解决前提和意图层面的不确定性。PIR通过两个核心组件实现:(1)一种基于不确定性的监督微调程序,使模型具备互动推理能力;(2)一种基于用户模拟器的策略优化框架,由复合奖励驱动,使模型行为与用户意图保持一致。在数学推理、代码生成和文档编辑方面的广泛实验表明,PIR始终优于强基线,准确率最高提升32.70%,通过率最高提升22.90%,BLEU最高提升41.36分,同时减少了近一半的推理计算和不必要的互动轮次。进一步的事实知识可靠性评估、问答和缺失前提场景评估证实了PIR的强大泛化能力和鲁棒性。模型和代码已公开发布于:https://github.com/SUAT-AIRI/Proactive-Interactive-R1
Summary / 总结
The paper introduces Proactive Interactive Reasoning (PIR), a new paradigm that transforms reasoning-oriented Large Language Models (LLMs) from passive solvers to proactive inquirers. PIR interleaves reasoning with user interaction to address premise- and intent-level uncertainty. Experiments show PIR outperforms strong baselines in mathematical reasoning, code generation, and document editing, with higher accuracy, pass rate, and BLEU scores, while reducing reasoning computation and unnecessary interactions. Further evaluations confirm PIR's robustness and generalization across different scenarios.
论文提出了一种新的推理范式Proactive Interactive Reasoning (PIR),将大型语言模型(LLMs)从被动求解者转变为积极的问询者。PIR 将推理与用户交互相结合,以解决前提和意图层面的不确定性。实验表明,PIR 在数学推理、代码生成和文档编辑等方面优于强基线,具有更高的准确率、通过率和BLEU分数,同时减少了推理计算和不必要的交互轮次。可靠性评估进一步证实了PIR的鲁棒性和泛化能力。
"Not in My Backyard": LLMs Uncover Online and Offline Social Biases Against Homelessness
Authors: Jonathan A. Karr, Benjamin F. Herbst, Matthew L. Sisk, Xueyun Li, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla
First: 2025-08-14T17:58:34+00:00 · Latest: 2026-01-29T18:55:57+00:00
Abstract
Homelessness is a persistent social challenge, impacting millions worldwide. Over 876,000 people experienced homelessness (PEH) in the U.S. in 2025. Social bias is a significant barrier to alleviation, shaping public perception and influencing policymaking. Because online textual media and offline city council discourse both reflect and influence part of public opinion, they offer valuable insight for identifying and tracking social biases against PEH. We present a new, manually-annotated multi-domain dataset compiled from Reddit, X (formerly Twitter), news articles, and city council meeting minutes across ten U.S. cities. Our 16-category multi-label taxonomy creates a challenging long-tail classification problem: some categories appear in less than 1% of samples, while others exceed 70%. We find that small human-annotated datasets (1,702 samples) are insufficient for training effective classifiers, whether used to fine-tune encoder models or as few-shot examples for LLMs. To address this, we use GPT-4.1 to generate pseudo-labels on a larger unlabeled corpus. Training on this expanded dataset enables even small encoder models (ModernBERT, 150M parameters) to achieve 35.23 macro-F1, approaching GPT-4.1's 41.57. This demonstrates that data quantity matters more than model size, enabling low-cost, privacy-preserving deployment without relying on commercial APIs. Our results reveal that negative bias against PEH is prevalent both offline and online (especially on Reddit), with "not in my backyard" narratives showing the highest engagement. These findings uncover a type of ostracism that directly impacts poverty-reduction policymaking and provide actionable insights for practitioners addressing homelessness.
中文标题/摘要
标题:"不在我后院": 大型语言模型揭示对无家可归者在线和离线的社会偏见
无家可归是一个持续的社会挑战,影响着全世界数百万人。2025年,美国有超过876,000人经历无家可归(PEH)。社会偏见是缓解这一问题的重要障碍,影响公众认知并影响政策制定。鉴于在线文本媒体和离线城市议会讨论反映了部分公众意见并对其产生影响,它们提供了识别和追踪对PEH的社会偏见的重要见解。我们提出了一套新的、人工标注的多领域数据集,该数据集来自来自美国十个城市的Reddit、X(原Twitter)、新闻文章和城市议会会议记录。我们的16类多标签分类体系创建了一个具有挑战性的长尾分类问题:一些类别在样本中出现的比例不到1%,而其他类别则超过70%。我们发现,小规模的人工标注数据集(1,702个样本)不足以训练有效的分类器,无论是用于微调编码器模型还是作为LLM的少量示例。为了解决这个问题,我们使用GPT-4.1在更大规模的未标注语料库上生成伪标签。在扩展数据集上进行训练使即使是小型编码器模型(ModernBERT,1.5亿参数)也能达到35.23的宏F1值,接近GPT-4.1的41.57。这表明数据量比模型规模更重要,能够实现低成本、隐私保护的部署,无需依赖商业API。我们的研究结果揭示了无家可归者在线和离线(尤其是Reddit)都普遍存在负面偏见,"不在我后院"的叙事具有最高的参与度。这些发现揭示了一种直接影响减贫政策制定的排斥行为,并为应对无家可归问题的从业者提供了可操作的见解。
StepShield: When, Not Whether to Intervene on Rogue Agents
Authors: Gloria Felicia, Michael Eniolade, Jinfeng He, Zitha Sasindran, Hemant Kumar, Milan Hussain Angati, Sandeep Bandarupalli
First: 2026-01-29T18:55:46+00:00 · Latest: 2026-01-29T18:55:46+00:00
Comments: 16 pages, 2 figures, 14 tables
Abstract
Existing agent safety benchmarks report binary accuracy, conflating early intervention with post-mortem analysis. A detector that flags a violation at step 8 enables intervention; one that reports it at step 48 provides only forensic value. This distinction is critical, yet current benchmarks cannot measure it. We introduce StepShield, the first benchmark to evaluate when violations are detected, not just whether. StepShield contains 9,213 code agent trajectories, including 1,278 meticulously annotated training pairs and a 7,935-trajectory test set with a realistic 8.1% rogue rate. Rogue behaviors are grounded in real-world security incidents across six categories. We propose three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. Surprisingly, our evaluation reveals that an LLM-based judge achieves 59% EIR while a static analyzer achieves only 26%, a 2.3x performance gap that is entirely invisible to standard accuracy metrics. We further show that early detection has direct economic benefits: our cascaded HybridGuard detector reduces monitoring costs by 75% and projects to $108M in cumulative savings over five years at enterprise scale. By shifting the focus of evaluation from whether to when, StepShield provides a new foundation for building safer and more economically viable AI agents. The code and data are released under an Apache 2.0 license.
中文标题/摘要
标题:StepShield:何时干预,而非是否干预违规代理
现有的代理安全性基准报告二元准确性,混淆了早期干预与事后分析。在第8步标记违规行为可以启用干预;而在第48步报告则仅具有法医价值。这种区别至关重要,但当前的基准无法衡量这一点。我们引入了StepShield,这是第一个评估检测到违规行为的时间点,而不仅仅是是否的基准。StepShield包含9,213个代码代理轨迹,包括1,278个精心标注的训练对和一个包含7,935个轨迹的测试集,现实中的违规率为8.1%。违规行为基于六类真实世界的安全事件。我们提出了三个新的时间度量标准:早期干预率(EIR)、干预差距和节省的令牌数。令人惊讶的是,我们的评估表明,基于LLM的法官实现了59%的EIR,而静态分析器仅实现了26%,标准准确性指标完全无法看到这种2.3倍的性能差距。我们进一步表明,早期检测具有直接的经济效益:我们的级联HybridGuard检测器将监控成本降低了75%,并在企业规模上预计在未来五年内节省1.08亿美元。通过将评估的重点从是否转移到何时,StepShield为构建更安全且更具经济效益的AI代理提供了新的基础。代码和数据在Apache 2.0许可证下发布。
Summary / 总结
StepShield evaluates when violations are detected rather than just whether they are detected, addressing the limitations of existing benchmarks. It introduces three novel temporal metrics: Early Intervention Rate (EIR), Intervention Gap, and Tokens Saved. The study finds that an LLM-based judge outperforms a static analyzer, achieving 59% EIR compared to 26% for the static analyzer, highlighting the importance of early detection. Additionally, early detection leads to significant cost savings, with a cascaded HybridGuard detector reducing monitoring costs by 75% and projecting to $108M in cumulative savings over five years at enterprise scale.
StepShield 引入了一个新的基准,以评估 AI 代理中违规检测的时间,而不是仅仅评估二元准确性。它包含 9,213 个代码代理轨迹,其中包括 1,278 个标注的训练对和一个包含 7,935 个轨迹的测试集。评估结果显示,基于 LLM 的裁判(早期干预率为 59%)与静态分析器(26%)之间存在显著的性能差距,早期检测可以带来巨大的成本节约。通过关注违规何时被检测到,StepShield 提供了一种构建更安全且更具经济效益的 AI 代理的新方法。
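The StepShield abstract argues that binary accuracy hides detection timing. The sketch below illustrates that argument under loud assumptions: the abstract does not define Early Intervention Rate, so the definition used here (a rogue trajectory counts as caught early if it is flagged within a hypothetical `max_delay` steps of the annotated violation step) and all step values are invented for illustration, not taken from the benchmark.

```python
from typing import Optional, Sequence


def binary_recall(detections: Sequence[Optional[int]]) -> float:
    """Fraction of rogue trajectories flagged at any step (timing ignored)."""
    return sum(d is not None for d in detections) / len(detections)


def early_intervention_rate(
    violations: Sequence[int],
    detections: Sequence[Optional[int]],
    max_delay: int = 2,  # hypothetical tolerance, not from the paper
) -> float:
    """ASSUMED definition: share of rogue trajectories flagged no later than
    `max_delay` steps after the annotated violation step."""
    early = sum(
        1
        for v, d in zip(violations, detections)
        if d is not None and d - v <= max_delay
    )
    return early / len(violations)


# Made-up data: four rogue trajectories, each violating at step 8.
violation_steps = [8, 8, 8, 8]
detector_a = [8, 9, 8, 10]    # flags close to the violation
detector_b = [48, 8, 40, 48]  # mostly flags long afterwards

print(binary_recall(detector_a), binary_recall(detector_b))
print(early_intervention_rate(violation_steps, detector_a),
      early_intervention_rate(violation_steps, detector_b))
```

Both hypothetical detectors flag every rogue trajectory, so a binary metric cannot separate them, while the timing-aware metric does.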
Pay for Hints, Not Answers: LLM Shepherding for Cost-Efficient Inference
Authors: Ziming Dong, Hardik Sharma, Evan O'Toole, Jaya Prakash Champati, Kui Wu
First: 2026-01-29T18:52:54+00:00 · Latest: 2026-01-29T18:52:54+00:00
Abstract
Large Language Models (LLMs) deliver state-of-the-art performance on complex reasoning tasks, but their inference costs limit deployment at scale. Small Language Models (SLMs) offer dramatic cost savings yet lag substantially in accuracy. Existing approaches - routing and cascading - treat the LLM as an all-or-nothing resource: either the query bypasses the LLM entirely, or the LLM generates a complete response at full cost. We introduce LLM Shepherding, a framework that requests only a short prefix (a hint) from the LLM and provides it to the SLM. This simple mechanism is surprisingly effective for math and coding tasks: even hints comprising 10-30% of the full LLM response improve SLM accuracy significantly. Shepherding generalizes both routing and cascading, and it achieves lower cost under oracle decision-making. We develop a two-stage predictor that jointly determines whether a hint is needed and how many tokens to request. On widely used mathematical reasoning (GSM8K, CNK12) and code generation (HumanEval, MBPP) benchmarks, Shepherding reduces costs by 42-94% relative to LLM-only inference. Compared to state-of-the-art routing and cascading baselines, Shepherding delivers up to 2.8x cost reduction while matching accuracy. To our knowledge, this is the first work to exploit token-level budget control for SLM-LLM collaboration.
中文标题/摘要
标题:付费获取提示,而非答案:LLM 管理以实现成本效益推理
大型语言模型(LLMs)在复杂推理任务上提供最先进的性能,但其推理成本限制了大规模部署。小型语言模型(SLMs)提供显著的成本节省,但在准确性上落后很多。现有方法——路由和级联——将LLM视为全有或全无的资源:查询要么完全绕过LLM,要么LLM以全额成本生成完整响应。我们引入了LLM管理框架,该框架仅请求LLM提供一个短前缀(提示),并将该提示提供给SLM。这种简单的机制在数学和编程任务上表现出惊人的效果:即使提示仅占完整LLM响应的10-30%,也能显著提高SLM的准确性。管理既适用于路由也适用于级联,并且在最优决策下成本更低。我们开发了一种两阶段预测器,以共同确定是否需要提示以及请求多少令牌。在广泛使用的数学推理(GSM8K,CNK12)和代码生成(HumanEval,MBPP)基准测试中,管理将成本降低了42-94%。与最先进的路由和级联基线相比,管理在成本上最多可降低2.8倍,同时保持相同的准确性。据我们所知,这是首次利用令牌级预算控制实现SLM-LLM协作的工作。
Summary / 总结
The paper introduces LLM Shepherding, a method that requests only a short prefix (hint) from a Large Language Model (LLM) and provides it to a Small Language Model (SLM) to improve cost efficiency without significantly compromising accuracy. This approach reduces inference costs by 42-94% on benchmarks like GSM8K, CNK12, HumanEval, and MBPP compared to using LLMs alone. It also outperforms existing routing and cascading methods, achieving up to 2.8x cost reduction while maintaining similar accuracy levels.
本文提出了一种LLM牧羊方法,该方法仅从大型语言模型(LLM)请求一个短前缀(提示),并将其提供给小型语言模型(SLM),以显著提高SLM在数学和编程任务上的准确性,同时将成本降低42-94%。两阶段预测器确定何时以及请求多少令牌,实现最高2.8倍的成本降低,而不影响准确性。
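The hint mechanism described for LLM Shepherding can be sketched as follows. This is an illustrative toy under stated assumptions, not the authors' code: tokens are approximated by whitespace-separated words, the LLM response is a canned string truncated after the fact (a real system would presumably cap generation at the requested token budget instead), and the helper names and prompt wording are hypothetical.

```python
def take_hint(llm_response: str, hint_tokens: int) -> str:
    """Keep only the first `hint_tokens` whitespace-separated tokens."""
    return " ".join(llm_response.split()[:hint_tokens])


def build_slm_prompt(question: str, hint: str) -> str:
    """Hypothetical prompt layout: the SLM continues from the LLM's prefix."""
    return f"Question: {question}\nPartial solution: {hint}\nContinue and finish the solution."


def llm_token_saving(full_tokens: int, hint_tokens: int) -> float:
    """Share of LLM output tokens avoided by requesting only the hint.

    Ignores the SLM's own (cheaper) generation cost for simplicity.
    """
    return 1.0 - hint_tokens / full_tokens


# Canned stand-in for a full LLM answer (10 tokens).
full_response = "First compute 12*7=84 then subtract 4 to get 80 total"
hint = take_hint(full_response, hint_tokens=3)
print(build_slm_prompt("What is 12*7-4?", hint))
print(f"LLM tokens saved: {llm_token_saving(10, 3):.0%}")
```

A hint of 3 out of 10 tokens sits at the upper end of the 10-30% range quoted in the abstract; deciding whether a hint is needed at all, and how long it should be, is the job of the two-stage predictor, which this sketch does not model.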
World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems
Authors: Lakshya Gupta, Litao Li, Yizhe Liu, Sriram Ganapathi Subramanian, Kaheer Suleman, Zichen Zhang, Haoye Lu, Sumit Pasupalak
First: 2026-01-29T18:51:54+00:00 · Latest: 2026-01-29T18:51:54+00:00
Abstract
Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion similar to general consumer benchmarks, ignoring true challenges in enterprises, such as limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm to explicitly learn system dynamics. We release our GitHub repository for setting up and evaluating WoW.
中文标题/摘要
标题:工作流世界:将世界模型引入企业系统的基准
前沿的大语言模型(LLMs)在许多领域作为自主代理表现出色,但在复杂的企业系统中仍未经测试,这些系统中的隐藏工作流会在相互连接的数据库之间产生级联效应。现有的企业基准测试评估表面级别的代理任务完成,类似于一般的消费者基准测试,忽略了企业中的真正挑战,如有限的可观测性、庞大的数据库状态以及隐藏的工作流及其级联副作用。我们引入了“工作流世界”(WoW),这是一个基于 ServiceNow 的现实环境,包含 4000 多个业务规则和 55 个嵌入系统中的活跃工作流,以及 WoW-bench,这是一个包含 234 个任务的基准,评估受限的代理任务完成能力和企业动态建模能力。我们揭示了两个主要发现:(1)前沿的 LLMs 存在动态盲点,一致地未能预测其行为的隐形级联副作用,导致无声的约束违规;(2)在不透明系统中的可靠性需要基于世界建模,代理必须在高保真反馈不可用时在心中模拟隐藏状态的过渡,以弥合可观测性差距。为了实现可靠且有用的企业代理,WoW 促使一种新的范式,即明确学习系统动力学。我们发布了 GitHub 以设置和评估 WoW。
Summary / 总结
The research introduces World of Workflows (WoW), a benchmark for evaluating large language models (LLMs) in complex enterprise systems, which are characterized by hidden workflows and cascading side effects. WoW includes a realistic ServiceNow-based environment with 4,000+ business rules and 55 active workflows. The study reveals that frontier LLMs struggle with predicting cascading side effects, leading to silent constraint violations, and emphasizes the need for grounded world modeling to bridge the observability gap in opaque systems. The benchmark evaluates 234 tasks focusing on constrained agentic task completion and enterprise dynamics modeling capabilities.
论文提出了World of Workflows (WoW),一个用于测试大型语言模型(LLMs)在复杂企业系统中的基准,这些系统中普遍存在隐藏的工作流和级联效应。WoW 使用了一个包含 4,000 多个业务规则和 55 个活跃工作流的 ServiceNow 基础环境,并通过 WoW-bench 评估了 234 个任务,涉及受限任务完成和企业动态建模。主要发现表明,LLMs 在预测级联副作用方面存在困难,导致约束违规,并强调了在不透明系统中需要进行基于世界建模以弥补可观测性差距的重要性。这项工作突出了明确学习系统动态对于可靠的企业代理的重要性。
SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents
Authors: Yifeng Ding, Lingming Zhang
First: 2026-01-29T18:50:29+00:00 · Latest: 2026-01-29T18:50:29+00:00
Abstract
Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.
中文标题/摘要
标题:SWE-Replay: 软件工程代理的高效测试时扩展
测试时扩展已被广泛采用以增强大型语言模型(LLM)代理在软件工程(SWE)任务中的能力。然而,标准方法从头开始重复采样轨迹是计算上昂贵的。虽然最近的方法试图通过使用专门的价值代理来减轻成本,但它们可能会遭受模型校准不当的问题,并且无法泛化到现代代理,这些代理会合成自定义的bash脚本作为工具。在本文中,我们介绍了SWE-Replay,这是第一个无需依赖可能噪声的价值估计即可实现现代代理的高效且可泛化的测试时扩展技术。SWE-Replay通过回收先前试验中的轨迹来优化扩展过程,动态选择从头探索或利用存档经验在关键中间步骤进行分支。这种中间步骤的选择是基于仓库探索的潜力和推理意义,而不是外部LLM的质量估计。我们的评估表明,在SWE-Bench Verified上,SWE-Replay始终优于简单的扩展,成本最多可降低17.4%,同时保持或甚至提高性能最多3.8%。进一步在SWE-Bench Pro和多语言上的评估验证了SWE-Replay的泛化能力,确立了其作为软件工程代理高效测试时扩展稳健基础的地位。
Summary / 总结
This paper addresses the computational inefficiency of repeatedly sampling trajectories from scratch for test-time scaling in software engineering tasks. It introduces SWE-Replay, a technique that recycles trajectories from previous trials and dynamically decides whether to explore from scratch or exploit archived experience. SWE-Replay focuses on the potential and reasoning significance of repository exploration rather than relying on external quality estimates. The evaluation shows that SWE-Replay reduces costs by up to 17.4% while maintaining or improving performance by up to 3.8% on SWE-Bench Verified, and it is generalizable to other benchmarks.
论文提出了SWE-Replay,这是一种针对现代软件工程代理的高效且通用的测试时缩放技术。它解决了从头开始重复采样轨迹的计算效率低问题,并避免使用可能噪声较大的价值估计。SWE-Replay通过回收先前试验的轨迹并动态决定是探索还是利用存档经验来优化缩放过程。评估结果显示,SWE-Replay在SWE-Bench Verified上最多可减少17.4%的成本,同时保持或提高性能最多3.8%。它在其他基准上的表现也很好,验证了其鲁棒性。
The Patient is not a Moving Document: A World Model Training Paradigm for Longitudinal EHR
Authors: Irsyad Adam, Zekai Chen, David Laprade, Shaun Porwal, David Laub, Erik Reinertsen, Arda Pekis, Kevin Brown
First: 2026-01-29T18:49:37+00:00 · Latest: 2026-01-29T18:49:37+00:00
Abstract
Large language models (LLMs) trained with next-word-prediction have achieved success as clinical foundation models. Representations from these language backbones yield strong linear probe performance across biomedical tasks, suggesting that patient semantics emerge from next-token prediction at scale. However, this paradigm treats patients as a document to be summarized rather than a dynamical system to be simulated; a patient's trajectory emerges from their state evolving under interventions and time, requiring models that simulate dynamics rather than predict tokens. To address this, we introduce SMB-Structure, a world model for structured EHR that grounds a joint-embedding prediction architecture (JEPA) with next-token prediction (SFT). SFT grounds our model to reconstruct future patient states in token space, while JEPA predicts those futures in latent space from the initial patient representation alone, forcing trajectory dynamics to be encoded before the next state is observed. We validate across two large-scale cohorts: Memorial Sloan Kettering (23,319 oncology patients; 323,000+ patient-years) and INSPECT (19,402 pulmonary embolism patients). Using a linear probe evaluated at multiple points along the disease trajectory, we demonstrate that our training paradigm learns embeddings that capture disease dynamics not recoverable by autoregressive baselines, enabling SMB-Structure to achieve competitive performance on complex tasks characterized by high patient heterogeneity. Model weights are available at https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure.
中文标题/摘要
标题:患者不是移动的文档:一种用于纵向EHR的世界模型训练范式
大型语言模型(LLMs)通过下一个词预测训练,在临床基础模型方面取得了成功。这些语言骨干的表示在生物医学任务上表现出强大的线性探针性能,表明患者语义是从大规模的下一个词预测中涌现出来的。然而,这种范式将患者视为需要总结的文档,而不是需要模拟的动力系统;患者的轨迹是从其状态在干预和时间的作用下演变而来的,需要能够模拟动力学而不是预测词的模型。为了解决这个问题,我们引入了SMB-Structure,这是一种结构化EHR的世界模型,将联合嵌入预测架构(JEPA)与下一个词预测(SFT)相结合。SFT使我们的模型能够重建患者状态在词空间中的未来状态,而JEPA则仅从初始患者表示中预测这些未来状态在潜在空间中,迫使轨迹动力学在观察到下一个状态之前被编码。我们在两个大规模队列中进行了验证:纪念斯隆凯特林(23,319名肿瘤患者;323,000多个患者年)和INSPECT(19,402名肺栓塞患者)。使用沿疾病轨迹多个点评估的线性探针,我们证明我们的训练范式学习到的嵌入捕捉到了自回归基线无法恢复的疾病动力学,使SMB-Structure能够在高患者异质性特征的复杂任务上实现竞争力的性能。模型权重可在https://huggingface.co/standardmodelbio/SMB-v1-1.7B-Structure 获取。
Summary / 总结
This paper introduces SMB-Structure, a world model for structured Electronic Health Records (EHR) that combines next-token prediction with joint-embedding prediction to simulate patient dynamics rather than just predict tokens. It demonstrates that this approach captures disease dynamics better than autoregressive models, especially for tasks with high patient heterogeneity, using large-scale cohorts from Memorial Sloan Kettering and INSPECT to validate its effectiveness.
研究针对大型语言模型(LLMs)将患者视为静态文档的局限性,引入了SMB-Structure,一种针对结构化电子健康记录(EHRs)的世界模型。该模型结合了下一个词预测(SFT)与联合嵌入预测架构(JEPA),以模拟患者动态。实验结果显示,SMB-Structure能够学习捕捉疾病动态的嵌入,优于自回归基线模型,在复杂的、患者异质性高的任务上表现出色。
EditYourself: Audio-Driven Generation and Manipulation of Talking Head Videos with Diffusion Transformers
Authors: John Flynn, Wolfgang Paier, Dimitar Dinev, Sam Nhut Nguyen, Hayk Poghosyan, Manuel Toribio, Sandipan Banerjee, Guy Gafni
First: 2026-01-29T18:49:27+00:00 · Latest: 2026-01-29T18:49:27+00:00
Comments: Project page: https://edit-yourself.github.io/
Abstract
Current generative video models excel at producing novel content from text and image prompts, but leave a critical gap in editing existing pre-recorded videos, where minor alterations to the spoken script require preserving motion, temporal coherence, speaker identity, and accurate lip synchronization. We introduce EditYourself, a DiT-based framework for audio-driven video-to-video (V2V) editing that enables transcript-based modification of talking head videos, including the seamless addition, removal, and retiming of visually spoken content. Building on a general-purpose video diffusion model, EditYourself augments its V2V capabilities with audio conditioning and region-aware, edit-focused training extensions. This enables precise lip synchronization and temporally coherent restructuring of existing performances via spatiotemporal inpainting, including the synthesis of realistic human motion in newly added segments, while maintaining visual fidelity and identity consistency over long durations. This work represents a foundational step toward generative video models as practical tools for professional video post-production.
中文标题/摘要
标题:EditYourself:基于扩散Transformer的音频驱动对话头视频生成与操控
当前的生成视频模型在从文本和图像提示生成新颖内容方面表现出色,但在编辑现有的预录制视频方面存在关键缺口,其中对口语脚本的微小修改需要保留动作、时间连贯性、说话人身份和准确的唇部同步。我们介绍了EditYourself,一种基于DiT的音频驱动视频到视频(V2V)编辑框架,该框架允许基于脚本修改对话头视频,包括无缝添加、删除和调整视觉口语内容的时间。基于通用视频扩散模型,EditYourself通过音频条件和区域感知、编辑导向的训练扩展增强了其V2V能力。这使得通过时空修补精确唇部同步和时间连贯地重新结构现有表演成为可能,包括在新添加的段落中合成现实的人体运动,同时保持视觉保真度和身份一致性。这项工作代表了生成视频模型作为专业视频后期制作实用工具的基础步骤。
Summary / 总结
EditYourself is a DiT-based framework designed for audio-driven video-to-video editing, enabling modifications to existing talking head videos. It allows for the seamless addition, removal, and retiming of spoken content while preserving lip synchronization, temporal coherence, and speaker identity. Key findings include the ability to maintain visual fidelity and identity consistency over long durations, and the synthesis of realistic human motion in newly added segments, demonstrating significant progress in professional video post-production capabilities.
EditYourself 是一个基于 DiT 的框架,用于音频驱动的视频到视频编辑,能够对现有的谈话头视频进行添加、删除和重定时等修改,同时保持唇部同步、时间连贯性和说话者身份的一致性。关键发现包括在长时间内保持视觉保真度和身份一致性,并在新增段落中合成现实的人类运动,展示了在专业视频后期制作中的重大进展。
A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine
Authors: Anran Li, Yuanyuan Chen, Wenjun Long, Yu Yin, Yan Hu, Hyunjae Kim, Weipeng Zhou, Yujia Zhou, Hongyi Peng, Yang Ren, Xuguang Ai, Zhenyue Qin, Ming Hu, Xiaoxiao Li, Han Yu, Yih-Chung Tham, Lucila Ohno-Machado, Hua Xu, Qingyu Chen
First: 2026-01-29T18:48:21+00:00 · Latest: 2026-01-29T18:48:21+00:00
Comments: 38 pages, 9 tables, 3 figures
Abstract
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT models and with the LLaMA-3, DeepSeek-R1, and GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.
中文标题/摘要
标题:一种用于医学领域大型语言模型训练的联邦和参数高效框架
大型语言模型(LLMs)在医学基准测试中表现出强大的性能,包括问答和诊断。为了在临床环境中使用这些模型,LLMs通常通过继续预训练或使用临床数据进行后训练进一步适应。然而,大多数医学LLMs仅在单一机构的数据上进行训练,这在通用性和异构系统中的安全性方面存在局限性。联邦学习(FL)是一种跨医疗机构进行协作模型开发的有前途的解决方案。然而,将FL应用于医学中的LLMs仍然存在根本限制。首先,传统的FL要求在每次通信轮次中传输完整的模型,对于多十亿参数的LLMs来说,这在计算资源有限的情况下变得不切实际。其次,许多FL算法隐含地假设数据同质性,而现实世界的临床数据在患者、疾病和机构实践方面高度异质。我们提出了一个适用于医学应用的模型通用且参数高效的联邦学习框架。Fed-MedLoRA仅传输低秩适配器参数,从而减少通信和计算开销,而Fed-MedLoRA+进一步结合了适应性和数据感知聚合,以在跨站点异质性下提高收敛性。我们通过与BERT模型、LLaMA-3和DeepSeek-R1、GPT-4o模型的比较,将该框架应用于临床信息提取(IE),将患者叙述转化为结构化的医学实体和关系。评估设置包括(1)领域内训练和测试,(2)独立队列的外部验证,以及(3)使用耶鲁纽黑文健康系统的真实世界临床笔记进行的低资源新站点适应场景。
Summary / 总结
The paper introduces Fed-MedLoRA and Fed-MedLoRA+, federated learning frameworks designed to adapt large language models (LLMs) for medical applications. By transmitting only low-rank adapter parameters instead of the full model, they remain practical for multi-billion-parameter LLMs, and Fed-MedLoRA+ adds adaptive, data-aware aggregation to handle cross-site heterogeneity. The frameworks are evaluated on clinical information extraction against BERT, LLaMA-3, DeepSeek-R1, and GPT-4o models across in-domain, external-validation, and low-resource new-site settings.
论文旨在通过联邦学习(FL)解决大型语言模型(LLMs)在医疗应用中的训练问题,以克服数据异质性和计算资源的限制。提出了Fed-MedLoRA和Fed-MedLoRA+框架,仅传输低秩适配器参数以减少通信和计算开销,并结合自适应、数据感知聚合以在跨站点异质性下提高收敛性。该框架在临床信息提取任务上进行了评估,展示了在各种场景下的改进准确性,包括领域内训练、外部验证和低资源适应场景。
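Why sending only low-rank adapters reduces communication can be shown with a short calculation. The sketch below is not the Fed-MedLoRA algorithm: the layer size, rank, site sizes, and function names are hypothetical, and the aggregation shown is plain sample-size-weighted averaging in the style of FedAvg, used here as a stand-in because the abstract does not specify the adaptive, data-aware rule of Fed-MedLoRA+.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a rank-`rank` adapter pair B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def weighted_average(site_updates: list[list[float]], site_sizes: list[int]) -> list[float]:
    """Sample-size-weighted mean of flattened adapter parameters (FedAvg-style)."""
    total = sum(site_sizes)
    n_params = len(site_updates[0])
    return [
        sum(update[i] * size for update, size in zip(site_updates, site_sizes)) / total
        for i in range(n_params)
    ]


# Hypothetical 4096 x 4096 projection with a rank-8 adapter.
full = full_param_count(4096, 4096)
lora = lora_param_count(4096, 4096, rank=8)
print(f"full={full}, adapter={lora}, ratio={full // lora}x")

# Two hypothetical sites with 100 and 300 training notes.
print(weighted_average([[1.0, 2.0], [3.0, 4.0]], [100, 300]))
```

For this hypothetical layer, each communication round carries 256 times fewer parameters than sending the dense matrix.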
SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence
Authors: Saoud Aldowaish, Yashwanth Karumanchi, Kai-Chen Chiang, Soroosh Noorzad, Morteza Fayazi
First: 2026-01-29T18:41:52+00:00 · Latest: 2026-01-29T18:41:52+00:00
Abstract
Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
中文标题/摘要
标题:SINA:使用人工智能的电路原理图图像到网表生成器
当前将电路原理图图像转换为机器可读网表的方法在组件识别和连接推理方面存在困难。在本文中,我们介绍了SINA,这是一个开源的全自动电路原理图图像到网表生成器。SINA结合了深度学习进行准确的组件检测、连通分量标记(CCL)进行精确的连接提取以及光学字符识别(OCR)进行组件参考标号检索,同时采用视觉语言模型(VLM)进行可靠的参考标号分配。在我们的实验中,SINA的整体网表生成准确率为96.47%,比最先进的方法高出2.72倍。
Summary / 总结
SINA is an automated circuit schematic image-to-netlist generator that uses deep learning for component detection, CCL for connectivity extraction, OCR for reference designator retrieval, and a VLM for reliable reference designator assignments. It achieves 96.47% overall netlist-generation accuracy, which is 2.72 times higher than existing methods.
SINA 是一种自动化的电路原理图图像到网表生成器,使用深度学习进行元件检测、CCL 进行连接提取、OCR 进行参考标识符检索,以及 VLM 进行可靠的参考标识符分配。它实现了 96.47% 的整体网表生成准确率,比现有方法高 2.72 倍。
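Connected-Component Labeling, which SINA is described as using for connectivity extraction, can be illustrated on a tiny binary image. This is a generic textbook version (breadth-first flood fill with 4-connectivity), not SINA's pipeline; the grid and function name are made up, with 1 standing for a wire pixel, so that each label loosely corresponds to one electrically connected net.

```python
from collections import deque


def label_components(grid: list[list[int]]) -> tuple[list[list[int]], int]:
    """Label 4-connected regions of 1-pixels; returns (label grid, region count)."""
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 and labels[r][c] == 0:
                count += 1                      # start a new component
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:                    # flood-fill the component
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


# Made-up 4 x 5 "wire mask" with three separate traces.
wire_mask = [
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
]
labels, n_nets = label_components(wire_mask)
for row in labels:
    print(row)
print("components:", n_nets)
```

Pixels that share a label belong to the same trace, so component terminals touching the same label would be placed on the same net in a netlist.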
Evaluating Text Creativity across Diverse Domains: A Dataset and Large Language Model Evaluator
Authors: Qian Cao, Xiting Wang, Yuzhuo Yuan, Yahui Liu, Fang Luo, Ruihua Song
Venue: ICLR 2026
First: 2025-05-25T17:25:23+00:00 · Latest: 2026-01-29T18:38:43+00:00
Comments: Accepted by ICLR 2026
Abstract
Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations heavily rely on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. Through training on CreataSet, we develop an LLM-based evaluator named CrEval. CrEval demonstrates remarkable superiority over existing methods in alignment with human judgments. Experimental results underscore the indispensable significance of integrating both human and synthetic data to train highly robust evaluators, and showcase the practical utility of CrEval in boosting the creativity of LLMs.
中文标题/摘要
标题:跨领域评估文本创造力:一个数据集和大型语言模型评估器
大型语言模型(LLMs)的创造力评估仍然是一个具有挑战性的前沿领域。当前的评估主要依赖于低效且成本高昂的人类判断,阻碍了增强机器创造力的进步。虽然存在自动方法,包括心理测试到启发式或提示驱动的方法,但它们往往缺乏普适性或与人类判断的对齐。为了解决这些问题,我们提出了一种新的成对比较框架,用于评估文本创造力,该框架利用共享的上下文指令以提高评估一致性。我们引入了CreataSet,这是一个大规模数据集,包含10万多个与人类水平相当和100万多个合成创造性指令-响应对,覆盖了多种开放领域任务。通过在CreataSet上进行训练,我们开发了一个基于LLM的评估器CrEval。CrEval在与人类判断的对齐方面表现出显著的优势。实验结果强调了整合人类和合成数据以训练高度稳健的评估器的不可或缺的重要性,并展示了CrEval在提升LLM创造力方面的实际用途。
Summary / 总结
The paper addresses the challenge of evaluating text creativity for large language models (LLMs) by proposing a novel pairwise-comparison framework, a large-scale dataset named CreataSet, and an LLM-based evaluator named CrEval. CreataSet consists of 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. CrEval, trained on CreataSet, aligns better with human judgments than existing methods. The results highlight the importance of combining human and synthetic data to train robust evaluators, and show that CrEval can be used to boost the creativity of LLMs.
论文提出了一种新颖的成对比较框架、大规模数据集CreataSet和基于LLM的评估器CrEval,以解决大型语言模型(LLMs)文本创造力评估的挑战。CreataSet包含100K+的人类级和1M+合成的创意指令-响应对,覆盖多种开放领域任务。在CreataSet上训练得到的CrEval在与人类判断的对齐方面优于现有方法。结果强调了结合人类和合成数据训练稳健评估器的重要性,并表明CrEval可用于提升LLM的创造力。
Value-Based Pre-Training with Downstream Feedback
Authors: Shuqi Ke, Giulia Fanti
First: 2026-01-29T18:38:09+00:00 · Latest: 2026-01-29T18:38:09+00:00
Abstract
Can a small amount of verified goal information steer the expensive self-supervised pretraining of foundation models? Standard pretraining optimizes a fixed proxy objective (e.g., next-token prediction), which can misallocate compute away from downstream capabilities of interest. We introduce V-Pretraining: a value-based, modality-agnostic method for controlled continued pretraining in which a lightweight task designer reshapes the pretraining task to maximize the value of each gradient step. For example, consider self-supervised learning (SSL) with sample augmentation. The V-Pretraining task designer selects pretraining tasks (e.g., augmentations) for which the pretraining loss gradient is aligned with a gradient computed over a downstream task (e.g., image segmentation). This helps steer pretraining towards relevant downstream capabilities. Notably, the pretrained model is never updated on downstream task labels; they are used only to shape the pretraining task. Under matched learner update budgets, V-Pretraining of 0.5B--7B language models improves reasoning (GSM8K test Pass@1) by up to 18% relative over standard next-token prediction using only 12% of GSM8K training examples as feedback. In vision SSL, we improve the state-of-the-art results on ADE20K by up to 1.07 mIoU and reduce NYUv2 RMSE while improving ImageNet linear accuracy, and we provide pilot evidence of improved token efficiency in continued pretraining.
中文标题/摘要
标题:基于价值的预训练与下游反馈
少量验证过的目标信息能否引导基础模型昂贵的自监督预训练?标准预训练优化固定代理目标(例如,下一个词预测),这可能使计算资源偏离真正关注的下游能力。我们提出了V-预训练:一种基于价值、模态无关的受控持续预训练方法,其中轻量级任务设计器重塑预训练任务,以最大化每个梯度步的价值。例如,考虑带样本增强的自监督学习(SSL)。V-预训练任务设计器选择预训练任务(例如,增强方法),使得预训练损失梯度与在下游任务(例如,图像分割)上计算的梯度对齐。这有助于引导预训练朝着相关的下游能力发展。值得注意的是,预训练模型从未在下游任务标签上更新;这些标签仅用于塑造预训练任务。在匹配的学习器更新预算下,仅使用12%的GSM8K训练示例作为反馈,对0.5B-7B语言模型进行V-预训练可使推理能力(GSM8K测试Pass@1)相对标准的下一个词预测提升高达18%。在视觉SSL中,我们将ADE20K上的最新结果提高了最多1.07 mIoU,降低了NYUv2 RMSE,同时提高了ImageNet线性精度,并提供了持续预训练中标记效率提升的初步证据。
Summary / 总结
The research addresses the misallocation of compute during self-supervised pretraining of foundation models, where a fixed proxy objective can steer effort away from the downstream capabilities of interest. V-Pretraining uses a lightweight task designer that reshapes the pretraining task so that the pretraining loss gradient aligns with a gradient computed on a downstream task. The method needs only a small amount of verified goal information, and the pretrained model is never updated on downstream task labels. Under matched learner update budgets, V-Pretraining of 0.5B-7B language models improves GSM8K test Pass@1 by up to 18% relative over standard next-token prediction while using only 12% of GSM8K training examples as feedback. In vision SSL, it improves ADE20K by up to 1.07 mIoU and reduces NYUv2 RMSE while improving ImageNet linear accuracy.
研究旨在解决基础模型自监督预训练过程中计算资源分配不当的问题。V-预训练方法通过下游任务反馈重新塑造预训练任务,使其与所需下游能力对齐。该方法仅需少量经过验证的目标信息,且从不在下游任务标签上更新预训练模型。关键实验结果表明,在仅使用12%的GSM8K训练示例作为反馈的情况下,V-预训练使GSM8K测试Pass@1相对标准的下一个词预测提高了高达18%;在视觉自监督学习中,ADE20K的mIoU最多提高1.07,NYUv2的RMSE降低,同时ImageNet线性精度得到提升。
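The selection rule described in the abstract (choose pretraining tasks whose loss gradient is aligned with a downstream-task gradient) can be illustrated with a small sketch. This is not the V-Pretraining algorithm itself: measuring alignment by cosine similarity, picking a single best task by argmax, and the names `cosine` and `pick_aligned_task` are all assumptions. The gradient vectors are made-up toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def pick_aligned_task(task_grads, downstream_grad):
    """Return the candidate task whose pretraining gradient is best
    aligned with the downstream gradient.

    task_grads: dict mapping a task name (e.g. an augmentation) to the
    flattened pretraining-loss gradient obtained under that task.
    """
    return max(task_grads, key=lambda name: cosine(task_grads[name], downstream_grad))

# Toy values: three candidate augmentations and one downstream gradient.
downstream_grad = [1.0, 0.0]
candidate_grads = {
    "crop": [0.9, 0.1],   # nearly parallel to the downstream gradient
    "blur": [0.0, 1.0],   # orthogonal
    "flip": [-1.0, 0.0],  # opposed
}
chosen = pick_aligned_task(candidate_grads, downstream_grad)
```

Note that the downstream labels only influence which task is chosen; the model's parameters would still be updated with the pretraining loss alone, which matches the abstract's statement that the model is never updated on downstream labels.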
ECO: Quantized Training without Full-Precision Master Weights
Authors: Mahdi Nikdan, Amir Zandieh, Dan Alistarh, Vahab Mirrokni
First: 2026-01-29T18:35:01+00:00 · Latest: 2026-01-29T18:35:01+00:00
Abstract
Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high-precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as $\textit{master weights}$. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.
中文标题/摘要
标题:ECO:无需全精度主权重的量化训练
量化显著提高了大型语言模型(LLM)训练的计算和内存效率。然而,现有方法仍然依赖于在高精度下累积更新:具体来说,梯度更新必须应用于高精度权重缓冲区,称为“主权重”。该缓冲区引入了大量内存开销,特别是在稀疏混合专家(SMoE)模型中,模型参数和优化器状态占主导地位。为了解决这个问题,我们引入了误差补偿优化器(ECO),它通过直接对量化参数进行更新来消除主权重。ECO 在每一步后量化权重,并仔细将由此产生的量化误差注入到优化器动量中,形成一个无需额外内存的误差反馈循环。我们证明,在标准假设和衰减的学习率下,ECO 收敛到最优解的恒定半径邻域,而简单的主权重移除可能会导致与学习率成反比的误差。我们展示了针对小型Transformer(30-800M)、Gemma-3 1B 模型和具有 FP8 量化参数的 2.1B 参数稀疏 MoE 模型的预训练结果,以及 DeepSeek-MoE-16B 在 INT4 精度下的微调结果。在整个过程中,ECO 在主权重基线上的准确率几乎无损,显著地改变了静态内存与验证损失的帕累托前沿。
Summary / 总结
The research aims to improve the memory efficiency of quantized Large Language Model (LLM) training by eliminating high-precision master weights. The Error-Compensating Optimizer (ECO) applies updates directly to quantized parameters, quantizes the weights after each step, and injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. Across pretraining of small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B-parameter Sparse MoE model with FP8 quantization, and INT4 fine-tuning of DeepSeek-MoE-16B, ECO matches master-weight baselines up to near-lossless accuracy while shifting the static memory vs validation loss Pareto frontier.
该论文提出了一种名为ECO的优化器,通过在每一步后量化权重并将量化误差注入优化器动量中,从而消除高精度主权重的需要,减少内存开销。实验结果显示,ECO在各种模型大小上,包括Transformer和稀疏混合专家模型,可以达到与具有主权重的传统方法相近的性能,同时显著降低静态内存使用量。
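The error-feedback idea in the abstract (quantize after each step and fold the quantization error into the optimizer momentum) can be sketched for a single scalar weight under SGD with momentum. The abstract does not give ECO's exact update rule, so the injection formula below is an assumption: the error is divided by `lr * beta` so that the next momentum step re-applies exactly the part of the update that rounding discarded. The uniform rounding grid and all names are likewise hypothetical.

```python
def quantize(x, step):
    """Round x to the nearest multiple of step (a toy uniform quantizer)."""
    return round(x / step) * step

def error_feedback_step(w_q, m, grad, lr, beta, step):
    """One momentum-SGD step that stores only the quantized weight.

    Assumed rule (not taken from the paper): the rounding error is
    folded into the momentum buffer, so no extra buffer is kept.
    Requires lr > 0 and beta > 0.
    """
    m = beta * m + grad        # standard momentum update
    w_tmp = w_q - lr * m       # weight before rounding (transient value)
    w_new = quantize(w_tmp, step)
    err = w_tmp - w_new        # what rounding threw away
    m = m - err / (lr * beta)  # next step's -lr*beta*m term adds err back
    return w_new, m

# Toy run: with a grid of 0.5, the 0.02 update is rounded away, but the
# momentum buffer now remembers it.
w, m = error_feedback_step(w_q=0.0, m=0.0, grad=-0.2, lr=0.1, beta=0.9, step=0.5)
```

In this toy setting a plain round-after-update scheme would drop the 0.02 move entirely, since 0.02 rounds back to 0.0 on a grid of 0.5. The sketch keeps that information in the momentum instead of in a high-precision weight copy.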
RefAny3D: 3D Asset-Referenced Diffusion Models for Image Generation
Authors: Hanzhuo Huang, Qingyang Bao, Zekai Gu, Zhongshuo Du, Cheng Lin, Yuan Liu, Sibei Yang
Venue: ICLR 2026
First: 2026-01-29T18:30:10+00:00 · Latest: 2026-01-29T18:30:10+00:00
Comments: ICLR 2026. Project page: https://judgementh.github.io/RefAny3D Codes: https://github.com/JudgementH/RefAny3D
Abstract
In this paper, we propose a 3D asset-referenced diffusion model for image generation, exploring how to integrate 3D assets into image diffusion models. Existing reference-based image generation methods leverage large-scale pretrained diffusion models and demonstrate strong capability in generating diverse images conditioned on a single reference image. However, these methods are limited to single-image references and cannot leverage 3D assets, constraining their practical versatility. To address this gap, we present a cross-domain diffusion model with dual-branch perception that leverages multi-view RGB images and point maps of 3D assets to jointly model their colors and canonical-space coordinates, achieving precise consistency between generated images and the 3D references. Our spatially aligned dual-branch generation architecture and domain-decoupled generation mechanism ensure the simultaneous generation of two spatially aligned but content-disentangled outputs, RGB images and point maps, linking 2D image attributes with 3D asset attributes. Experiments show that our approach effectively uses 3D assets as references to produce images consistent with the given assets, opening new possibilities for combining diffusion models with 3D content creation.
中文标题/摘要
标题:RefAny3D:基于3D资产的扩散模型图像生成
在本文中,我们提出了一种基于3D资产的扩散模型,探索如何将3D资产整合到图像扩散模型中。现有的基于参考的图像生成方法利用大规模预训练的扩散模型,并展示了在单张参考图像条件下生成多样化图像的强大能力。然而,这些方法仅限于单张图像参考,无法利用3D资产,限制了其实用的灵活性。为了解决这一差距,我们提出了一种跨域扩散模型,具有双分支感知机制,利用多视角RGB图像和3D资产的点图来联合建模它们的颜色和标准空间坐标,实现生成图像与3D参考之间的精确一致性。我们的空间对齐双分支生成架构和域解耦生成机制确保同时生成两个空间对齐但内容分离的输出,RGB图像和点图,将2D图像属性与3D资产属性联系起来。实验表明,我们的方法有效地利用3D资产作为参考,生成与给定资产一致的图像,为将扩散模型与3D内容创作相结合开辟了新的可能性。
Summary / 总结
This paper introduces RefAny3D, a 3D asset-referenced diffusion model for image generation, addressing the limitation of existing methods that rely on single-image references. The model uses multi-view RGB images and point maps of 3D assets to generate images that precisely match the given 3D references. Experiments demonstrate that RefAny3D can effectively utilize 3D assets as references, enhancing the practical versatility of image generation methods.
该论文提出了RefAny3D,一种基于3D资产的扩散模型,旨在解决现有方法只能使用单张参考图像的局限性。它提出了一种跨域扩散模型,结合了双分支感知,利用多视角RGB图像和3D资产的点图来生成与给定3D资产精确一致的图像。该模型确保同时生成RGB图像和点图,将2D图像属性与3D资产属性联系起来。实验表明,RefAny3D能够有效利用3D资产作为参考生成与给定3D资产一致的图像,增强了图像生成模型的实用灵活性。
NeuroFaith: Evaluating LLM Self-Explanation Faithfulness via Internal Representation Alignment
Authors: Milan Bhan, Jean-Noel Vittaut, Nicolas Chesneau, Sarath Chandar, Marie-Jeanne Lesot
First: 2025-06-10T22:30:53+00:00 · Latest: 2026-01-29T18:24:42+00:00
Abstract
Large Language Models (LLMs) can generate plausible free text self-explanations to justify their answers. However, these natural language explanations may not accurately reflect the model's actual reasoning process, pinpointing a lack of faithfulness. Existing faithfulness evaluation methods rely primarily on behavioral tests or computational block analysis without examining the semantic content of internal neural representations. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free text self-explanation by identifying key concepts within explanations and mechanistically testing whether these concepts actually influence the model's predictions. We show the versatility of NeuroFaith across 2-hop reasoning and classification tasks. Additionally, we develop a linear faithfulness probe based on NeuroFaith to detect unfaithful self-explanations from representation space and improve faithfulness through steering. NeuroFaith provides a principled approach to evaluating and enhancing the faithfulness of LLM free text self-explanations, addressing critical needs for trustworthy AI systems.
中文标题/摘要
标题:NeuroFaith:通过内部表示一致性评估LLM自我解释的可信度
大型语言模型(LLMs)可以生成看似合理的自由文本自我解释来证明其答案。然而,这些自然语言解释可能并不准确反映模型的实际推理过程,即存在忠实性不足的问题。现有的忠实性评估方法主要依赖于行为测试或计算块分析,而不检查内部神经表示的语义内容。本文提出了一种名为NeuroFaith的灵活框架,通过识别解释中的关键概念并从机制层面检验这些概念是否确实影响模型的预测,来衡量LLM自由文本自我解释的忠实性。我们展示了NeuroFaith在2跳推理和分类任务中的通用性。此外,我们基于NeuroFaith开发了一种线性忠实性探针,用于从表示空间检测不忠实的自我解释,并通过引导(steering)提高忠实性。NeuroFaith提供了一种评估和增强LLM自由文本自我解释忠实性的原理性方法,满足了对可信赖AI系统的关键需求。
Summary / 总结
The research aims to evaluate the faithfulness of LLM self-explanations by aligning internal representations. The method involves identifying key concepts in the explanations and testing if these concepts influence the model's predictions. Key findings include the framework's versatility across different tasks and the development of a linear faithfulness probe to detect and improve unfaithful self-explanations.
研究旨在通过内部表示的对齐来评估大型语言模型自我解释的忠实性,方法是识别解释中的关键概念并测试这些概念是否影响模型的预测。主要发现包括该框架在不同任务中的灵活性以及开发了一种线性忠实性探针来检测和改进不忠实的自我解释。
Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation
Authors: Saurabh Jha, Rohan Arora, Bhavya, Noah Zheutlin, Paulina Toro Isaza, Laura Shwartz, Yu Deng, Daby Sow, Ruchi Mahindru, Ruchir Puri
First: 2026-01-25T17:27:19+00:00 · Latest: 2026-01-29T18:18:39+00:00
Abstract
LLM agents excel when environments are mostly static and the needed information fits in a model's context window, but they often fail in open-ended investigations where explanations must be constructed by iteratively mining evidence from massive, heterogeneous operational data. These investigations exhibit hidden dependency structure: entities interact, signals co-vary, and the importance of a fact may only become clear after other evidence is discovered. Because the context window is bounded, agents must summarize intermediate findings before their significance is known, increasing the risk of discarding key evidence. ReAct-style agents are especially brittle in this regime. Their retrieve-summarize-reason loop makes conclusions sensitive to exploration order and introduces run-to-run non-determinism, producing a reliability gap where Pass-at-k may be high but Majority-at-k remains low. Simply sampling more rollouts or generating longer reasoning traces does not reliably stabilize results, since hypotheses cannot be autonomously checked as new evidence arrives and there is no explicit mechanism for belief bookkeeping and revision. In addition, ReAct entangles semantic reasoning with controller duties such as tool orchestration and state tracking, so execution errors and plan drift degrade reasoning while consuming scarce context.
We address these issues by formulating investigation as abductive reasoning over a dependency graph and proposing EoG (Explanations over Graphs), a disaggregated framework in which an LLM performs bounded local evidence mining and labeling (cause vs symptom) while a deterministic controller manages traversal, state, and belief propagation to compute a minimal explanatory frontier. On a representative ITBench diagnostics task, EoG improves both accuracy and run-to-run consistency over ReAct baselines, including a 7x average gain in Majority-at-k entity F1.
中文标题/摘要
标题:局部思考,全局解释:通过局部推理和信念传播的图引导LLM调查
当环境基本稳定且所需信息可以容纳在模型的上下文窗口中时,LLM代理表现出色,但在需要迭代挖掘大量异构操作数据以构建解释的开放性调查中,它们经常失败。这些调查具有隐藏的依赖结构:实体相互作用,信号共变,事实的重要性可能在发现其他证据后才变得清晰。由于上下文窗口是有限的,代理必须在已知其重要性之前总结中间发现,增加了丢弃关键证据的风险。ReAct风格的代理在这种情况下尤其脆弱。它们的检索-总结-推理循环使得结论对探索顺序敏感,并引入了运行间非确定性,导致可靠性差距,即使Pass-at-k可能很高,但Majority-at-k仍然较低。简单地增加采样次数或生成更长的推理轨迹并不能可靠地稳定结果,因为随着新证据的到来,假设无法自主验证,也没有明确的机制来记录和修订信念。此外,ReAct将语义推理与控制器职责(如工具编排和状态跟踪)纠缠在一起,执行错误和计划漂移会降低推理能力,消耗稀缺的上下文。
我们将调查形式化为依赖图上的溯因推理,并提出 EoG(图上的解释)这一解耦式框架来解决这些问题:在该框架中,LLM执行有界的局部证据挖掘和标记(原因 vs 症状),而确定性控制器管理遍历、状态和信念传播,以计算最小的解释前沿。在代表性的ITBench诊断任务中,EoG在准确性和运行间一致性方面均优于ReAct基线,其中Majority-at-k实体F1平均提高了7倍。
Summary / 总结
The paper addresses the limitations of LLM agents in open-ended investigations where explanations must be constructed iteratively from massive, heterogeneous data. It proposes EoG (Explanations over Graphs), a framework that disaggregates semantic reasoning from controller duties, using local reasoning and belief propagation. On a diagnostics task, EoG improves both accuracy and run-to-run consistency, achieving a 7x average gain in Majority-at-k entity F1 over ReAct baselines.
论文通过提出EoG(基于图的解释)框架解决了LLM在开放性调查中的局限性,该框架将推理过程分解。EoG 使用依赖图来引导局部推理和信念传播,使LLM专注于挖掘和标记证据,而确定性的控制器则管理遍历、状态和信念传播。在ITBench诊断任务上,EoG在准确性和运行间一致性方面显著优于ReAct基线,平均提高了7倍的Majority-at-k实体F1得分。
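The reliability gap described in the abstract (Pass-at-k may be high while Majority-at-k stays low) can be made concrete with a small sketch. The definitions below are simplifying assumptions: each run returns one answer string, Pass-at-k asks whether any of the k runs is correct, and Majority-at-k asks whether the most frequent answer is correct. The paper reports entity F1 at Majority-at-k, which this exact-match sketch does not compute, and the run outputs are made up.

```python
from collections import Counter

def pass_at_k(answers, truth):
    """True if at least one of the k runs produced the correct answer."""
    return any(answer == truth for answer in answers)

def majority_at_k(answers, truth):
    """True if the most frequent answer across the k runs is correct.

    Ties are broken by first occurrence, following Counter.most_common.
    """
    top_answer, _ = Counter(answers).most_common(1)[0]
    return top_answer == truth

# Hypothetical root-cause answers from five runs on the same incident.
runs = ["db", "cache", "cache", "network", "cache"]
found_once = pass_at_k(runs, "db")         # one run got it right
stable_answer = majority_at_k(runs, "db")  # but the consensus is "cache"
```

Here a single lucky run satisfies Pass-at-k even though the agent's consensus answer is wrong, which is the kind of run-to-run inconsistency the abstract attributes to ReAct-style agents.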
Where Do the Joules Go? Diagnosing Inference Energy Consumption
Authors: Jae-Won Chung, Ruofan Wu, Jeff J. Ma, Mosharaf Chowdhury
First: 2026-01-29T18:16:45+00:00 · Latest: 2026-01-29T18:16:45+00:00
Comments: The ML.ENERGY Leaderboard v3.0 is open https://ml.energy/leaderboard
Abstract
Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25$\times$ energy differences, video generation sometimes consumes more than 100$\times$ the energy of images, and GPU utilization differences can result in 3--5$\times$ energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.
中文标题/摘要
标题:焦耳去哪儿了?诊断推理能耗
能源现在是关键的ML计算资源。虽然测量能耗和观察趋势是一个有价值的初步步骤,但准确理解并诊断这些差异发生的原因对于优化至关重要。为此,我们首先通过在NVIDIA H100和B200 GPU上对46个模型、7个任务和1,858种不同配置进行大规模测量研究,展示了生成AI领域的推理时间和能耗情况。我们的实证发现涵盖了数量级的差异:LLM任务类型可能导致25倍的能耗差异,视频生成有时消耗的能量超过图像的100倍,而GPU利用率差异可能导致3到5倍的能耗差异。基于我们的观察,我们提出了一种关于时间与能耗消耗背后机制的推理框架。核心在于时间与能耗由潜在指标如内存和利用率决定,而这些指标又受到算法、软件和硬件各层因素的影响。我们的框架还直接扩展到每瓦吞吐量,这是受电力限制的数据中心的关键指标。
Summary / 总结
This study investigates inference time and energy consumption in generative AI, measuring 46 models across 7 tasks and 1,858 configurations on NVIDIA H100 and B200 GPUs. It finds order-of-magnitude variations: LLM task type can lead to 25x energy differences, video generation sometimes consumes more than 100x the energy of image generation, and GPU utilization differences can result in 3-5x energy differences. The authors propose a framework for reasoning about these differences, in which time and energy are determined by latent metrics such as memory and utilization, which are in turn affected by factors across the algorithm, software, and hardware layers. The framework also extends to throughput per watt, a critical metric for power-constrained datacenters.
研究探讨了生成AI模型推理过程中的时间和能耗问题,测量了46个模型、7个任务和1,858种配置在NVIDIA H100和B200 GPU上的表现。研究发现能耗存在数量级的差异:不同LLM任务类型之间的能耗可相差25倍,视频生成的能耗有时超过图像生成的100倍,GPU利用率差异可带来3到5倍的能耗差异。作者提出了一种框架来理解影响时间和能耗的底层机制:时间和能耗由内存和利用率等潜在指标决定,而这些指标又受算法、软件和硬件各层因素的影响。该框架还适用于每瓦吞吐量,这是受限于电力的数据中心的关键指标。
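The relationship between time, energy, and throughput per watt mentioned in the abstract follows from basic definitions, sketched below. The function names are hypothetical and the numbers are made-up illustrative values, not measurements from the study. Energy is average power multiplied by time, and throughput per watt (tokens per second per watt) is numerically the same as tokens per joule.

```python
def energy_joules(avg_power_watts, seconds):
    """Energy in joules = average power (W) x duration (s)."""
    return avg_power_watts * seconds

def throughput_per_watt(tokens, seconds, avg_power_watts):
    """Tokens per second per watt, which equals tokens per joule."""
    return (tokens / seconds) / avg_power_watts

# Made-up example: 1000 generated tokens in 4 s at an average draw of 500 W.
energy = energy_joules(500, 4)              # joules used by the request
tpw = throughput_per_watt(1000, 4, 500)     # tokens/s per watt
tokens_per_joule = 1000 / energy            # same quantity, computed from energy
```

Because the two quantities coincide, anything that lowers energy per token (for example, higher GPU utilization at similar power) raises throughput per watt by the same factor.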
PRISM: A Framework Harnessing Unsupervised Visual Representations and Textual Prompts for Explainable MACE Survival Prediction from Cardiac Cine MRI
Authors: Haoyang Su, Jin-Yi Xiang, Shaohao Rui, Yifan Gao, Xingyu Chen, Tingxuan Yin, Shaoting Zhang, Xiaosong Wang, Lian-Ming Wu
First: 2025-08-26T17:23:43+00:00 · Latest: 2026-01-29T18:11:17+00:00
Abstract
Accurate prediction of major adverse cardiac events (MACE) remains a central challenge in cardiovascular prognosis. We present PRISM (Prompt-guided Representation Integration for Survival Modeling), a self-supervised framework that integrates visual representations from non-contrast cardiac cine magnetic resonance imaging with structured electronic health records (EHRs) for survival analysis. PRISM extracts temporally synchronized imaging features through motion-aware multi-view distillation and modulates them using medically informed textual prompts to enable fine-grained risk prediction. Across four independent clinical cohorts, PRISM consistently surpasses classical survival prediction models and state-of-the-art (SOTA) deep learning baselines under internal and external validation. Further clinical findings demonstrate that the combined imaging and EHR representations derived from PRISM provide valuable insights into cardiac risk across diverse cohorts. Three distinct imaging signatures associated with elevated MACE risk are uncovered, including lateral wall dyssynchrony, inferior wall hypersensitivity, and anterior elevated focus during diastole. Prompt-guided attribution further identifies hypertension, diabetes, and smoking as dominant contributors among clinical and physiological EHR factors.
中文标题/摘要
标题:PRISM:利用无监督视觉表示和文本提示、基于心脏电影MRI进行可解释MACE生存预测的框架
准确预测主要不良心脏事件(MACE)仍然是心血管预后中的一个核心挑战。我们提出了PRISM(提示引导的表示集成生存建模),这是一种自监督框架,将非对比心脏cine磁共振成像的视觉表示与结构化的电子健康记录(EHRs)结合用于生存分析。PRISM通过运动感知的多视图蒸馏提取时间同步的影像特征,并使用医学上知情的文本提示对其进行调节,以实现细粒度的风险预测。在四个独立的临床队列中,PRISM在内部和外部验证中始终优于经典的生存预测模型和最先进的(SOTA)深度学习基线。进一步的临床发现表明,PRISM提取的影像和EHR表示为不同队列中的心脏风险提供了有价值的见解。发现了三种与MACE风险增加相关的影像特征,包括侧壁异步、下壁高敏感性和舒张期前壁升高焦点。提示引导的归因进一步确定了高血压、糖尿病和吸烟是临床和生理EHR因素中的主要贡献者。
Summary / 总结
PRISM is a self-supervised framework that integrates visual representations from cardiac cine MRI with EHRs for MACE survival prediction. It uses motion-aware multi-view distillation and medically informed textual prompts to extract synchronized imaging features and modulate them for fine-grained risk prediction. PRISM outperforms classical and SOTA models in four clinical cohorts, providing valuable insights into cardiac risk and identifying imaging signatures and EHR factors associated with MACE risk.
PRISM 是一个自监督框架,结合心脏 cine MRI 的视觉表示和 EHRs 进行 MACE 预测。它使用运动感知的多视图蒸馏和文本提示来提取和调节影像特征,相比传统和深度学习模型表现出更优的性能。关键发现包括与 MACE 风险相关的三种影像特征,以及高血压、糖尿病和吸烟作为主要临床因素对心脏风险的贡献。