COMIC: Agentic Sketch Comedy Generation
Authors: Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz
First: 2026-03-11T17:59:59+00:00 · Latest: 2026-03-11T17:59:59+00:00
Comments: Project page: https://susunghong.github.io/COMIC/
Abstract
We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.
中文标题/摘要
标题:COMIC: 代理素描喜剧生成
我们提出了一种全自动的人工智能系统,能够生成类似于周六夜现场等素描节目的短喜剧视频。该系统从角色参考开始,采用了一群基于真实制作工作室角色的代理人群,通过迭代竞争、评估和改进来优化质量和输出的多样性。一个关键贡献是通过分析YouTube上的喜剧视频语料库,引入了与真实观众偏好对齐的LLM评论家,以自动评估幽默感。我们的实验表明,该框架生成的结果接近专业制作素描的质量,同时在视频生成方面表现出最先进的性能。
Summary / 总结
The research presents a fully automated AI system that generates short comedic videos in the style of sketch shows such as Saturday Night Live. Starting from character references, it uses a population of agents loosely modeled on production studio roles, which iteratively compete, evaluate, and improve to optimize the quality and diversity of ideas and outputs. Humor is evaluated automatically by LLM critics aligned with real viewer preferences through analysis of a corpus of YouTube comedy videos. Experiments show that the system produces results approaching the quality of professionally produced sketches and demonstrates state-of-the-art performance in video generation.
研究旨在开发一个能够生成类似于素描喜剧节目的短喜剧视频的AI系统。该系统使用代表真实制作角色的AI代理群体,通过迭代竞争和改进来优化创意和输出的质量与多样性。系统还包括通过分析YouTube喜剧视频语料库而与真实观众偏好对齐的LLM评论家,以评估幽默感。实验表明,该系统生成的结果接近专业制作的素描喜剧的质量,并在视频生成方面表现出最先进的性能。
Differential Privacy in Machine Learning: A Survey from Symbolic AI to LLMs
Authors: Francisco Aguilera-Martínez, Fernando Berzal
First: 2025-06-13T11:30:35+00:00 · Latest: 2026-03-11T17:59:42+00:00
Abstract
Machine learning models should not reveal particular information that is not otherwise accessible. Differential privacy provides a formal framework to mitigate privacy risks by ensuring that the inclusion or exclusion of any single data point does not significantly alter the output of an algorithm, thus limiting the exposure of private information. This survey reviews the foundational definitions of differential privacy and traces their evolution through key theoretical and applied contributions. It then provides an in-depth examination of how DP has been integrated into machine learning models, analyzing existing proposals and methods to preserve privacy when training ML models. Finally, it describes how DP-based ML techniques can be evaluated in practice. By offering a comprehensive overview of differential privacy in machine learning, this work aims to contribute to the ongoing development of secure and responsible AI systems.
中文标题/摘要
标题:机器学习中的差分隐私:从符号人工智能到大语言模型的综述
机器学习模型不应泄露未公开的特定信息。差分隐私提供了一种正式框架,通过确保任何单个数据点的包含或排除不会显著改变算法的输出,从而限制私人信息的暴露,以减轻隐私风险。本文综述了差分隐私的基础定义及其通过关键理论和应用贡献的发展演变。然后,本文深入探讨了差分隐私如何被集成到机器学习模型中,分析了现有保护隐私的方法和方案,以在训练机器学习模型时保持隐私。最后,本文描述了如何在实践中评估基于差分隐私的机器学习技术。通过提供差分隐私在机器学习中的全面概述,本文旨在为安全和负责任的人工智能系统的持续发展做出贡献。
Summary / 总结
This survey reviews differential privacy in machine learning, starting from the foundational definitions and tracing their evolution through key theoretical and applied contributions. It examines how differential privacy has been integrated into ML models, analyzes existing methods for preserving privacy during training, and describes how DP-based ML techniques can be evaluated in practice. The work aims to contribute to the development of secure and responsible AI systems.
研究旨在通过使用差分隐私确保机器学习模型不泄露敏感信息,差分隐私限制了单个数据点对算法输出的影响。研究回顾了差分隐私从基础定义到在机器学习模型中的集成演变,分析了各种保护隐私的训练方法。关键发现包括在实际环境中评估基于差分隐私的机器学习技术,为安全的人工智能系统的发展做出贡献。
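To make the guarantee described in the differential privacy survey concrete: a mechanism M is ε-differentially private if, for any two datasets D and D' differing in one record and any output set S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S]. The sketch below illustrates this with the classic Laplace mechanism on a counting query. It is a toy example, not taken from the survey; the function names are hypothetical, and floating-point samplers like this one are not considered safe for production use.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d.
    exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-DP via the Laplace mechanism.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Toy data (hypothetical): how many people are 40 or older?
rng = random.Random(0)
ages = [23, 35, 41, 52, 67]
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0, rng=rng)
```

A smaller ε means a larger noise scale and therefore stronger privacy at the cost of accuracy.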
V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
Authors: Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan
First: 2026-03-11T17:59:40+00:00 · Latest: 2026-03-11T17:59:40+00:00
Comments: Project page: https://genjib.github.io/v2m_zero/
Abstract
Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
中文标题/摘要
标题:V2M-Zero:零配对时间对齐视频到音乐生成
现有文本到音乐模型在生成与视频事件时间对齐的音乐方面面临挑战,缺乏精细的时间控制。我们提出了V2M-Zero,一种零配对视频到音乐生成方法,输出与视频时间对齐的音乐。我们的方法受到一个关键观察的启发:时间同步需要匹配何时以及发生了多少变化,而不是发生了什么变化。尽管音乐事件和视觉事件在语义上不同,但它们在每个模态内表现出共享的时间结构。我们通过使用预训练的音乐和视频编码器计算的内模态相似性事件曲线来捕捉这种结构。通过独立测量每个模态内的时间变化,这些曲线提供了跨模态的可比表示。这使得一种简单的训练策略成为可能:在音乐事件曲线上微调文本到音乐模型,然后在推理时替换视频事件曲线,无需跨模态训练或配对数据。在OES-Pub、MovieGenBench-Music和AIST++上,V2M-Zero在音频质量、语义对齐、时间同步和舞蹈视频的节拍对齐方面分别比配对数据基线高出5-21%、13-15%、21-52%和28%。通过大规模众包听觉测试,我们发现了类似的结果。总体而言,我们的结果验证了通过模态内特征进行时间对齐,而不是跨模态配对监督,对于视频到音乐生成是有效的。结果可在https://genjib.github.io/v2m_zero/获取。
Summary / 总结
V2M-Zero is a zero-pair video-to-music generation approach that outputs time-aligned music for video. It captures shared temporal structure through event curves computed from intra-modal similarity using pretrained encoders. V2M-Zero achieves substantial gains over paired-data baselines, with 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. A large crowd-source subjective listening test also supports these findings, validating the effectiveness of temporal alignment through within-modality features for video-to-music generation.
V2M-Zero 是一种零配对视频到音乐生成方法,能够输出与视频时间对齐的音乐。它通过预训练的音乐和视频编码器计算的内在模态相似性中的事件曲线来捕捉时间结构。V2M-Zero 在配对数据基线之上取得了显著的改进,包括 5-21% 的音频质量提升、13-15% 的语义对齐改善、21-52% 的时间对齐改进以及舞蹈视频中 28% 的节拍对齐提升。大规模众包听觉测试结果也支持这些发现。
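The V2M-Zero abstract describes event curves that measure when and how much change occurs within one modality. The abstract does not give the exact formula, so the following is only a minimal sketch under an assumption: change is scored as one minus the cosine similarity between consecutive embeddings, then min-max normalized. The embeddings here are toy vectors standing in for the outputs of a pretrained encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def event_curve(embeddings):
    """Per-step change score in [0, 1] (assumed definition, see lead-in).

    Entry t is high when frame t+1 differs strongly from frame t.
    """
    raw = [1.0 - cosine(embeddings[t - 1], embeddings[t])
           for t in range(1, len(embeddings))]
    if not raw:
        return []
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0] * len(raw)
    return [(r - lo) / (hi - lo) for r in raw]

# Toy sequence with one abrupt change between index 2 and index 3.
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
curve = event_curve(frames)
```

Because the curve only records the timing and magnitude of change, the same function can be applied to music embeddings or video embeddings, which is what makes the two curves comparable.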
DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving
Authors: Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan
First: 2026-03-11T17:59:31+00:00 · Latest: 2026-03-11T17:59:31+00:00
Comments: 18 pages, 10 figures
Abstract
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
中文标题/摘要
标题:DynVLA:自主驾驶中行动推理的世界动力学学习
我们提出了DynVLA,一种驾驶VLA模型,引入了一种新的CoT范式,称为动力学CoT。DynVLA在行动生成之前预测紧凑的动力学,使决策更加明智且物理上合理。为了获得紧凑的动力学表示,DynVLA引入了动力学分词器,将未来演变压缩为少量的动力学令牌。考虑到交互密集型驾驶场景中的丰富环境动力学,DynVLA解耦了以自我为中心和环境为中心的动力学,从而更准确地建模世界动力学。然后,我们通过SFT和RFT训练DynVLA在行动之前生成动力学令牌,提高决策质量同时保持低延迟的推理。与缺乏精细时空理解的文本CoT相比,以及由于密集图像预测而引入大量冗余的视觉CoT,动力学CoT以紧凑、可解释和高效的形式捕捉世界演变。在NAVSIM、Bench2Drive和一个大规模的内部数据集上的广泛实验表明,DynVLA在Textual CoT和Visual CoT方法上始终表现出色,验证了动力学CoT的有效性和实际价值。
Summary / 总结
DynVLA is a driving VLA model that introduces Dynamics CoT, a new CoT paradigm for action reasoning in autonomous driving. It forecasts compact world dynamics before action generation, using a Dynamics Tokenizer to compress future evolution into a small set of dynamics tokens. Experiments show that DynVLA outperforms Textual CoT and Visual CoT methods on various datasets, demonstrating the effectiveness of Dynamics CoT in capturing compact, interpretable, and efficient world dynamics.
DynVLA 是一种驾驶 VLA 模型,引入了名为 Dynamics CoT 的新 CoT 范式,用于自动驾驶中的行动推理。它在行动生成前预测紧凑的世界动态,使用 Dynamics Tokenizer 将未来演变压缩成少量的令牌。DynVLA 解耦了以自我为中心和环境为中心的动力学,以实现更准确的动力学建模,并通过 SFT 和 RFT 训练来提高决策质量同时保持低延迟的推理。实验表明,DynVLA 在各种数据集上优于 Textual CoT 和 Visual CoT 方法,验证了 Dynamics CoT 的有效性。
Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge
Authors: Mingyang Song, Mao Zheng, Chenning Xu
First: 2026-03-11T17:50:38+00:00 · Latest: 2026-03-11T17:50:38+00:00
Abstract
The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. First, we demonstrate that this consensus is frequently illusory. We identify and formalize Evaluation Illusion, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $\rho = 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs paradoxically receive the least consistent evaluations. Second, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement increases in codified domains (Education +22%, Academic +27%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.
中文标题/摘要
标题:超越共识的幻象:从表面启发式到基于知识的评估在LLM作为法官中的应用
LLM作为法官的范式依赖于一个关键假设,即高评价者间的一致性表明了可靠和客观的评估。我们提出了两项互补的发现来挑战这一假设。首先,我们证明这种一致性经常是幻象。我们识别并形式化了评估幻象这一现象,即LLM法官生成了复杂的批评,但评分却基于共享的表面启发式而非实质质量。通过一项涉及105,600次评估实例(32个LLM × 3个前沿法官 × 100个任务 × 11种温度)的大规模研究,我们展示了模型层面的一致性(斯皮尔曼ρ=0.99)掩盖了脆弱的样本层面一致性(皮尔逊r̄=0.72;绝对一致性ICC=0.67),共享评分标准结构恢复了62%的总一致性,并且高质量的输出反而收到了最不一致的评价。其次,我们证明了动态生成基于领域知识的评估标准能产生更有意义的评估。我们引入了MERG(元认知增强评分标准生成),这是一种知识驱动的评分标准生成框架,其领域选择性效果证实了这一点。在编码领域(教育+22%,学术+27%)中,一致性增加,因为知识使评估者基于共享标准,而在主观领域中,真正的评估多元性出现,一致性减少。这些发现表明,评分标准应动态地与专家知识相结合,而不是依赖于通用标准,这对RLAIF中的奖励建模具有重要意义。
Summary / 总结
The study challenges the assumption that high inter-evaluator agreement in LLM-as-a-judge systems indicates reliable evaluation. It finds that consensus is often illusory due to LLMs relying on shared surface heuristics rather than substantive quality. Through a large-scale study, the researchers show that model-level agreement masks fragile sample-level agreement, and that high-quality outputs receive the least consistent evaluations. The study also demonstrates that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessments, with agreement increasing in codified domains and decreasing in subjective ones.
研究挑战了LLM-as-a-judge系统中高评价者一致性意味着可靠评价的假设。研究识别了'评价幻象'现象,即LLM生成复杂的批评但基于共享的表面特征评分。通过大规模研究,研究人员发现模型级一致性掩盖了样本级的脆弱一致性,并且高质量输出反而获得最不一致的评价。研究还引入了MERG(元认知增强评分表生成框架),该框架显示,在知识锚定评价者共享标准的编码领域中,一致性增加,而在主观领域中,真正的评价多元性出现,一致性减少。这表明评价评分表应该丰富专家知识,而不是依赖通用标准。
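The gap between model-level and sample-level agreement reported in the LLM-as-a-judge study can be reproduced on toy data. The sketch below uses made-up scores from two hypothetical judges (not data from the paper): both judges rank the three models identically by mean score, yet their per-sample scores correlate only moderately.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)

def ranks(values):
    """1-based ranks, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores: three models, four samples each, two judges.
judge_a = {"m1": [2, 4, 3, 5], "m2": [5, 7, 6, 8], "m3": [8, 9, 7, 10]}
judge_b = {"m1": [4, 2, 5, 3], "m2": [7, 5, 8, 6], "m3": [9, 8, 10, 7]}
models = sorted(judge_a)

# Model level: correlate per-model mean scores.
means_a = [sum(judge_a[m]) / len(judge_a[m]) for m in models]
means_b = [sum(judge_b[m]) / len(judge_b[m]) for m in models]
model_rho = spearman(means_a, means_b)

# Sample level: correlate every individual score pair.
flat_a = [s for m in models for s in judge_a[m]]
flat_b = [s for m in models for s in judge_b[m]]
sample_r = pearson(flat_a, flat_b)
```

On this toy data the model-level Spearman correlation is perfect while the sample-level Pearson correlation is roughly 0.6, which is the kind of masking effect the abstract describes.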
Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style
Authors: Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown
First: 2026-03-11T17:49:45+00:00 · Latest: 2026-03-11T17:49:45+00:00
Comments: 12 pages, 12 figures
Abstract
VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs' ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might "understand" a concept in more formal terms, such as dark/light contrasts.
中文标题/摘要
标题:AI能否像艺术史家一样看画?视觉语言模型如何识别艺术风格的解读
视觉语言模型(VLMs)在一系列计算机视觉任务中变得越来越熟练,包括视觉问答和物体检测。这包括在艺术领域的强大能力,从分析艺术品到生成艺术。在计算机科学家与艺术史家的跨学科合作中,我们描述了VLMs预测艺术风格的机制,并评估了它们与艺术史家用来推理艺术风格的标准的一致性程度。我们采用潜在空间分解方法来识别驱动艺术风格预测的概念,并进行了定量评估、因果分析和艺术史家的评估。我们的研究发现,73%提取的概念被认为由艺术史家判断具有连贯且语义上有意义的视觉特征,90%用于预测特定艺术品风格的概念被认为相关。在使用无关概念成功预测风格的情况下,艺术史家指出了可能的原因;例如,模型可能“理解”概念在更形式化的层面,如明暗对比。
Leech Lattice Vector Quantization for Efficient LLM Compression
Authors: Tycho F. A. van der Ouderaa, Mart van Baalen, Paul Whatmough, Markus Nagel
First: 2026-03-11T17:48:45+00:00 · Latest: 2026-03-11T17:48:45+00:00
Abstract
Scalar quantization of large language models (LLMs) is fundamentally limited by information-theoretic bounds. While vector quantization (VQ) overcomes these limits by encoding blocks of parameters jointly, practical implementations must avoid the need for expensive lookup mechanisms or other explicit codebook storage. Lattice approaches address this through highly structured and dense packing. This paper explores the Leech lattice, which, with its optimal sphere packing and kissing configurations in 24 dimensions, is the highest-dimensional lattice known with such optimal properties. To make the Leech lattice usable for LLM quantization, we extend an existing search algorithm based on the extended Golay code construction to i) support indexing, enabling conversion to and from bitstrings without materializing the codebook, ii) allow angular search over a union of Leech lattice shells, and iii) provide a fully parallelisable dequantization kernel. Together, these yield a practical algorithm, namely Leech Lattice Vector Quantization (LLVQ). LLVQ delivers state-of-the-art LLM quantization performance, outperforming recent methods such as QuIP#, QTIP, and PVQ. These results highlight the importance of high-dimensional lattices for scalable, theoretically grounded model compression.
中文标题/摘要
标题:Leech 晶格向量量化用于高效的大语言模型压缩
大语言模型(LLMs)的标量量化受到信息论界限的限制。虽然向量量化(VQ)通过联合编码参数块来克服这些限制,但实际实现必须避免昂贵的查找机制或其他显式码本存储的需求。晶格方法通过高度结构化和密集的堆积来解决这一问题。本文探讨了 Leech 晶格,它在 24 维空间中具有最优的球体填充和接吻数配置,是已知具有此类最优性质的最高维晶格。为了使 Leech 晶格适用于 LLM 量化,我们扩展了一个基于扩展戈莱码构造的现有搜索算法,以 i) 支持索引,使无需生成码本即可与比特串相互转换,ii) 允许在 Leech 晶格壳层的并集上进行角度搜索,iii) 提供完全可并行化的去量化内核。这些共同构成了一个实用的算法,即 Leech 晶格向量量化(LLVQ)。LLVQ 在大语言模型量化性能方面达到了最先进的水平,优于最近的方法如 QuIP#、QTIP 和 PVQ。这些结果突显了高维晶格在可扩展且有理论依据的模型压缩中的重要性。
Summary / 总结
This paper addresses the limitations of scalar quantization for large language models (LLMs) by proposing Leech Lattice Vector Quantization (LLVQ), which uses the Leech lattice to encode parameter blocks jointly. The method extends an existing search algorithm to support indexing and angular search, and introduces a fully parallelisable dequantization kernel. LLVQ outperforms recent methods such as QuIP#, QTIP, and PVQ, demonstrating the effectiveness of high-dimensional lattices in LLM compression.
该论文通过提出使用Leech晶格的Leech Lattice Vector Quantization (LLVQ)方法来解决大型语言模型(LLMs)标量量化的局限性问题。该方法扩展了现有的搜索算法,支持索引和角度搜索,并提出了一种完全可并行化的去量化内核。实验结果表明,LLVQ在LLM压缩性能上优于QuIP#、QTIP和PVQ等最近的方法,展示了高维晶格在模型压缩中的有效性。
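The LLVQ abstract notes that lattice quantizers avoid storing an explicit codebook. The Leech lattice search itself is too involved for a short example, so the sketch below swaps in a much simpler lattice, D_n (integer vectors whose coordinates sum to an even number), and uses the well-known rounding rule for finding its nearest point. It is an illustration of codebook-free lattice quantization in general, not the authors' algorithm.

```python
def nearest_dn(x):
    """Nearest point of the D_n lattice to the real vector x.

    Round every coordinate to the nearest integer. If the coordinate
    sum is odd, re-round the coordinate with the largest rounding
    error in the opposite direction, which restores even parity at
    the smallest possible extra distance.
    """
    rounded = [round(v) for v in x]
    if sum(rounded) % 2 == 0:
        return rounded
    # Index of the coordinate that was rounded least confidently.
    k = max(range(len(x)), key=lambda i: abs(x[i] - rounded[i]))
    rounded[k] += 1 if x[k] >= rounded[k] else -1
    return rounded

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

# Toy weight block (hypothetical values).
block = [0.6, 0.2, -0.1, 0.1]
quantized = nearest_dn(block)
```

No table of codewords is ever built: the nearest lattice point is computed directly from the input, which is the property that makes lattice quantizers attractive at high dimension.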
SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking
Authors: Muhammad Saif Ullah Khan, Didier Stricker
First: 2026-02-24T11:31:20+00:00 · Latest: 2026-03-11T17:44:41+00:00
Comments: Camera-ready version
Abstract
Modeling spinal motion is fundamental to understanding human biomechanics, yet remains underexplored in computer vision due to the spine's complex multi-joint kinematics and the lack of large-scale 3D annotations. We present a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. Using this framework, we create the first open dataset, named SIMSPINE, which provides sparse vertebra-level 3D spinal annotations for natural full-body motions in indoor multi-camera capture without external restraints. With 2.14 million frames, this enables data-driven learning of vertebral kinematics from subtle posture variations and bridges the gap between musculoskeletal simulation and computer vision. In addition, we release pretrained baselines covering fine-tuned 2D detectors, monocular 3D pose lifting models, and multi-view reconstruction pipelines, establishing a unified benchmark for biomechanically valid spine motion estimation. Specifically, our 2D spine baselines improve the state-of-the-art from 0.63 to 0.80 AUC in controlled environments, and from 0.91 to 0.93 AP for in-the-wild spine tracking. Together, the simulation framework and SIMSPINE dataset advance research in vision-based biomechanics, motion analysis, and digital human modeling by enabling reproducible, anatomically grounded 3D spine estimation under natural conditions.
中文标题/摘要
标题:SIMSPINE:一种生物力学感知的3D脊柱运动注释和基准框架
脊柱运动建模是理解人类生物力学的基础,但由于脊柱复杂的多关节运动学以及缺乏大规模3D注释,这一领域在计算机视觉中的研究仍然不足。我们提出了一种生物力学感知的关键点模拟框架,该框架利用肌肉骨骼建模从解剖学上一致地为现有的人体姿态数据集添加3D脊柱关键点。利用该框架,我们创建了首个开放数据集SIMSPINE,该数据集提供了自然全身运动在室内多摄像头捕捉下的稀疏椎体级别3D脊柱注释,无需外部约束。该数据集包含214万帧,使得可以从细微姿态变化中学习椎体运动学,并填补了肌肉骨骼模拟与计算机视觉之间的差距。此外,我们还发布了预训练基准模型,包括微调的2D检测器、单目3D姿态提升模型以及多视图重建流水线,为生物力学有效的脊柱运动估计建立了统一基准。具体而言,我们的2D脊柱基准模型在受控环境中将最先进的AUC从0.63提高到0.80,在野外脊柱跟踪中将AP从0.91提高到0.93。通过该模拟框架和SIMSPINE数据集,我们推进了基于视觉的生物力学、运动分析和数字人体建模研究,使其能够在自然条件下实现可重复的、解剖学基础的3D脊柱估计。
Summary / 总结
The work addresses the lack of large-scale 3D spine annotations in computer vision. It introduces a biomechanics-aware keypoint simulation framework that augments existing human pose datasets with anatomically consistent 3D spinal keypoints derived from musculoskeletal modeling. The resulting SIMSPINE dataset provides 2.14 million frames of sparse vertebra-level 3D spinal annotations for natural full-body motions. The released pretrained 2D spine baselines improve the state of the art from 0.63 to 0.80 AUC in controlled environments and from 0.91 to 0.93 AP for in-the-wild spine tracking.
研究旨在通过3D脊柱运动标注来增强对人体生物力学的理解,解决计算机视觉中缺乏此类数据的问题。关键方法是开发一种生物力学感知的模拟框架,将解剖上一致的3D脊椎关键点添加到现有的人体姿态数据集中。主要成果包括创建SIMSPINE数据集,包含214万帧的椎体级3D脊椎标注;发布的2D脊柱基线在受控环境中将AUC从0.63提高到0.80,在野外脊柱跟踪中将AP从0.91提高到0.93。
Geometric Scaling of Bayesian Inference in LLMs
Authors: Naman Agarwal, Siddhartha R. Dalal, Vishal Misra
First: 2025-12-27T05:29:55+00:00 · Latest: 2026-03-11T17:34:01+00:00
Comments: fixed buggy references
Abstract
Recent work has shown that small transformers trained in controlled "wind-tunnel" settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate -- low-dimensional value manifolds and progressively orthogonal keys -- that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings.
To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.
中文标题/摘要
标题:大语言模型中贝叶斯推断的几何缩放
近期研究表明,在受控的“风洞”环境中训练的小型Transformer可以实现精确的贝叶斯推断,并且其训练动力学产生了一种几何基底——低维价值流形和逐渐正交的键,这些编码了后验结构。我们研究这种几何特征是否在生产级语言模型中持续存在。在Pythia、Phi-2、Llama-3和Mistral系列中,我们发现最后一层的价值表示沿着一个主要轴组织,其位置与预测熵强烈相关,并且领域限制的提示将这种结构压缩到与合成环境中观察到的相同低维流形中。
为了探究这种几何特征的作用,我们在Pythia-410M的熵对齐轴上进行有针对性的干预,在上下文学习过程中。移除或扰动这条轴会选择性地破坏局部不确定性几何结构,而匹配的随机轴干预则不会影响它。然而,这些单层操作并没有产生与贝叶斯行为成比例的特定降解,表明几何结构是不确定性的一种特权读出,而不是单一的计算瓶颈。综上所述,我们的结果表明,现代语言模型保留了风洞中实现贝叶斯推断的几何基底,并沿着这条基底组织其近似贝叶斯更新。
Summary / 总结
This study investigates whether the geometric signature observed in small transformers trained for Bayesian inference persists in larger production-grade language models. Across the Pythia, Phi-2, Llama-3, and Mistral families, the research finds that last-layer value representations align along a single dominant axis that strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into low-dimensional manifolds. Interventions on this entropy-aligned axis of Pythia-410M during in-context learning selectively disrupt the local uncertainty geometry, while matched random-axis interventions leave it intact. Because these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, the authors conclude that this geometry is a privileged readout of uncertainty rather than a singular computational bottleneck.
研究探讨了在较小的Transformer中观察到的几何特征是否在更大规模的生产级语言模型中仍然存在。研究发现,各种模型如Pythia、Phi-2、Llama-3和Mistral的最后一层值表示沿着与预测熵相关的主导轴排列,并且领域限制的提示将这种结构压缩成低维度的流形。对这种熵对齐轴的干预会选择性地破坏局部不确定性几何结构,表明这种几何结构是不确定性的一种特权读出,而不是计算瓶颈。
A Systematic Study of Pseudo-Relevance Feedback with LLMs
Authors: Nour Jedidi, Jimmy Lin
First: 2026-03-11T17:31:50+00:00 · Latest: 2026-03-11T17:31:50+00:00
Abstract
Pseudo-relevance feedback (PRF) methods built on large language models (LLMs) can be organized along two key design dimensions: the feedback source, which is where the feedback text is derived from, and the feedback model, which is how the given feedback text is used to refine the query representation. However, the independent role that each dimension plays is unclear, as both are often entangled in empirical evaluations. In this paper, we address this gap by systematically studying how the choice of feedback source and feedback model impacts PRF effectiveness through controlled experimentation. Across 13 low-resource BEIR tasks with five LLM PRF methods, our results show: (1) the choice of feedback model can play a critical role in PRF effectiveness; (2) feedback derived solely from LLM-generated text provides the most cost-effective solution; and (3) feedback derived from the corpus is most beneficial when utilizing candidate documents from a strong first-stage retriever. Together, our findings provide a better understanding of which elements in the PRF design space are most important.
中文标题/摘要
标题:大规模语言模型中的伪相关反馈系统研究
基于大规模语言模型(LLMs)的伪相关反馈(PRF)方法可以根据两个关键设计维度进行组织:反馈来源,即反馈文本的来源;以及反馈模型,即如何使用给定的反馈文本来细化查询表示。然而,这两个维度各自独立的作用尚不清楚,因为它们在实证评估中经常交织在一起。在本文中,我们通过受控实验系统地研究了反馈来源和反馈模型的选择如何影响PRF的有效性。在13个低资源BEIR任务和五种LLM PRF方法中,我们的结果显示:(1)反馈模型的选择在PRF的有效性中可能发挥关键作用;(2)仅从LLM生成的文本中提取的反馈提供了最具成本效益的解决方案;(3)当利用强第一阶段检索器的候选文档时,从语料库中提取的反馈最有益。综上所述,我们的发现为理解PRF设计空间中最重要元素提供了更好的理解。
Summary / 总结
This paper investigates how the feedback source and the feedback model each affect pseudo-relevance feedback (PRF) effectiveness with large language models (LLMs). Across 13 low-resource BEIR tasks with five LLM PRF methods, the study finds that the choice of feedback model can play a critical role in PRF effectiveness and that feedback derived solely from LLM-generated text is the most cost-effective option. Corpus-derived feedback is most beneficial when it draws on candidate documents from a strong first-stage retriever. These findings clarify which elements of the PRF design space matter most.
本文研究了反馈来源和反馈模型对基于大型语言模型(LLMs)的伪相关反馈(PRF)效果的影响。在13个低资源BEIR任务中,研究发现反馈模型显著影响PRF效果,LLM生成的文本是最具成本效益的反馈来源。此外,当与强大的第一阶段检索器结合时,来自语料库的反馈最有益。这些发现为优化PRF设计以获得更好的性能提供了见解。
Reinforced Generation of Combinatorial Structures: Ramsey Numbers
Authors: Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta
First: 2026-03-10T04:20:40+00:00 · Latest: 2026-03-11T17:30:52+00:00
Abstract
We present improved lower bounds for five classical Ramsey numbers: $\mathbf{R}(3, 13)$ is increased from $60$ to $61$, $\mathbf{R}(3, 18)$ from $99$ to $100$, $\mathbf{R}(4, 13)$ from $138$ to $139$, $\mathbf{R}(4, 14)$ from $147$ to $148$, and $\mathbf{R}(4, 15)$ from $158$ to $159$. These results were achieved using AlphaEvolve, an LLM-based code mutation agent. Beyond these new results, we successfully recovered lower bounds for all Ramsey numbers known to be exact, and matched the best known lower bounds across many other cases. These include bounds for which previous work does not detail the algorithms used. Virtually all known Ramsey lower bounds are derived computationally, with bespoke search algorithms each delivering a handful of results. AlphaEvolve is a single meta-algorithm yielding search algorithms for all of our results.
Summary / 总结
This study improves lower bounds for five classical Ramsey numbers using AlphaEvolve, an LLM-based code mutation agent. The results include increasing $R(3, 13)$ to 61, $R(3, 18)$ to 100, $R(4, 13)$ to 139, $R(4, 14)$ to 148, and $R(4, 15)$ to 159. AlphaEvolve also successfully recovered lower bounds for all known exact Ramsey numbers and matched the best known lower bounds for many other cases.
研究旨在使用基于LLM的代码变异代理AlphaEvolve改进经典Ramsey数的下界。研究为五个Ramsey数设定了新的下界:$R(3, 13)$为$61$,$R(3, 18)$为$100$,$R(4, 13)$为$139$,$R(4, 14)$为$148$,$R(4, 15)$为$159$。AlphaEvolve还恢复并匹配了许多其他Ramsey数的已知下界,展示了其在计算Ramsey理论中生成搜索算法的有效性。
Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
Authors: Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei
First: 2026-03-10T10:31:58+00:00 · Latest: 2026-03-11T17:27:13+00:00
Comments: accepted by ICLR2026
Abstract
Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more effective average information preservation across diverse scenes. Extensive experiments demonstrate state-of-the-art performance, achieving 96.3% accuracy on LLaVA-1.5 with only 11.1% token retention, and 92.8% accuracy at extreme compression rates (5.6%) on LLaVA-NeXT, outperforming prior methods by 2.5% with 7.8 $\times$ faster prefilling speed compared to the original model. Our framework generalizes across diverse VLMs and both image and video modalities, showcasing strong cross-modal versatility. Code is available at https://github.com/ZhengyaoFang/PruneSID.
中文标题/摘要
标题:去冗存精,协同重要性多样性:VLMs中的视觉标记压缩
视觉语言模型(VLMs)因视觉标记过度生成面临显著的计算效率问题。尽管先前工作表明大量视觉标记是冗余的,但现有压缩方法难以在重要性保存和信息多样性之间取得平衡。为解决这一问题,我们提出了一种名为PruneSID的无训练协同重要性多样性方法,其包含两阶段管道:(1)主语义成分分析(PSCA)用于将标记聚类为语义一致的组,确保全面的概念覆盖;(2)组内非最大抑制(NMS)用于去除冗余标记同时保留每个组内的关键代表性标记。此外,PruneSID还引入了一种基于图像复杂性的信息感知动态压缩比机制,根据图像复杂性优化标记压缩率,从而在多种场景中实现更有效的平均信息保存。大量实验表明,PruneSID在LLaVA-1.5上达到96.3%的准确率,仅保留11.1%的标记,并在LLaVA-NeXT上以5.6%的极端压缩率实现92.8%的准确率,相比先前方法提高了2.5%,且预填充速度比原模型快7.8倍。我们的框架适用于多种VLMs和图像、视频模态,展示了强大的跨模态通用性。代码可在https://github.com/ZhengyaoFang/PruneSID获取。
Summary / 总结
The paper addresses the computational inefficiencies in vision-language models (VLMs) due to redundant visual tokens. It introduces PruneSID, a training-free method that combines Principal Semantic Components Analysis (PSCA) for clustering tokens and Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key ones. This approach also includes an information-aware dynamic compression ratio mechanism. Experiments show that PruneSID achieves high accuracy even at extreme compression rates, outperforming previous methods with faster prefilling speed and broad applicability across different VLMs and modalities.
研究旨在解决由于冗余视觉标记引起的视觉语言模型(VLMs)的计算效率低下问题。PruneSID 是一种无需训练的方法,采用两阶段管道:主语义成分分析(PSCA)用于将标记聚类为语义一致的组,以及组内非最大抑制(NMS)用于去除冗余标记同时保留关键代表标记。此外,它还包括一种基于信息的动态压缩比率机制。实验表明,PruneSID 在LLaVA-1.5 上以 11.1% 的标记保留率实现了 96.3% 的准确性,在LLaVA-NeXT 上以 5.6% 的压缩率实现了 92.8% 的准确性,优于先前的方法,并且预填充速度更快。
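The second PruneSID stage, intra-group non-maximum suppression, can be pictured as a greedy filter inside one token group. The abstract does not specify the importance score or the similarity measure, so the sketch below assumes a given per-token importance score and cosine similarity with a fixed threshold. All names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def intra_group_nms(tokens, scores, threshold):
    """Greedy NMS over one group of token vectors.

    Visit tokens from most to least important and keep a token only
    if it is not too similar (cosine > threshold) to any token that
    was already kept. Returns the kept indices in visiting order.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(cosine(tokens[i], tokens[k]) <= threshold for k in kept):
            kept.append(i)
    return kept

# Toy group: token 1 is a near-duplicate of token 0.
group = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
importance = [0.9, 0.8, 0.5]
kept = intra_group_nms(group, importance, threshold=0.9)
```

Lowering the threshold prunes more aggressively, which is one simple way a compression ratio could be tuned per image.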
Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds
Authors: Naman Agarwal, Siddhartha R. Dalal, Vishal Misra
First: 2025-12-27T05:31:44+00:00 · Latest: 2026-03-11T17:25:35+00:00
Comments: fixed buggy references
Abstract
Transformers empirically perform precise probabilistic reasoning in carefully constructed "Bayesian wind tunnels" and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an advantage-based routing law for attention scores, $\frac{\partial L}{\partial s_{ij}} = \alpha_{ij}\bigl(b_{ij} - \mathbb{E}_{\alpha_i}[b]\bigr)$ with $b_{ij} := u_i^\top v_j$, coupled with a responsibility-weighted update for values, $\Delta v_j = -\eta \sum_i \alpha_{ij} u_i$, where $u_i$ is the upstream gradient at position $i$ and $\alpha_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).
Summary / 总结
The paper investigates how gradient-based learning shapes attention in transformers, providing a first-order analysis of cross-entropy training for a single attention head. Its key results are an advantage-based routing law for attention scores and a responsibility-weighted update for values, which together induce a positive feedback loop in which routing and content specialize together. The authors show that this coupled specialization behaves like a two-timescale EM procedure, and controlled simulations (including a sticky Markov-chain task comparing a closed-form EM-style update to standard SGD) indicate that the same gradient dynamics sculpt the low-dimensional manifolds that implement Bayesian inference. The result is a unified picture in which optimization gives rise to geometry, which in turn supports function.
论文旨在阐明梯度学习如何通过交叉熵训练塑造变压器注意力头的内部几何结构。作者推导出注意力分数的路由定律和值的责任加权更新,表明这些动态形成了一个正反馈循环,使得路由和内容共同专业化。通过模拟,作者证明这些梯度动态不仅最小化了交叉熵,还塑造了在伴侣工作中识别的低维流形,以实现贝叶斯推理,从而提供了一个统一的图景,即优化(梯度流动)、几何(贝叶斯流形)和功能(上下文中的概率推理)之间的相互关系。
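The routing law quoted in the abstract above can be checked numerically for a single query position. The sketch below is an illustration only, not the authors' code: it assumes one query, a squared-error loss standing in for cross-entropy (the law only needs the upstream gradient u = dL/do), and hypothetical helper names.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def head_output(scores, values):
    # o = sum_j alpha_j * v_j for a single query position.
    alpha = softmax(scores)
    dim = len(values[0])
    return [sum(a * v[k] for a, v in zip(alpha, values)) for k in range(dim)]

def loss(scores, values, target):
    # Squared error is an assumption made for this sketch.
    o = head_output(scores, values)
    return 0.5 * sum((ok - tk) ** 2 for ok, tk in zip(o, target))

def routing_gradient(scores, values, target):
    # dL/ds_j = alpha_j * (b_j - E_alpha[b]), with b_j = u^T v_j.
    alpha = softmax(scores)
    o = head_output(scores, values)
    u = [ok - tk for ok, tk in zip(o, target)]  # upstream gradient dL/do
    b = [sum(uk * vk for uk, vk in zip(u, v)) for v in values]
    mean_b = sum(a * bj for a, bj in zip(alpha, b))
    return [a * (bj - mean_b) for a, bj in zip(alpha, b)]

def numeric_gradient(scores, values, target, eps=1e-6):
    # Central finite differences, used only to check the closed form.
    grads = []
    for j in range(len(scores)):
        up = list(scores)
        dn = list(scores)
        up[j] += eps
        dn[j] -= eps
        grads.append((loss(up, values, target) - loss(dn, values, target)) / (2 * eps))
    return grads
```

Because the advantages b_j - E[b] average to zero under the attention weights, the score gradients for one query always sum to zero: raising one score's weight necessarily lowers others.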
Cross-Species Transfer Learning for Electrophysiology-to-Transcriptomics Mapping in Cortical GABAergic Interneurons
Authors: Theo Schwider, Ramin Ramezani
First: 2026-03-11T17:23:54+00:00 · Latest: 2026-03-11T17:23:54+00:00
Abstract
Single-cell electrophysiological recordings provide a powerful window into neuronal functional diversity and offer an interpretable route for linking intrinsic physiology to transcriptomic identity. Here, we replicate and extend the electrophysiology-to-transcriptomics framework introduced by Gouwens et al. (2020) using publicly available Allen Institute Patch-seq datasets from both mouse and human cortex. We focus on GABAergic inhibitory interneurons to target a subclass structure (Lamp5, Pvalb, Sst, Vip) that is comparable and conserved across species. After quality control, we analyzed 3,699 mouse visual cortex neurons and 506 human neocortical neurons from neurosurgical resections. Using standardized electrophysiological features and sparse PCA, we reproduced the major class-level separations reported in the original mouse study. For supervised prediction, a class-balanced random forest provided a strong feature-engineered baseline in mouse data and a reduced but still informative baseline in human data. We then developed an attention-based BiLSTM that operates directly on the structured IPFX feature-family representation, avoiding sPCA and providing feature-family-level interpretability via learned attention weights. Finally, we evaluated a cross-species transfer setting in which the sequence model is pretrained on mouse data and fine-tuned on human data for an aligned 4-class task, improving human macro-F1 relative to a human-only training baseline. Together, these results confirm reproducibility of the Gouwens pipeline in mouse data, demonstrate that sequence models can match feature-engineered baselines, and show that mouse-to-human transfer learning can provide measurable gains for human subclass prediction.
中文标题/摘要
标题:跨物种迁移学习在皮层GABA能抑制性中间神经元电生理学至转录组学映射中的应用
单细胞电生理记录为探索神经元功能多样性提供了一个强大的窗口,并为将内在生理学与转录组身份联系起来提供了一条可解释的途径。在这里,我们使用Allen Institute公开发布的来自小鼠和人类皮层的Patch-seq数据集,复制并扩展了Gouwens等人(2020年)引入的电生理学至转录组学框架。我们专注于GABA能抑制性中间神经元,以针对一个在物种间具有可比性和保守性的亚类结构(Lamp5、Pvalb、Sst、Vip)。经过质量控制后,我们分析了3,699个小鼠视觉皮层神经元和506个人类新皮层神经元(来自神经外科切除)。使用标准化的电生理学特征和稀疏主成分分析,我们在小鼠数据中重现了原始研究中报告的主要类级分离。对于监督预测,平衡的随机森林在小鼠数据中提供了强大的特征工程基线,在人类数据中则提供了减少但仍具有信息性的基线。然后,我们开发了一种基于注意力的双向LSTM,该模型直接作用于结构化的IPFX特征家族表示,避免了主成分分析,并通过学习到的注意力权重提供了特征家族级别的可解释性。最后,我们评估了一种跨物种迁移学习设置,在该设置中,序列模型在小鼠数据上预训练,然后在人类数据上微调以执行对齐的4类任务,相对于仅使用人类数据训练的基线,提高了人类宏F1值。总之,这些结果确认了Gouwens管道在小鼠数据中的可再现性,证明了序列模型可以匹配特征工程基线,并展示了小鼠到人类的迁移学习可以为人类亚类预测提供可测量的增益。
Summary / 总结
This study replicates and extends the electrophysiology-to-transcriptomics framework of Gouwens et al. (2020) using Allen Institute Patch-seq datasets from mouse and human cortex, focusing on GABAergic interneurons. The researchers analyzed 3,699 mouse and 506 human neurons, reproduced the major class-level separations in mouse data, and developed an attention-based BiLSTM that operates directly on the IPFX feature-family representation. Pretraining this model on mouse data and fine-tuning it on human data improved human macro-F1 relative to a human-only baseline, indicating that mouse-to-human transfer learning can yield measurable gains in subclass prediction.
本研究旨在使用Allen Institute Patch-seq数据集,从鼠和人皮层中重复并扩展电生理学至转录组学框架,重点关注GABA能抑制性中间神经元。研究人员分析了3,699个鼠神经元和506个人神经元,重现了鼠数据中的类级分离,并开发了基于注意力的BiLSTM以实现特征家族级别的可解释性。他们还展示了鼠训练的序列模型在人数据上的微调可以提高宏F1分数,表明跨物种迁移学习在亚类预测中的潜在价值。
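The entry above reports gains in macro-F1 on a 4-class task. As a reminder of what that metric measures, here is a minimal sketch of the standard definition (unweighted mean of per-class F1); the function name and the toy labels are illustrative and not taken from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    per_class = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members scores 0 by convention here.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)
```

Macro averaging matters in this setting because subclass counts are typically imbalanced, and plain accuracy would be dominated by the most common subclass.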
The Bayesian Geometry of Transformer Attention
Authors: Naman Agarwal, Siddhartha R. Dalal, Vishal Misra
First: 2025-12-27T05:28:58+00:00 · Latest: 2026-03-11T17:22:40+00:00
Comments: fixed buggy references
Abstract
Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing \emph{Bayesian wind tunnels} -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation.
Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a \emph{frame-precision dissociation} predicted by recent gradient analyses.
Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.
中文标题/摘要
标题:变换器注意力的贝叶斯几何
变换器经常表现出进行贝叶斯推理的能力,但严格验证这一点一直不可能:自然数据缺乏解析后验,而大型模型则将推理与记忆混淆在一起。我们通过构建‘贝叶斯风洞’——一种控制环境,其中真正的后验以闭合形式已知且记忆是可证明不可能的,来解决这一问题。在这些环境中,小型变换器以10^-3至10^-4比特的精度再现了贝叶斯后验,而具有相同容量的MLP则失败了几个数量级,从而建立了明显的架构分离。
在两个任务——映射消除和隐藏马尔可夫模型(HMM)状态跟踪——中,我们发现变换器通过一致的几何机制实现贝叶斯推理:残差流作为信念载体,前馈网络执行后验更新,而注意力提供内容可寻址路由。几何诊断揭示了正交键基、渐进查询-键对齐以及由后验熵参数化的低维值流形。在训练过程中,流形展开而注意力模式保持稳定,这与最近的梯度分析预测的‘框架-精度分离’一致。
总体而言,这些结果表明,分层注意力通过几何设计实现了贝叶斯推理,解释了注意力的必要性和平面架构的失败。贝叶斯风洞为从小型可验证系统机械地连接到大型语言模型中观察到的推理现象提供了基础。
Summary / 总结
The paper tests whether transformers perform Bayesian reasoning by introducing Bayesian wind tunnels, controlled environments where the true posterior is known in closed form and memorization is provably impossible. Small transformers reproduce Bayesian posteriors to within $10^{-3}$-$10^{-4}$ bits, while capacity-matched MLPs fail by orders of magnitude. Transformers implement the inference through a consistent geometric mechanism: residual streams hold the belief, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. The study argues that hierarchical attention is necessary for this form of inference, which explains why flat architectures fail.
研究旨在通过构建‘贝叶斯风洞’来严格验证变压器是否进行贝叶斯推理,其中真后验已知。研究发现,小型变压器可以以高精度重现贝叶斯后验,而MLP则失败。变压器通过一种一致的几何机制实现贝叶斯推理,涉及残差流、前馈网络和注意力。在训练过程中,值流形参数化后验熵并展开,而注意力模式保持稳定,这一现象由最近的梯度分析预测。这项工作表明,分层注意力对于实现贝叶斯推理是必要的,解释了为什么平面架构会失败。
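One of the two tasks in the entry above is HMM state tracking, where the exact posterior is available in closed form. The sketch below shows that closed form (the standard forward filter) together with posterior entropy in bits. It is a generic illustration: the paper's actual task parameters are not given in the abstract, and the convention that `prior` describes the state one step before the first observation is an assumption of this sketch.

```python
import math

def hmm_filter(prior, transition, emission, observations):
    """Return the filtered posterior over hidden states after each observation.

    transition[i][j] = P(next state j | state i)
    emission[j][o]   = P(observation o | state j)
    """
    n = len(prior)
    belief = list(prior)
    history = []
    for obs in observations:
        # Predict: push the belief through the transition matrix.
        predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
        # Update: weight by the likelihood of the observation, then normalize.
        unnorm = [predicted[j] * emission[j][obs] for j in range(n)]
        z = sum(unnorm)
        belief = [x / z for x in unnorm]
        history.append(belief)
    return history

def entropy_bits(dist):
    # Shannon entropy in bits; zero-probability states contribute nothing.
    return -sum(p * math.log2(p) for p in dist if p > 0)
```

A learned model's predictive distribution can then be compared against `hmm_filter` output position by position, which is the kind of check a closed-form posterior makes possible.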
Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity
Authors: Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei
First: 2026-03-11T17:18:12+00:00 · Latest: 2026-03-11T17:18:12+00:00
Comments: accepted by CVPR2026
Abstract
Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.
中文标题/摘要
标题:太生动以至于不真实?生成色彩保真度的基准测试与校准
近年来,文本到图像(T2I)生成技术在视觉质量方面取得了显著进步,但生成出与现实世界摄影看起来真实的图像仍然具有挑战性。这在一定程度上是由于现有评估范式的偏见:人类评分和偏好训练的度量标准往往偏好视觉上生动、饱和度和对比度夸张的图像,即使在要求生成现实风格图像时,生成的图像也往往过于生动而不真实。为了解决这一问题,我们提出了色彩保真度数据集(CFD)和色彩保真度度量(CFM),用于客观评估现实风格生成中的色彩保真度。CFD包含超过130万张真实和合成图像,具有不同程度的色彩现实性,而CFM采用多模态编码器学习感知色彩保真度。此外,我们提出了一种无需训练的色彩保真度精炼(CFR),它能够自适应地调节生成中的空间-时间指导尺度,从而增强色彩的真实性。结合使用,CFD支持CFM进行评估,其学习到的注意力进一步引导CFR精炼T2I保真度,形成一个逐步框架,用于评估和改进现实风格T2I生成中的色彩保真度。数据集和代码可在https://github.com/ZhengyaoFang/CFM/获取。
Summary / 总结
This paper addresses the challenge of generating images that appear visually authentic to real-world photography, which is difficult due to biases in existing evaluation methods. It introduces the Color Fidelity Dataset (CFD) and the Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style images. CFD consists of over 1.3 million real and synthetic images with varying levels of color realism, while CFM uses a multimodal encoder to learn perceptual color fidelity. Additionally, a training-free Color Fidelity Refinement (CFR) method is proposed to enhance color authenticity by adaptively adjusting the spatial-temporal guidance scale during generation. This forms a progressive framework for assessing and improving color fidelity in text-to-image generation.
论文旨在解决生成图像以真实世界摄影视觉真实感为目标的挑战,现有评估方法中的偏见使得这一目标难以实现。为此,提出了颜色保真度数据集(CFD)和颜色保真度度量(CFM),用于客观评估现实风格生成中的颜色保真度。CFD 包含超过 130 万张真实和合成图像,具有不同程度的颜色现实性,而 CFM 利用多模态编码器学习感知颜色保真度。此外,还提出了一种无需训练的颜色保真度精炼(CFR)方法,以增强生成图像中的颜色真实性。这形成了一个逐步框架,用于评估和改进现实风格文本到图像生成中的颜色保真度。
Learning What Reinforcement Learning Can't: Interleaved Online Fine-Tuning for Hardest Questions
Authors: Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, Wentao Zhang
Venue: ICLR 2026
First: 2025-06-09T08:11:20+00:00 · Latest: 2026-03-11T17:08:49+00:00
Comments: 12 pages, 5 figures
Abstract
Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model's original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model's reasoning abilities. ReLIFT achieves an average improvement of over 5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscore its significant potential.
中文标题/摘要
标题:学习强化学习无法掌握的内容:最难问题的交错在线微调
大型语言模型(LLM)推理的最新进展表明,诸如规划和自我反思等复杂行为可以通过强化学习(RL)涌现出来。然而,尽管取得了这些成功,当前形式的RL仍然不足以诱导超出基模型限制的能力,因为它主要基于模型现有知识进行优化,而不是促进新信息的获取。为了解决这一局限性,我们采用监督微调(SFT)来学习RL无法掌握的内容,通过利用高质量的演示数据,使模型能够吸收新知识和推理模式。我们分析了RL和SFT在LLM推理中的训练动态,发现RL在保持和提高模型原有能力范围内的问题性能方面表现出色,而SFT则更有效地使模型能够解决超出当前模型范围的问题。受RL和SFT互补优势的启发,我们提出了一种新的训练方法——ReLIFT(Reinforcement Learning Interleaved with Online Fine-Tuning)。在ReLIFT中,模型主要使用RL进行训练,但在遇到难题时,收集高质量的解决方案进行微调,并交替进行RL和微调训练,以增强模型的推理能力。与其它零RL模型相比,ReLIFT在五个竞赛级别基准和一个离分布基准上平均提高了超过5.2分。此外,我们证明ReLIFT仅使用13%的详细演示数据就能超越RL和SFT,突显了其可扩展性。这些结果提供了有力的证据,表明ReLIFT克服了RL的基本局限性,并强调了其巨大的潜力。
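The interleaving described in the ReLIFT abstract above (train mainly with RL, collect solutions for hard questions, alternate with fine-tuning) can be pictured as a scheduling loop. The sketch below is a toy skeleton under stated assumptions, not the authors' implementation: the "hard question" test (zero solve rate), the buffer size, and all callback names are hypothetical.

```python
def relift_schedule(batches, solve_rate, rl_step, sft_step, get_demo, buffer_size=4):
    """Run RL on every batch and interleave an SFT step when enough hard questions pile up.

    solve_rate(q) -> fraction of rollouts that solved q (hypothetical callback)
    get_demo(q)   -> a high-quality solution for q (hypothetical callback)
    """
    buffer = []
    log = []
    for batch in batches:
        rl_step(batch)
        log.append("rl")
        for question in batch:
            # Assumption: a question counts as "hard" when no rollout solved it.
            if solve_rate(question) == 0.0:
                buffer.append(get_demo(question))
        if len(buffer) >= buffer_size:
            sft_step(list(buffer))
            log.append("sft")
            buffer = []
    return log
```

The point of the structure is that SFT data is gathered online, only for questions the current policy fails on, rather than fixed in advance.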
GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
Authors: Boyuan Chen, Minghao Shao, Siddharth Garg, Ramesh Karri, Muhammad Shafique
First: 2026-03-11T17:04:30+00:00 · Latest: 2026-03-11T17:04:30+00:00
Abstract
Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2--7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.
中文标题/摘要
标题:GroundCount:通过对象检测实现视觉-语言模型的空间定位以减轻计数幻觉
视觉语言模型(VLMs)在计数任务中表现出持续的幻觉现象,准确率远低于其他视觉推理任务(不包括情感分析)。这一现象在最先进的具有推理能力的VLMs中依然存在。相反,基于CNN的对象检测模型(ODMs)如YOLO在空间定位和实例计数方面表现出色,且计算开销较小。我们提出了一种名为GroundCount的框架,该框架通过从ODMs引入显式的空间定位来增强VLMs,以减轻计数幻觉。在最佳情况下,我们的基于提示的增强策略在性能最佳的模型(Ovis2.5-2B)上实现了81.3%的计数准确率,比基线提高了6.6个百分点,同时通过消除幻觉驱动的推理循环将推理时间减少了22%。我们进行了全面的消融研究,表明位置编码是关键组件,对强模型有利但对弱模型不利。相比之下,置信度分数对大多数架构引入了噪声,其移除在四个模型中提高了性能。我们进一步评估了特征级融合架构,发现通过结构化提示实现的显式符号定位优于隐式特征融合,尽管具有复杂的跨注意力机制。我们的方法在四个模型中实现了一致的改进(6.2-7.5个百分点),其中一个模型由于其迭代反射机制与结构化提示不兼容而表现出性能下降。这些结果表明,计数失败的根本原因在于空间语义整合的局限性,而不是特定架构的缺陷,同时强调了增强策略中架构兼容性的重要性。
Summary / 总结
The paper addresses counting hallucinations in Vision Language Models (VLMs) by proposing GroundCount, a framework that augments VLMs with explicit spatial grounding from object detection models (ODMs). On the best-performing model (Ovis2.5-2B), the prompt-based strategy raises counting accuracy by 6.6 percentage points to 81.3% while reducing inference time by 22%. Ablation studies show that positional encoding helps stronger models but hurts weaker ones, and that confidence scores mostly add noise: removing them improves performance in four of five evaluated models. Explicit symbolic grounding via structured prompts outperformed feature-level fusion, and the approach improved four of five VLM architectures by 6.2 to 7.5 percentage points; the fifth degraded because its iterative reflection mechanism is incompatible with structured prompts.
论文通过提出GroundCount框架,将对象检测模型(ODMs)与视觉语言模型(VLMs)结合,以增强空间定位,从而解决计数幻觉问题。该方法在Ovis2.5-2B模型上将计数准确性提高了6.6个百分点至81.3%,同时减少了22%的推理时间。研究还发现,位置编码对较强模型至关重要,但对较弱模型可能有害;移除置信度分数通常会提高性能。特征级融合架构并未超越GroundCount中使用的结构化提示。结果表明,计数失败主要是由于空间语义整合限制,而不是特定架构缺陷所致。
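The prompt-based augmentation in the GroundCount entry above amounts to serializing detector output into text placed ahead of the question. The sketch below is one possible format, offered as an assumption: the abstract does not specify the prompt template, and the detection dictionary keys (`label`, `box`, `score`) are hypothetical. Confidence scores are off by default, reflecting the ablation finding that removing them helped most models.

```python
from collections import Counter

def build_grounded_prompt(question, detections, include_confidence=False):
    """Turn detector output into a structured text prefix for a VLM prompt."""
    counts = Counter(d["label"] for d in detections)
    lines = ["Detected objects (from an external detector):"]
    for index, det in enumerate(detections, start=1):
        x1, y1, x2, y2 = det["box"]
        entry = f"{index}. {det['label']} at box ({x1}, {y1}, {x2}, {y2})"
        if include_confidence:
            entry += f", confidence {det['score']:.2f}"
        lines.append(entry)
    summary = ", ".join(f"{label}: {n}" for label, n in sorted(counts.items()))
    lines.append(f"Counts by class: {summary}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```

Dropping the box coordinates would be the analogue of the positional-encoding ablation, which the abstract reports helps or hurts depending on model strength.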
ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging
Authors: Athanasios Angelakis
First: 2026-02-20T01:38:59+00:00 · Latest: 2026-03-11T16:56:59+00:00
Comments: 24 pages, 15 figures, 5 tables. Code and models available at https://github.com/Bluesman79/ZACH-ViT
Abstract
Vision Transformers rely on positional embeddings and class tokens encoding fixed spatial priors. While effective for natural images, these priors may be suboptimal when spatial layout is weakly informative, a frequent condition in medical imaging. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes positional embeddings and the [CLS] token, achieving permutation-invariant patch processing via global average pooling. Zero-token denotes removal of the dedicated aggregation token and positional encodings. Patch tokens remain unchanged. Adaptive residual projections preserve training stability under strict parameter constraints. We evaluate ZACH-ViT across seven MedMNIST datasets under a strict few-shot protocol (50 samples/class, fixed hyperparameters, five seeds). Results reveal regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves strongest advantage on BloodMNIST and remains competitive on PathMNIST, while relative advantage decreases on datasets with stronger anatomical priors (OCTMNIST, OrganAMNIST), consistent with our hypothesis. Component and pooling ablations show positional support becomes mildly beneficial as spatial structure increases, whereas reintroducing a [CLS] token is consistently unfavorable. These findings support that architectural alignment with data structure can outweigh universal benchmark dominance. Despite minimal size and no pretraining, ZACH-ViT achieves competitive performance under data-scarce conditions, relevant for compact medical imaging and low-resource settings. Code: https://github.com/Bluesman79/ZACH-ViT
中文标题/摘要
标题:ZACH-ViT:紧凑型视觉变换器在医学成像中的依赖于范式的归纳偏见
视觉变换器依赖于位置嵌入和类标记编码固定的空间先验。虽然对于自然图像有效,但在医学成像中,由于空间布局信息较弱,这些先验可能并不理想。我们引入了ZACH-ViT(Zero-token Adaptive Compact Hierarchical Vision Transformer),这是一种紧凑型视觉变换器,去除了位置嵌入和[CLS]标记,通过全局平均池化实现不变的块处理。Zero-token表示去除了专用聚合标记和位置编码,块标记保持不变。自适应残差投影在严格参数约束下保持训练稳定性。我们在严格的少量样本协议(每类50个样本,固定超参数,五个种子)下,对ZACH-ViT在七个MedMNIST数据集上进行了评估。结果表明,ZACH-ViT(0.25M参数,从头开始训练)在BloodMNIST上表现出最强的优势,并在PathMNIST上保持竞争力,但在具有更强解剖先验的数据集(OCTMNIST,OrganAMNIST)上相对优势下降,这与我们的假设一致。组件和池化消融实验表明,随着空间结构的增加,位置支持变得略微有益,而重新引入[CLS]标记始终是不利的。这些发现支持了与数据结构的架构对齐可以超越通用基准主导地位的观点。尽管ZACH-ViT规模很小且未进行预训练,但在数据稀缺条件下仍能实现竞争力的性能,这在紧凑型医学成像和低资源环境中具有相关性。代码:https://github.com/Bluesman79/ZACH-ViT
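The permutation-invariance claim in the ZACH-ViT abstract above follows from two facts: self-attention without positional embeddings is permutation-equivariant, and averaging over tokens removes the remaining order dependence. The sketch below demonstrates this with a deliberately stripped-down attention layer (no learned projections, queries = keys = values); it is an illustration of the principle, not the ZACH-ViT architecture.

```python
import math

def self_attention(tokens):
    # Single-head attention with no positional term and no learned weights.
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        outputs.append([sum(w * v[c] for w, v in zip(weights, tokens)) / z for c in range(dim)])
    return outputs

def mean_pool(tokens):
    # Global average pooling replaces the [CLS] token as the aggregator.
    n = len(tokens)
    return [sum(t[c] for t in tokens) / n for c in range(len(tokens[0]))]

def encode(tokens):
    return mean_pool(self_attention(tokens))
```

Shuffling the patch tokens before calling `encode` changes only the order of floating-point summation, so the pooled representation agrees up to rounding.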
Contact Coverage-Guided Exploration for General-Purpose Dexterous Manipulation
Authors: Zixuan Liu, Ruoyi Qiao, Chenrui Tie, Xuanwei Liu, Yunfan Lou, Chongkai Gao, Zhixuan Xu, Lin Shao
First: 2026-03-11T16:55:49+00:00 · Latest: 2026-03-11T16:55:49+00:00
Comments: 16 pages
Abstract
Deep reinforcement learning (DRL) has achieved remarkable success in domains with well-defined reward structures, such as Atari games and locomotion. In contrast, dexterous manipulation lacks general-purpose reward formulations and typically depends on task-specific, handcrafted priors to guide hand-object interactions. We propose Contact Coverage-Guided Exploration (CCGE), a general exploration method designed for general-purpose dexterous manipulation tasks. CCGE represents contact state as the intersection between object surface points and predefined hand keypoints, encouraging dexterous hands to discover diverse and novel contact patterns, namely which fingers contact which object regions. It maintains a contact counter conditioned on discretized object states obtained via learned hash codes, capturing how frequently each finger interacts with different object regions. This counter is leveraged in two complementary ways: (1) to assign a count-based contact coverage reward that promotes exploration of novel contact patterns, and (2) to define an energy-based reaching reward that guides the agent toward under-explored contact regions. We evaluate CCGE on a diverse set of dexterous manipulation tasks, including cluttered object singulation, constrained object retrieval, in-hand reorientation, and bimanual manipulation. Experimental results show that CCGE substantially improves training efficiency and success rates over existing exploration methods, and that the contact patterns learned with CCGE transfer robustly to real-world robotic systems. The project page is https://contact-coverage-guided-exploration.github.io.
中文标题/摘要
标题:接触覆盖引导探索在通用灵巧操作中的应用
深度强化学习(DRL)在具有明确奖励结构的领域取得了显著成功,如Atari游戏和运动控制。相比之下,灵巧操作缺乏通用的奖励形式,通常依赖于特定任务的手工设计先验来指导手-物体交互。我们提出了一种名为接触覆盖引导探索(CCGE)的一般探索方法,旨在用于通用灵巧操作任务。CCGE将接触状态表示为物体表面点与预定义的手指关键点的交集,鼓励灵巧的手指发现多样且新颖的接触模式,即哪些手指接触哪些物体区域。它通过学习得到的哈希码离散化物体状态维护一个接触计数器,捕捉每个手指与不同物体区域交互的频率。该计数器以两种互补的方式利用:(1)基于计数的接触覆盖奖励,促进探索新颖的接触模式;(2)能量基的抓取奖励,引导智能体向未探索的接触区域移动。我们在包括杂乱物体分离、受限物体检索、手中重新定向和双臂操作在内的多种灵巧操作任务上评估了CCGE。实验结果表明,CCGE在训练效率和成功率上显著优于现有探索方法,并且使用CCGE学习到的接触模式能够稳健地转移到实际的机器人系统中。项目页面为https://contact-coverage-guided-exploration.github.io。
Summary / 总结
The research aims to address the challenge of general-purpose dexterous manipulation by proposing Contact Coverage-Guided Exploration (CCGE), which uses contact coverage and frequency of finger-object interactions to guide exploration. CCGE improves training efficiency and success rates in various manipulation tasks compared to existing methods, and the learned contact patterns transfer well to real-world robotic systems.
研究旨在通过改进深度强化学习来解决缺乏通用奖励形式的灵巧操作任务。提出的接触覆盖引导探索(CCGE)方法利用接触覆盖和手指-物体交互的频率来引导探索。CCGE在多种操作任务中提高了训练效率和成功率,并且学习到的接触模式能够很好地转移到实际机器人系统中。
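The count-based contact coverage reward in the CCGE entry above can be pictured as a visit counter keyed on (discretized object state, finger, object region). The sketch below is a toy version under assumptions: the abstract does not give the reward formula, so the 1/sqrt(count) bonus (a common choice in count-based exploration) is swapped in for illustration, and the state code is passed in as a plain value rather than produced by a learned hash.

```python
import math
from collections import defaultdict

class ContactCounter:
    """Tracks how often each finger has touched each object region in each state bucket."""

    def __init__(self):
        self.counts = defaultdict(int)

    def reward(self, state_code, contacts):
        # contacts: iterable of (finger, region) pairs active at this step.
        total = 0.0
        for finger, region in contacts:
            key = (state_code, finger, region)
            self.counts[key] += 1
            # Assumed bonus form: decays as the same contact pattern repeats.
            total += 1.0 / math.sqrt(self.counts[key])
        return total
```

Repeating a familiar contact earns progressively less, while touching a new region with any finger earns the full bonus again, which is the pressure toward diverse contact patterns the abstract describes.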
TOSSS: a CVE-based Software Security Benchmark for Large Language Models
Authors: Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, Roos Wensveen
First: 2026-03-11T16:54:01+00:00 · Latest: 2026-03-11T16:54:01+00:00
Abstract
With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are LLMs good at software security? At the same time, organizations worldwide invest heavily in cybersecurity to reduce exposure to disruptive attacks. The integration of LLMs into software engineering workflows may introduce new vulnerabilities and weaken existing security efforts.
We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities. In contrast, TOSSS relies on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. Our benchmark gives each model a security score between 0 and 1 based on its behavior; a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could become a complementary security-focused score to include in these reports.
中文标题/摘要
标题:TOSSS:基于CVE的软件安全基准测试用于大型语言模型
随着其能力的不断增强,大型语言模型(LLMs)现在被广泛应用于许多行业。它们已成为软件工程师的有用工具,并支持广泛的开发任务。随着LLMs在软件开发工作流中的应用越来越广泛,一个关键问题出现了:LLMs在软件安全方面表现如何?与此同时,世界各国组织都在大力投资网络安全,以减少受到破坏性攻击的暴露。将LLMs集成到软件工程工作流中可能会引入新的漏洞,削弱现有的安全努力。
我们提出了TOSSS(Two-Option Secure Snippet Selection),这是一个基准测试,用于衡量LLMs在选择安全代码片段和易受攻击代码片段之间的能力。现有的针对LLMs的安全基准测试仅涵盖有限范围的漏洞。相比之下,TOSSS依赖于CVE数据库,并提供了一个可扩展的框架,可以随着时间的推移整合新披露的漏洞。我们的基准测试根据模型的行为给每个模型一个0到1之间的安全评分;得分为1表示模型总是选择安全的代码片段,得分为0表示它总是选择易受攻击的代码片段。我们在C/C++和Java代码上评估了14个广泛使用的开源和闭源模型,并观察到评分范围从0.48到0.89。LLMs提供商已经发布了许多模型的基准测试评分,TOSSS可以成为这些报告中的一个补充的安全重点评分。
Summary / 总结
TOSSS is a benchmark designed to evaluate Large Language Models (LLMs) on their ability to choose secure code snippets over vulnerable ones. It leverages the CVE database to ensure a wide range of security vulnerabilities are covered. Evaluating 14 LLMs on C/C++ and Java code, TOSSS scores ranged from 0.48 to 0.89, indicating varying levels of security awareness among these models. This benchmark could serve as a security-focused score to complement existing benchmark reports.
TOSSS 是一个基准,用于评估大型语言模型(LLMs)在选择安全代码片段方面的能力,而不是选择易受攻击的代码片段。它利用 CVE 数据库来确保涵盖广泛的漏洞。评估 14 个 LLMs 在 C/C++ 和 Java 代码上的表现,TOSSS 的得分范围从 0.48 到 0.89,表明这些模型在安全意识方面的差异。该基准可以作为 LLM 提供商在其报告中包含的补充安全重点得分。
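The TOSSS score described above is a number in [0, 1] where 1 means the secure snippet is always chosen and 0 means the vulnerable one always is. A minimal sketch consistent with that description is the fraction of trials on which the secure snippet was selected; the abstract does not state the exact aggregation, so treating it as a plain fraction is an assumption, as are the string labels used here.

```python
def security_score(choices):
    """Fraction of two-option trials where the model picked the secure snippet.

    choices: list of "secure" / "vulnerable" strings, one per trial (hypothetical encoding).
    """
    if not choices:
        raise ValueError("at least one trial is required")
    return sum(1 for choice in choices if choice == "secure") / len(choices)
```

Under this reading, a model choosing at random would score about 0.5, which gives context to the reported range of 0.48 to 0.89.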
Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI
Authors: Joan Perramon-Llussà, Amelia Jiménez-Sánchez, Grzegorz Skorupko, Fotis Avgoustidis, Carlos Martín-Isla, Karim Lekadir, Polyxeni Gkontra
Venue: MICCAI 2026
First: 2026-03-11T16:52:21+00:00 · Latest: 2026-03-11T16:52:21+00:00
Comments: 11 pages, 2 figures. Submitted to MICCAI 2026
Abstract
Foundation models (FMs) show great promise for robust downstream performance across medical imaging tasks and modalities, including cardiac magnetic resonance (CMR), following task-specific adaptation. However, adaptation using single-site data may lead to suboptimal performance and increased model bias, while centralized fine-tuning on clinical data is often infeasible due to privacy constraints. Federated fine-tuning offers a privacy-preserving alternative; yet conventional approaches struggle under heterogeneous, non-IID multi-center data and incur substantial communication overhead when adapting large models. In this work, we study federated FM fine-tuning for 3D CMR disease detection and propose Med-DualLoRA, a client-aware parameter-efficient fine-tuning (PEFT) federated framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Global and local LoRA modules are trained locally, but only the global component is shared and aggregated across sites, keeping local adapters private. This design improves personalization while significantly reducing communication cost, and experiments show that adapting only two transformer blocks preserves performance while further improving efficiency. We evaluate our method on a multi-center state-of-the-art cine 3D CMR FM fine-tuned for disease detection using ACDC and combined M&Ms datasets, treating each vendor as a federated client. Med-DualLoRA achieves statistically significantly improved performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines, while maintaining communication efficiency. Our approach provides a scalable solution for local federated adaptation of medical FMs under realistic clinical constraints.
中文标题/摘要
标题:Med-DualLoRA:针对3D心脏MRI的医学基础模型局部适应
基础模型(FMs)在医疗成像任务和模态中表现出巨大的潜力,特别是在心脏磁共振(CMR)领域,经过特定任务的适应后,可以实现稳健的下游性能。然而,使用单一站点的数据进行适应可能导致性能不佳和模型偏差增加,而集中式微调由于隐私限制往往不可行。联邦微调提供了一种隐私保护的替代方案;然而,传统方法在异构、非IID多中心数据下表现不佳,并且在适应大型模型时会产生大量的通信开销。在本文中,我们研究了3D CMR疾病检测的联邦FM微调,并提出了一种Med-DualLoRA客户端感知的参数高效微调(PEFT)联邦框架,通过加性分解分离全局共享和局部低秩适应(LoRA)。全局和局部LoRA模块在本地训练,但仅共享和聚合全局组件,保持局部适配器的隐私。这种设计提高了个性化能力,同时显著降低了通信成本。实验表明,仅适应两个变压器块可以保持性能并进一步提高效率。我们在ACDC和联合M&M数据集上对多中心最先进的cine 3D CMR FM进行疾病检测微调,将每个供应商视为联邦客户端,评估了我们的方法。Med-DualLoRA在与其他联邦PEFT基线相比时,实现了统计上显著的性能提升(平衡准确率0.768,特异性0.612),同时保持了通信效率。我们的方法为在现实临床约束下提供了一种可扩展的医学FMs局部联邦适应解决方案。
Summary / 总结
This work addresses the challenge of adapting foundation models for 3D cardiac MRI disease detection in a federated learning setting, where centralized fine-tuning is impractical due to privacy constraints. The proposed Med-DualLoRA framework disentangles global and local low-rank adaptations, allowing only the global component to be shared, which improves personalization while reducing communication overhead. Experiments show that adapting only two transformer blocks preserves performance and further enhances efficiency, achieving statistically significant improvements in balanced accuracy and specificity compared to other federated PEFT baselines.
研究旨在通过联邦学习提高基础模型在3D心脏MRI疾病检测中的性能,同时解决隐私问题。Med-DualLoRA是一种参数高效微调框架,将全局和局部适应分离,仅共享全局组件。这种方法增强了个性化并减少了通信开销。实验表明,仅微调两个变压器块可以保持性能并提高效率,相比其他联邦PEFT基线方法,实现了更好的平衡准确率和特异性。
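The additive decomposition in the Med-DualLoRA entry above can be written as W = W0 + B_g A_g + B_l A_l, with only the global factors leaving each site. The sketch below illustrates that split with tiny list-based matrices. It is a toy under assumptions: averaging the A and B factors separately is a simplification chosen for this sketch (it is not the same as averaging the products), and the client dictionary layout is hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def add(a, b):
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def effective_weight(w0, global_ab, local_ab):
    # Frozen base weight plus a shared low-rank update plus a private low-rank update.
    b_g, a_g = global_ab
    b_l, a_l = local_ab
    return add(add(w0, matmul(b_g, a_g)), matmul(b_l, a_l))

def average(mats):
    n = len(mats)
    return [[sum(m[i][j] for m in mats) / n for j in range(len(mats[0][0]))]
            for i in range(len(mats[0]))]

def aggregate_global(clients):
    # Only the global adapter is averaged and redistributed; local adapters never move.
    b_avg = average([c["global"][0] for c in clients])
    a_avg = average([c["global"][1] for c in clients])
    for client in clients:
        client["global"] = (b_avg, a_avg)
```

Communication per round scales with the size of the global low-rank factors only, which is where the reported savings over full-model exchange come from.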
Ranking Reasoning LLMs under Test-Time Scaling
Authors: Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary
First: 2026-03-11T16:47:41+00:00 · Latest: 2026-03-11T16:47:41+00:00
Comments: Code is available at https://github.com/mohsenhariri/scorio
Abstract
Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $τ_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $τ_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
中文标题/摘要
标题:测试时缩放下逻辑推理大模型的排名推理
测试时缩放通过每条提示采样多个输出来评估逻辑推理大模型,但在这种模式下对模型进行排名仍较少被探索。我们形式化了测试时缩放下的密集基准排名,并引入了Scorio库,该库实现了配对比较模型、项目反应理论(IRT)模型、投票规则以及图和谱方法等统计排名方法。在最多$N=80$次试验的20个逻辑推理模型上,四个奥林匹克风格的数学基准(AIME'24、AIME'25、HMMT'25和BrUMO'25),大多数全试验排名与贝叶斯黄金标准$\mathrm{Bayes}_{\mathcal{U}}@80$(平均Kendall's $τ_b = 0.93$--$0.95$)高度一致,且19到34种方法完全恢复了相同的排序。在单次试验模式下,最佳方法达到$τ_b \approx 0.86$。使用贪婪解码作为经验先验($\mathrm{Bayes}_{\mathbf{R}_0}@N$)在$N=1$时可减少方差16%到52%,但当贪婪解码和随机采样结果不同时,可能会导致排名偏差。这些结果确定了适用于高预算和低预算测试时缩放的可靠排名方法。我们以开源库的形式发布了Scorio,可在https://github.com/mohsenhariri/scorio获取。
Summary / 总结
The paper addresses ranking reasoning LLMs under test-time scaling by formalizing dense benchmark ranking and introducing Scorio, a library of statistical ranking methods. Across 20 reasoning models on four Olympiad-style math benchmarks, most full-trial rankings closely match the Bayesian gold standard, with mean Kendall's τ_b between 0.93 and 0.95. In the single-trial regime, the best methods reach τ_b of about 0.86. Using greedy decoding as an empirical prior reduces variance at N=1 by 16-52% but can bias rankings when greedy and stochastic sampling disagree. The study identifies reliable ranking methods for both high- and low-budget test-time scaling.
该研究通过每次提示采样多个输出来评估推理LLM,并引入了Scorio库,该库实现了统计排名方法。在四个奥林匹克风格的数学基准测试上,20个推理模型的大多数全试次排名与贝叶斯黄金标准高度一致,Kendall's τ_b范围在0.93到0.95之间。在单试次条件下,最佳方法的τ_b约为0.86。贪婪解码可以减少方差,但在与随机采样结果不一致时会偏排名。该研究提供了适用于高预算和低预算测试时间缩放的可靠排名方法。
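Agreement between rankings in the entry above is reported as Kendall's τ_b. For readers unfamiliar with the statistic, the sketch below implements the standard tie-corrected definition directly from pair counts. It is a plain O(n²) reference written for this digest and does not use or mirror the Scorio API.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b between two score lists of equal length, with tie correction."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both lists: excluded from every term
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    # Pairs not tied in x, and pairs not tied in y.
    untied_x = concordant + discordant + tied_y_only
    untied_y = concordant + discordant + tied_x_only
    denom = math.sqrt(untied_x * untied_y)
    if denom == 0:
        raise ValueError("tau-b is undefined when one list is constant")
    return (concordant - discordant) / denom
```

A value of 1 means two methods order every pair of models identically, so the reported 0.93 to 0.95 corresponds to only a few swapped pairs among 20 models.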
CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance
Authors: Hanyang Wang, Yiyang Liu, Jiawei Chi, Fangfu Liu, Ran Xue, Yueqi Duan
Venue: CVPR 2026
First: 2026-03-03T18:59:48+00:00 · Latest: 2026-03-11T16:29:34+00:00
Comments: Accepted by CVPR 2026; Project Page: https://hanyang-21.github.io/CFG-Ctrl
Abstract
Classifier-Free Guidance (CFG) has emerged as a central approach for enhancing semantic alignment in flow-based diffusion models. In this paper, we explore a unified framework called CFG-Ctrl, which reinterprets CFG as a control applied to the first-order continuous-time generative flow, using the conditional-unconditional discrepancy as an error signal to adjust the velocity field. From this perspective, we summarize vanilla CFG as a proportional controller (P-control) with fixed gain, and typical follow-up variants develop extended control-law designs derived from it. However, existing methods mainly rely on linear control, inherently leading to instability, overshooting, and degraded semantic fidelity especially on large guidance scales. To address this, we introduce Sliding Mode Control CFG (SMC-CFG), which enforces the generative flow toward a rapidly convergent sliding manifold. Specifically, we define an exponential sliding mode surface over the semantic prediction error and introduce a switching control term to establish nonlinear feedback-guided correction. Moreover, we provide a Lyapunov stability analysis to theoretically support finite-time convergence. Experiments across text-to-image generation models including Stable Diffusion 3.5, Flux, and Qwen-Image demonstrate that SMC-CFG outperforms standard CFG in semantic alignment and enhances robustness across a wide range of guidance scales. Project Page: https://hanyang-21.github.io/CFG-Ctrl
中文标题/摘要
标题:CFG-Ctrl:基于控制的分类器自由扩散引导
分类器自由引导(CFG)已成为增强流基扩散模型语义对齐的核心方法。本文探讨了一种统一框架CFG-Ctrl,将CFG重新解释为对第一阶连续生成流的控制,并使用条件-非条件差异作为误差信号调整速度场。从这个角度来看,我们总结了传统的CFG为固定增益的比例控制器(P控制),而常见的后续变体则在此基础上发展了扩展的控制律设计。然而,现有方法主要依赖线性控制,这导致了不稳定性、超调和语义保真度下降,尤其是在大引导尺度下。为了解决这一问题,我们引入了滑模控制CFG(SMC-CFG),它强制生成流向快速收敛的滑动流形。具体而言,我们定义了语义预测误差的指数滑模表面,并引入了切换控制项以建立非线性反馈引导校正。此外,我们提供了李亚普诺夫稳定性分析以理论支持有限时间收敛。实验表明,SMC-CFG在语义对齐方面优于标准CFG,并且在广泛的引导尺度范围内增强了鲁棒性。项目页面:https://hanyang-21.github.io/CFG-Ctrl
Summary / 总结
The paper introduces CFG-Ctrl, a unified framework that reinterprets Classifier-Free Guidance (CFG) as a control applied to the first-order continuous-time generative flow. It addresses the instability issues of existing methods by proposing Sliding Mode Control CFG (SMC-CFG), which uses an exponential sliding mode surface and a switching control term to ensure finite-time convergence. Experiments show that SMC-CFG outperforms standard CFG in semantic alignment and robustness across various guidance scales in text-to-image generation models like Stable Diffusion 3.5, Flux, and Qwen-Image.
研究旨在通过控制方法提高流基扩散模型中的语义对齐。CFG-Ctrl框架将分类器无指导方法重新解释为对生成流的控制,并使用条件-无条件差异作为误差信号。实验表明,引入的滑模控制CFG(SMC-CFG)在各种指导尺度下比标准CFG在语义对齐和鲁棒性方面表现更优,解决了现有方法的不稳定性问题。
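The proportional-control reading of vanilla CFG can be sketched as follows. This is a minimal illustration under the common formulation in which the guided velocity is the unconditional velocity plus a fixed gain times the conditional-unconditional discrepancy; the function names and toy vectors are assumptions, and the sliding-mode term of SMC-CFG is not reproduced because the abstract does not give its exact form.

```python
def cfg_velocity(v_uncond, v_cond, gain):
    """Vanilla CFG as P-control: error = v_cond - v_uncond, scaled by a fixed gain."""
    return [u + gain * (c - u) for u, c in zip(v_uncond, v_cond)]


def euler_step(x, velocity, dt):
    """One explicit Euler step along the (guided) velocity field."""
    return [xi + dt * vi for xi, vi in zip(x, velocity)]


# Toy 2-D example with hypothetical velocities.
v_uncond = [0.0, 1.0]
v_cond = [1.0, 1.0]
guided = cfg_velocity(v_uncond, v_cond, gain=3.0)
x_next = euler_step([0.0, 0.0], guided, dt=0.5)
```

A gain of 1 recovers the conditional velocity, and larger gains amplify the error signal linearly, which is the regime where the abstract reports instability and overshooting.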
SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation
Authors: Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu
First: 2025-10-13T22:52:17+00:00 · Latest: 2026-03-11T16:26:47+00:00
Abstract
Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative; however, existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users' information needs and expectations in a company's target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.
中文标题/摘要
标题:SAGE:一种基于知识的自上而下自下而上用户模拟器用于多轮智能体评估
多轮交互智能体的评估具有挑战性,因为需要人类评估。模拟用户进行评估已被引入作为替代方案,然而现有方法通常建模通用用户,忽略了捕捉现实行为所需的领域特定原则。我们提出了一种名为SAGE的新颖用户模拟框架,用于多轮智能体评估,该框架结合了来自商业背景的知识。SAGE整合了源自商业逻辑的自上而下的知识,如理想客户画像,使用户行为基于现实的客户角色。我们进一步整合了源自商业智能体基础设施的自下而上的知识(例如,产品目录、常见问题解答和知识库),使模拟器能够生成反映用户在公司目标市场中的信息需求和期望的交互。通过实证评估,我们发现这种方法生成的交互更加真实和多样化,同时还能识别出高达33%更多的智能体错误,突显了其作为支持错误查找和迭代智能体改进的评估工具的有效性。
Summary / 总结
The paper introduces SAGE, a user simulator for evaluating multi-turn interactive agents that combines top-down knowledge rooted in business logic (such as ideal customer profiles) with bottom-up knowledge drawn from business agent infrastructure (product catalogs, FAQs, and knowledge bases). In the authors' empirical evaluation, this approach produces more realistic and diverse interactions and identifies up to 33% more agent errors.
论文提出了SAGE,一种用于评估多轮交互代理的用户模拟框架。该方法结合了基于业务逻辑的自上而下的知识和来自业务代理基础设施的自下而上的知识,以生成更真实和多样的用户交互。该方法识别出的代理错误最多可多出33%,表明其作为支持错误查找和迭代代理改进的评估工具的有效性。
Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control
Authors: Yaswanth Chittepu, Ativ Joshi, Rajarshi Bhattacharjee, Scott Niekum
First: 2026-03-11T16:24:20+00:00 · Latest: 2026-03-11T16:24:20+00:00
Abstract
Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy's cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model's risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.
中文标题/摘要
标题:超越预期的Safe RLHF:随机占优的普遍光谱风险控制
Safe强化学习从人类反馈(RLHF)通常通过期望成本约束来确保安全性,但期望值只捕捉成本分布的一个统计量,未能考虑分布不确定性,特别是在重尾或罕见灾难性事件下。当鲁棒性和风险敏感性至关重要时,这一限制是存在问题的。随机占优提供了一种替代方案,通过比较整个成本分布而非仅仅平均值来实现,从而直接控制尾部风险和潜在的分布外故障,这些是基于期望约束可能忽略的。在本文中,我们提出了风险敏感对齐通过占优(RAD),这是一种新颖的对齐框架,用一阶随机占优(FSD)约束替代标量期望成本约束。我们通过在最优传输(OT)框架下比较目标策略的成本分布和参考策略的成本分布来实现这一约束,使用熵正则化和Sinkhorn迭代来获得一个可微且计算高效的优化目标,以实现稳定的一体化优化。此外,我们引入了加权一阶随机占优约束,并证明加权随机占优可以控制广泛的光谱风险度量(SRM),加权占优下的改进意味着在相应光谱风险下的保证改进。这提供了一种通过分位数加权函数调节模型风险配置的原理机制。实验证明,RAD在减少危害性方面优于基线,同时在有益性方面保持竞争力,并在分布外危害性评估中表现出更强的鲁棒性。
Summary / 总结
This paper addresses the limitation of expected-cost constraints in Safe RLHF by proposing Risk-sensitive Alignment via Dominance (RAD), which uses First-Order Stochastic Dominance (FSD) constraints to control the entire cost distribution rather than just the expected cost. The method compares the target policy's cost distribution to that of a reference policy in an Optimal Transport framework with entropic regularization and Sinkhorn iterations, yielding a differentiable and efficient objective. Quantile-weighted FSD constraints are shown to control a broad class of Spectral Risk Measures. Empirically, RAD improves harmlessness over baselines while remaining competitive in helpfulness, and is more robust on out-of-distribution harmlessness evaluations.
本文提出了Risk-sensitive Alignment via Dominance (RAD) 方法,使用First-Order Stochastic Dominance (FSD) 约束来控制整个成本分布,而不是仅仅控制期望成本。该方法通过最优传输框架和熵正则化来比较目标策略的成本分布与参考策略的成本分布,确保了可微分和高效的优化目标。实验证明,RAD 在增强安全性的同时保持了有效性,并且在应对离分布失败方面表现出了更高的鲁棒性。
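The difference between an expected-cost constraint and a dominance constraint can be seen on small samples. The sketch below is not RAD's Optimal Transport objective; it is a plain empirical check, assuming equal-size cost samples and the convention that lower cost is safer, with CVaR used as one familiar spectral risk measure. All names and numbers are hypothetical.

```python
def dominates_in_cost(target, reference):
    """Empirical first-order dominance: every quantile of target cost <= reference."""
    if len(target) != len(reference):
        raise ValueError("this sketch assumes equal-size samples")
    return all(t <= r for t, r in zip(sorted(target), sorted(reference)))


def spectral_risk(costs, weights):
    """Weighted average of sorted costs; weights are assumed to be
    non-negative, non-decreasing, and to sum to 1."""
    return sum(w * c for w, c in zip(weights, sorted(costs)))


def cvar_weights(n, alpha):
    """Spectrum that averages the worst (1 - alpha) fraction of n samples."""
    tail = max(1, round(n * (1 - alpha)))
    return [0.0] * (n - tail) + [1.0 / tail] * tail


reference = [0.2, 0.2, 0.5, 1.0]
safer = [0.1, 0.2, 0.3, 0.9]      # lower cost at every quantile
same_mean = [0.0, 0.0, 0.0, 1.9]  # same mean as reference, heavier tail

weights = cvar_weights(4, alpha=0.5)
risk_reference = spectral_risk(reference, weights)
risk_safer = spectral_risk(safer, weights)
mean_gap = abs(sum(same_mean) / 4 - sum(reference) / 4)
```

Here `same_mean` would pass a constraint on the average cost but fails the dominance check, while `safer` dominates the reference and therefore has no larger CVaR, which mirrors the abstract's claim that dominance implies improvement in the corresponding spectral risk.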
Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning
Authors: Abdul Rehman Akbar, Alejandro Levya, Ashwini Esnakula, Elshad Hasanov, Anne Noonan, Lingbin Meng, Susan Tsai, Vaibhav Sahai, Midhun Malla, Sarbajit Mukherjee, Upender Manne, Anil Parwani, Wei Chen, Ashish Manne, Muhammad Khalid Khan Niazi
First: 2026-01-06T20:52:12+00:00 · Latest: 2026-03-11T16:17:04+00:00
Abstract
Molecular subtyping of PDAC into basal-like and classical has established prognostic and predictive value. However, its use in clinical practice is limited by cost, turnaround time, and tissue requirements, thereby restricting its application in the management of PDAC. We introduce PanSubNet, an interpretable deep learning framework that predicts therapy-relevant molecular subtypes directly from standard H&E-stained WSIs. PanSubNet was developed using data from 1,055 patients across two multi-institutional cohorts (PANCAN, n=846; TCGA, n=209) with paired histology and RNA-seq data. Ground-truth labels were derived using the validated Moffitt 50-gene signature refined by GATA6 expression. The model employs a dual-scale architecture that fuses cellular-level morphology with tissue-level architecture, leveraging attention mechanisms for multi-scale representation learning and transparent feature attribution. On internal validation within PANCAN using five-fold cross-validation, PanSubNet achieved a mean AUC of 88.5% with balanced sensitivity and specificity. External validation on the independent TCGA cohort without fine-tuning demonstrated robust generalizability (AUC 84.0%). PanSubNet preserved and, in metastatic disease, strengthened prognostic stratification compared to RNA-seq based labels. Prediction uncertainty was linked to intermediate transcriptional states, not classification noise. Model predictions are aligned with established transcriptomic programs, differentiation markers, and DNA damage repair signatures. By enabling rapid, cost-effective molecular stratification from routine H&E-stained slides, PanSubNet offers a clinically deployable and interpretable tool for genetic subtyping. We are gathering data from two institutions to validate and assess real-world performance, supporting integration into digital pathology workflows and advancing precision oncology for PDAC.
中文标题/摘要
标题:利用深度学习从常规组织病理学中推断胰腺癌的临床相关分子亚型
胰腺导管腺癌(PDAC)的基底样和经典亚型分子分型具有预后和预测价值。然而,由于成本、周转时间和组织需求的限制,其在临床实践中的应用受到限制。我们引入了PanSubNet,这是一种可解释的深度学习框架,可以直接从标准HE染色WSI中预测与治疗相关的分子亚型。PanSubNet使用了来自两个多机构队列(PANCAN,n=846;TCGA,n=209)的1,055名患者的配对组织学和RNA-seq数据进行开发。真实标签基于经过GATA6表达验证的Moffitt 50基因签名。该模型采用双尺度架构,融合了细胞水平形态与组织水平结构,利用注意力机制进行多尺度表示学习和透明特征归因。在PANCAN内部验证中,使用五折交叉验证,PanSubNet的平均AUC为88.5%,具有平衡的敏感性和特异性。在独立的TCGA队列中进行外部验证,无需微调,显示出稳健的泛化能力(AUC 84.0%)。与基于RNA-seq的标签相比,PanSubNet在预后分层上保持了并加强了在转移性疾病中的预后分层。预测不确定性与中间转录状态相关,而非分类噪声。模型预测与已建立的转录组程序、分化标记和DNA损伤修复签名一致。通过从常规HE染色切片中实现快速、低成本的分子分型,PanSubNet提供了一种临床可部署且可解释的基因分型工具。我们正在从两个机构收集数据以验证和评估其在实际中的表现,支持其整合到数字病理工作流程中,并推动胰腺导管腺癌的精准肿瘤学发展。
Summary / 总结
The study aims to develop a deep learning framework, PanSubNet, to predict clinically relevant molecular subtypes of pancreatic ductal adenocarcinoma (PDAC) directly from routine histopathology slides, addressing the limitations of current molecular subtyping methods. PanSubNet uses a dual-scale architecture that combines cellular-level morphology with tissue-level architecture, achieving high accuracy with an AUC of 88.5% on internal validation and robust generalizability on an independent cohort (AUC 84.0%). The model preserves and enhances prognostic stratification compared to RNA-seq based labels and aligns with established transcriptomic programs and markers, offering a cost-effective and interpretable tool for genetic subtyping in PDAC management.
研究旨在开发一个深度学习框架PanSubNet,直接从标准苏木精和伊红(H&E)染色的全切片图像(WSI)中预测胰腺导管腺癌(PDAC)的临床相关分子亚型。PanSubNet采用双尺度架构,结合细胞级形态学与组织级结构,内部验证的AUC为88.5%,外部验证的AUC为84.0%,并且模型在保留和改善预后分层方面表现优于RNA-seq基线标签。
Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment
Authors: Fanqi Yu, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino
First: 2026-03-11T16:13:19+00:00 · Latest: 2026-03-11T16:13:19+00:00
Abstract
We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot's state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.
中文标题/摘要
标题:终身模仿学习:多模态潜在重播与增量调整
我们提出了一种终身模仿学习框架,能够在现实的内存和数据约束下,实现跨序列任务的持续策略优化。我们的方法不同于传统的经验重播,完全在多模态潜在空间中操作,将视觉、语言和机器人状态信息的紧凑表示存储和重用以支持未来的学习。为了进一步稳定适应,我们引入了一种增量特征调整机制,通过角度间隔约束来规范任务嵌入的演变,从而保持任务间的差异性。我们的方法在LIBERO基准测试中达到了新的最先进水平,在AUC上取得了10-17分的提升,并且与之前领先方法相比,遗忘率最多降低了65%。消融研究证实了每个组件的有效性,显示了相对于替代策略的一致性改进。代码可在:https://github.com/yfqi/lifelong_mlr_ifa 获取。
Summary / 总结
The research aims to develop a lifelong imitation learning framework for continual policy refinement under memory and data constraints. It uses a multimodal latent space to store compact representations of visual, linguistic, and robot state information, which are reused for future learning. An incremental feature adjustment mechanism with an angular margin constraint is introduced to stabilize adaptation and preserve task distinctiveness. The method outperforms previous approaches in the LIBERO benchmarks, achieving significant improvements in AUC and reducing forgetting by up to 65%. Ablation studies validate the effectiveness of each component.
研究旨在开发一种在内存和数据限制下的终身模仿学习框架,以实现持续的策略改进。该方法使用多模态隐空间存储视觉、语言和机器人状态信息的紧凑表示,并在未来的学习中重用这些信息。引入了具有角度边距约束的增量特征调整机制,以稳定适应并保持任务的差异性。该方法在LIBERO基准测试中优于先前的方法,实现了显著的AUC改进,并将遗忘率降低了最多65%。消融研究验证了每个组件的有效性。
BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs
Authors: Sheshansh Agrawal, Thien Hang Nguyen, Douwe Kiela
First: 2026-02-05T08:41:00+00:00 · Latest: 2026-03-11T16:12:46+00:00
Abstract
Selecting the top $m$ from $n$ items via expensive $k$-wise comparisons is central to settings ranging from LLM-based document reranking to crowdsourced evaluation and tournament design. Existing methods either rely on heuristics that fail to fully exploit the information each comparison reveals, or are inefficient when they do. We introduce a tournament graph framework that provides a principled foundation for $k$-wise ranking. Our key observation is that each $k$-item comparison reveals a complete tournament of $\binom{k}{2}$ pairwise preferences; aggregating these into a global preference graph and computing its transitive closure yields many additional orderings without further oracle calls. We formalize when an item's rank is certifiably determined and design a greedy query schedule that maximizes information gain towards identifying the top-$m$ items. The framework also gracefully handles non-transitive preferences (cycles induced by real-world oracles) by collapsing them into equivalence classes that yield principled tiered rankings. Applied to LLM reranking across 14 benchmarks and 5 models, our method achieves Pareto dominance over existing approaches: matching or exceeding accuracy while requiring 25-40% fewer tokens than comparable methods, and $7\times$ fewer than pairwise reranking at near-identical quality.
中文标题/摘要
标题:BLITZRANK:基于锦标赛图的原理性零样本排名代理
通过昂贵的$k$元比较从$n$个项目中选择前$m$个项目,是从基于LLM的文档重排序到众包评估和锦标赛设计等多个场景的核心问题。现有方法要么依赖于未能充分利用每次比较所揭示信息的启发式方法,要么在利用这些信息时效率低下。我们提出了一种锦标赛图框架,为$k$元排名提供了原理性的基础。我们的关键观察是,每次$k$项比较揭示了一个完整的包含$\binom{k}{2}$个两两偏好的锦标赛;将这些偏好聚合到全局偏好图中并计算其传递闭包,可以得到许多额外的排序而无需进一步的oracle调用。我们形式化了何时一项的排名可以被认证确定,并设计了一种贪婪的查询调度,以最大化识别前$m$项的信息增益。该框架还优雅地处理了非传递性偏好(由现实世界oracle引起的循环),通过将它们合并为等价类来生成有原则的分层排名。应用于14个基准和5个模型的LLM重排序,我们的方法实现了对现有方法的帕累托占优:在匹配或超过准确率的同时,比可比方法少需要25-40%的令牌,并且在质量相近的情况下,所需令牌仅为两两重排序的七分之一。
Summary / 总结
The paper introduces BLITZRANK, a method for selecting the top $m$ items from $n$ via $k$-wise comparisons. It uses a tournament graph framework that aggregates the $\binom{k}{2}$ pairwise preferences revealed by each $k$-wise comparison and takes the transitive closure, inferring additional orderings without further oracle calls. On LLM reranking, the method achieves Pareto dominance over existing approaches, matching or exceeding accuracy while using 25-40% fewer tokens than comparable methods and $7\times$ fewer than pairwise reranking.
论文提出了BLITZRANK方法,通过$k$-wise比较从$n$个物品中选出前$m$个。它利用tournament图框架聚合每次$k$-wise比较中的两两偏好,并计算传递闭包来推断更多排序,无需额外的oracle调用。应用于LLM重排序,BLITZRANK在达到或超过现有方法准确率的同时,比可比方法少用25-40%的token,并且在质量相近的情况下,token使用量仅为两两重排序的七分之一($7\times$更少)。
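The key observation, that each $k$-wise comparison yields $\binom{k}{2}$ pairwise edges whose transitive closure implies further orderings, can be sketched as below. This is an illustrative reconstruction and not the authors' implementation: the function names are hypothetical, preferences are assumed to be acyclic, and the greedy query schedule and the collapsing of cycles into equivalence classes are omitted.

```python
from itertools import combinations


def add_comparison(beats, ranked):
    """Record one k-wise result (best item first) as all C(k, 2) pairwise edges."""
    for winner, loser in combinations(ranked, 2):
        beats.setdefault(winner, set()).add(loser)
        beats.setdefault(loser, set())


def transitive_closure(beats):
    """Repeatedly add 'a beats c' whenever 'a beats b' and 'b beats c' are known."""
    closure = {item: set(losers) for item, losers in beats.items()}
    changed = True
    while changed:
        changed = False
        for item in closure:
            reachable = set()
            for loser in closure[item]:
                reachable |= closure[loser]
            if not reachable <= closure[item]:
                closure[item] |= reachable
                changed = True
    return closure


def certified_top(closure, m):
    """Items known to beat at least n - m others must be in the top m (acyclic case)."""
    n = len(closure)
    return sorted(item for item, losers in closure.items() if len(losers) >= n - m)


beats = {}
add_comparison(beats, ["a", "b", "c"])  # oracle says a > b > c
add_comparison(beats, ["c", "d", "e"])  # oracle says c > d > e
direct_edges = sum(len(losers) for losers in beats.values())
closure = transitive_closure(beats)
implied_edges = sum(len(losers) for losers in closure.values())
top_two = certified_top(closure, m=2)
```

Two 3-wise comparisons give 6 direct edges, and the closure fills in the remaining 4 pairs of the 5 items, so the full order is determined without any further comparison.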
EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation
Authors: Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen
Venue: ICLR 2026
First: 2024-10-28T17:59:03+00:00 · Latest: 2026-03-11T16:03:41+00:00
Comments: ICLR 2026 workshops. Code: https://github.com/NVlabs/EoRA
Abstract
While post-training compression techniques effectively reduce the memory footprint, latency, and power consumption of Large Language Models (LLMs), they often result in noticeable accuracy degradation and remain limited by hardware and kernel constraints that restrict supported compression formats, ultimately reducing flexibility across a wide range of deployment scenarios. In this work, we propose EoRA, a novel fine-tuning-free method that augments compressed LLMs with low-rank matrices, allowing users to rapidly enhance task-specific performance and freely balance the trade-off between accuracy and computational overhead beyond the constraints of compression formats. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs, achieving notable accuracy improvements (e.g., $\mathbf{10.84\%}$ on ARC-Challenge, $\mathbf{6.74\%}$ on MathQA, and $\mathbf{11.45\%}$ on GSM8K) for LLaMA3-8B compressed to 3-bit. We also introduce an optimized CUDA kernel, accelerating inference by up to 1.4x and reducing memory overhead through quantizing EoRA. Overall, EoRA offers a prompt solution for improving the accuracy of compressed models under varying user requirements, enabling more efficient and flexible deployment of LLMs. Code is available at https://github.com/NVlabs/EoRA.
中文标题/摘要
标题:EoRA:基于特征空间低秩逼近的压缩大语言模型补偿方法
虽然后训练压缩技术有效地减少了大语言模型(LLM)的内存占用、延迟和功耗,但它们通常会导致明显的准确度下降,并且受限于硬件和内核约束,限制了支持的压缩格式,最终减少了在广泛部署场景中的灵活性。在本文中,我们提出了一种名为EoRA的新型无微调方法,该方法通过低秩矩阵增强压缩的LLM,使用户能够快速提升任务特定性能,并自由平衡准确度和计算开销之间的权衡,超越压缩格式的限制。EoRA在恢复压缩LLM的准确度方面始终优于先前的无训练低秩方法,在压缩到3比特的LLaMA3-8B上实现了显著的准确度提升(例如,在ARC-Challenge上提高了10.84%,在MathQA上提高了6.74%,在GSM8K上提高了11.45%)。我们还引入了一个优化的CUDA内核,通过量化EoRA加速推理多达1.4倍,并减少内存开销。总体而言,EoRA为满足不同用户需求提高压缩模型的准确度提供了一种简便的解决方案,使大语言模型的部署更加高效和灵活。代码可在https://github.com/NVlabs/EoRA获取。
Summary / 总结
EoRA is a fine-tuning-free method that enhances the performance of compressed LLMs by adding low-rank matrices, allowing users to balance accuracy and computational overhead. It outperforms previous training-free low-rank methods, achieving significant accuracy improvements on various benchmarks. EoRA also includes an optimized CUDA kernel that accelerates inference and reduces memory overhead.
EoRA 是一种无需微调的方法,通过添加低秩矩阵来提升压缩 LLM 的性能,允许用户调整准确率与计算开销之间的权衡。该方法在 ARC-Challenge、MathQA 和 GSM8K 上实现了显著的准确率提升,优于之前的无需训练的低秩方法。此外,该方法还包含一个优化的 CUDA 内核,可以加速推理并减少内存开销。
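The general idea of compensating a compressed weight matrix with a low-rank term can be sketched in a few lines. This is not EoRA itself: the abstract does not spell out the eigenspace projection, so the sketch below fits a plain rank-1 approximation to the raw compression residual by power iteration. The matrices, the 0.5-step quantization grid, and all function names are hypothetical.

```python
from math import sqrt


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius(m):
    return sqrt(sum(a * a for row in m for a in row))


def rank1_compensation(w, w_q, iters=50):
    """Rank-1 factors (u, v) of the residual W - W_q, found by power iteration."""
    residual = [[a - b for a, b in zip(rw, rq)] for rw, rq in zip(w, w_q)]
    residual_t = transpose(residual)
    v = [1.0] * len(residual[0])
    for _ in range(iters):
        v = matvec(residual_t, matvec(residual, v))
        norm = sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break  # start vector lies in the null space; no compensation found
        v = [x / norm for x in v]
    u = matvec(residual, v)  # singular value times the left singular vector
    return u, v


w = [[0.9, -0.4], [0.2, 0.7]]    # hypothetical full-precision weights
w_q = [[1.0, -0.5], [0.0, 0.5]]  # the same weights rounded to a 0.5 grid
u, v = rank1_compensation(w, w_q)
compensated = [[q + ui * vj for q, vj in zip(row, v)] for row, ui in zip(w_q, u)]

before = frobenius([[a - b for a, b in zip(rw, rq)] for rw, rq in zip(w, w_q)])
after = frobenius([[a - b for a, b in zip(rw, rc)] for rw, rc in zip(w, compensated)])
```

The compensated layer would compute `w_q @ x + u * (v @ x)`, so the extra cost grows with the rank of the added term, which is the accuracy versus overhead trade-off the abstract describes.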