arXiv Paper Digest

2026-01-06 03:27
Snapshot: 20260106_0327
Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning
Authors: Valentin Noël
First: 2026-01-02T18:49:37+00:00 · Latest: 2026-01-02T18:49:37+00:00
Comments: 58 pages, 19 figures, Under Review
Abstract
We present a training-free method for detecting valid mathematical reasoning in large language models through spectral analysis of attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs over tokens, we extract four interpretable spectral diagnostics, the Fiedler value (algebraic connectivity), high-frequency energy ratio (HFER), graph signal smoothness, and spectral entropy, that exhibit statistically significant differences between valid and invalid mathematical proofs. Experiments across seven transformer models from four independent architectural families (Meta Llama, Alibaba Qwen, Microsoft Phi, and Mistral AI) demonstrate that this spectral signature produces effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling 85.0--95.6\% classification accuracy under rigorous evaluation, with calibrated thresholds reaching 93--95\% on the full dataset. The method requires no training data, fine-tuning, or learned classifiers: a single threshold on a spectral metric suffices for high accuracy. Through systematic label correction, we discover that the spectral method detects logical coherence rather than compiler acceptance, identifying mathematically valid proofs that formal verifiers reject due to technical failures. We further identify an architectural dependency: Mistral-7B's Sliding Window Attention shifts the discriminative signal from HFER to late-layer Smoothness ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$), revealing that attention mechanism design affects which spectral features capture reasoning validity. These findings establish spectral graph analysis as a principled framework for reasoning verification with immediate applications to hallucination detection and AI safety monitoring.
中文标题/摘要
标题:理性几何:有效数学推理的光谱特征
我们提出了一种无需训练的方法,通过光谱分析注意力模式来检测大型语言模型中的有效数学推理。通过将注意力矩阵视为动态图的邻接矩阵,我们提取了四个可解释的光谱诊断指标:Fiedler 值(代数连通性)、高频能量比(HFER)、图信号平滑性和光谱熵,这些指标在有效和无效数学证明之间表现出统计学上的显著差异。在四个独立架构家族(Meta Llama、阿里巴巴 Qwen、微软 Phi 和 Mistral AI)的七个变压器模型上进行的实验表明,这种光谱特征产生的效应大小高达 Cohen's $d = 3.30$ ($p < 10^{-116}$),在严格的评估下可实现 85.0–95.6% 的分类准确率,且在完整数据集上校准的阈值达到 93–95%。该方法不需要训练数据、微调或学习分类器:只需一个光谱指标的阈值即可实现高准确率。通过系统性的标签修正,我们发现光谱方法检测的是逻辑连贯性而非编译器接受,识别出形式验证器因技术故障而拒绝的数学上有效的证明。我们还发现一种架构依赖性:Mistral-7B 的滑动窗口注意力将区分信号从 HFER 转移到晚期层平滑性 ($d = 2.09$, $p_{\text{MW}} = 1.16 \times 10^{-48}$),揭示了注意力机制设计影响哪些光谱特征捕捉推理有效性的事实。这些发现确立了光谱图分析作为推理验证的原理性框架,并立即应用于幻觉检测和 AI 安全监控。
Summary / 总结
The study introduces a training-free method to detect valid mathematical reasoning in large language models by analyzing spectral diagnostics derived from attention patterns. By treating attention matrices as adjacency matrices of dynamic graphs, four spectral diagnostics—Fiedler value, high-frequency energy ratio, graph signal smoothness, and spectral entropy—are extracted, showing significant differences between valid and invalid proofs. Experiments across seven transformer models from different architectural families demonstrate high classification accuracy (85.0–95.6%) and calibrated thresholds (93–95%) with these spectral signatures. The method identifies logical coherence rather than compiler acceptance and reveals architectural dependencies affecting the discriminative signal.
研究提出了一种无需训练的方法,通过分析注意力模式的谱特性来识别大型语言模型中的有效数学推理。通过将注意力矩阵转换为动态图,作者提取了四个谱诊断指标:Fiedler值、高频率能量比、图信号平滑性和谱熵。这些指标在有效和无效证明之间显示出显著差异。实验表明,这些谱签名在七个来自四个架构家族的变压器模型中实现了高分类准确率(85.0–95.6%)和校准阈值(93–95%)。该方法检测逻辑连贯性而非编译接受,并揭示了架构依赖性如何影响区分信号。
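The four diagnostics named in the abstract can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption rather than the paper's definition: the attention matrix is symmetrized into an undirected graph, the combinatorial Laplacian is used, "high frequency" means the upper half of the Laplacian spectrum (`cutoff=0.5`), and smoothness is reported as a normalized Dirichlet energy (lower means smoother). The function name `spectral_diagnostics` is hypothetical.

```python
import numpy as np

def spectral_diagnostics(attn, signal, cutoff=0.5):
    """attn: (n, n) attention weights; signal: (n, d) token features.

    Illustrative definitions only; the paper's exact formulas may differ.
    """
    w = 0.5 * (attn + attn.T)             # symmetrize into an undirected graph
    np.fill_diagonal(w, 0.0)              # drop self-loops
    lap = np.diag(w.sum(axis=1)) - w      # combinatorial Laplacian L = D - W
    evals, evecs = np.linalg.eigh(lap)    # eigenvalues in ascending order
    coeffs = evecs.T @ signal             # graph Fourier transform of the signal
    energy = (coeffs ** 2).sum(axis=1)    # energy per graph frequency
    total = energy.sum()
    k = int(np.ceil(cutoff * len(evals))) # first index counted as "high frequency"
    p = energy / total
    p_nz = p[p > 0]
    return {
        "fiedler": float(evals[1]),       # second-smallest Laplacian eigenvalue
        "hfer": float(energy[k:].sum() / total),
        "dirichlet_energy": float(
            np.trace(signal.T @ lap @ signal) / np.trace(signal.T @ signal)
        ),
        "spectral_entropy": float(-(p_nz * np.log(p_nz)).sum()),
    }
```

A single-threshold classifier in the spirit of the abstract would then compare one of these values, computed at a chosen layer and head, against a calibrated cutoff.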
Adapting Natural Language Processing Models Across Jurisdictions: A pilot Study in Canadian Cancer Registries
Authors: Jonathan Simkin, Lovedeep Gondara, Zeeshan Rizvi, Gregory Doyle, Jeff Dowden, Dan Bond, Desmond Martin, Raymond Ng
First: 2026-01-02T18:46:19+00:00 · Latest: 2026-01-02T18:46:19+00:00
Abstract
Population-based cancer registries depend on pathology reports as their primary diagnostic source, yet manual abstraction is resource-intensive and contributes to delays in cancer data. While transformer-based NLP systems have improved registry workflows, their ability to generalize across jurisdictions with differing reporting conventions remains poorly understood. We present the first cross-provincial evaluation of adapting BCCRTron, a domain-adapted transformer model developed at the British Columbia Cancer Registry, alongside GatorTron, a biomedical transformer model, for cancer surveillance in Canada. Our training dataset consisted of approximately 104,000 and 22,000 de-identified pathology reports from the Newfoundland & Labrador Cancer Registry (NLCR) for Tier 1 (cancer vs. non-cancer) and Tier 2 (reportable vs. non-reportable) tasks, respectively. Both models were fine-tuned using complementary synoptic and diagnosis-focused report-section input pipelines. Across NLCR test sets, the adapted models maintained high performance, demonstrating that transformers pretrained in one jurisdiction can be localized to another with modest fine-tuning. To improve sensitivity, we combined the two models using a conservative OR-ensemble, achieving a Tier 1 recall of 0.99 and reducing missed cancers to 24, compared with 48 and 54 for the standalone models. For Tier 2, the ensemble achieved 0.99 recall and reduced missed reportable cancers to 33, compared with 54 and 46 for the individual models. These findings demonstrate that an ensemble combining complementary text representations substantially reduces missed cancers and improves error coverage in cancer-registry NLP. We implement a privacy-preserving workflow in which only model weights are shared between provinces, supporting interoperable NLP infrastructure and a future pan-Canadian foundation model for cancer pathology and registry workflows.
中文标题/摘要
标题:跨司法管辖区适应自然语言处理模型:加拿大癌症登记处试点研究
基于人口的癌症登记处依赖于病理报告作为其主要诊断来源,但手动提取信息资源密集且会导致癌症数据延迟。虽然基于变换器的NLP系统已改善了登记流程,但它们在不同报告惯例的司法管辖区中的泛化能力仍不明确。我们首次对跨省评估了在不列颠哥伦比亚癌症登记处开发的领域适应变换器模型BCCRTron及其与生物医学变换器模型GatorTron的适应性进行研究,用于加拿大的癌症监测。我们的训练数据集包括来自纽芬兰与拉布拉多癌症登记处(NLCR)的约104,000份和22,000份匿名病理报告,分别用于第一级(癌症 vs. 非癌症)和第二级(可报告 vs. 非可报告)任务。两种模型均使用互补的综合报告和诊断重点报告部分输入管道进行了微调。在NLCR测试集中,适应后的模型保持了高性能,表明在一处司法管辖区预训练的变换器可以在另一处通过适度微调进行本地化。为了提高敏感性,我们使用保守的OR-集成组合了两种模型,第一级召回率达到0.99,漏诊癌症减少至24例,而单个模型分别为48例和54例。对于第二级,集成模型的召回率为0.99,漏诊可报告癌症减少至33例,而单个模型分别为54例和46例。这些发现表明,结合互补文本表示的集成模型在癌症登记处NLP中显著减少了漏诊癌症并提高了错误覆盖范围。我们实施了一种隐私保护的工作流程,仅在省之间共享模型权重,支持互操作的NLP基础设施,并为癌症病理和登记流程建立未来泛加拿大的基础模型。
Summary / 总结
The study aims to evaluate the adaptability of transformer-based NLP models across different jurisdictions for cancer surveillance. BCCRTron, a domain-adapted transformer model, and GatorTron, a biomedical transformer model, were fine-tuned using pathology reports from the Newfoundland & Labrador Cancer Registry. The adapted models maintained high performance across the test sets, and an ensemble of the two models improved recall and reduced missed cancers for both Tier 1 and Tier 2 tasks. This demonstrates that transformer models pretrained in one jurisdiction can be effectively localized with modest fine-tuning and that combining complementary models can further enhance accuracy in cancer registry NLP workflows.
研究旨在评估基于变压器的NLP模型在不同司法管辖区的适应性,以支持癌症登记工作流程。BCCRTron和GatorTron两种模型分别在纽芬兰与拉布拉多癌症登记处的匿名病理报告上进行了微调。经过调整的模型保持了高性能,而两种模型的组合提高了召回率并减少了漏诊癌症的数量,适用于第一级和第二级任务。这表明,通过适度微调和隐私保护的工作流程,本地化的变压器模型可以支持癌症监测。
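The OR-ensemble described in the abstract can be sketched in a few lines: a report is flagged positive if either model flags it, so a positive case is missed only when both models miss it. The helper names and the toy labels below are made up for illustration and are not the registry data.

```python
def or_ensemble(pred_a, pred_b):
    """Flag a report as positive if either model flags it (0/1 predictions)."""
    return [int(a == 1 or b == 1) for a, b in zip(pred_a, pred_b)]

def recall(y_true, y_pred):
    """Fraction of true positives that were flagged."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Toy example: each model misses two of the four positives, but not the same two.
y_true  = [1, 1, 1, 1, 0, 0]
model_a = [1, 1, 0, 0, 0, 0]
model_b = [0, 1, 1, 0, 0, 1]
ensemble = or_ensemble(model_a, model_b)
```

The trade-off is that false positives from either model also pass through, which is why the abstract calls the rule conservative: it favours sensitivity over precision.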
FedHypeVAE: Federated Learning with Hypernetwork Generated Conditional VAEs for Differentially Private Embedding Sharing
Authors: Sunny Gupta, Amit Sethi
First: 2026-01-02T18:40:41+00:00 · Latest: 2026-01-02T18:40:41+00:00
Comments: 10 pages, 1 figure, Accepted at AAI'26
Abstract
Federated data sharing promises utility without centralizing raw data, yet existing embedding-level generators struggle under non-IID client heterogeneity and provide limited formal protection against gradient leakage. We propose FedHypeVAE, a differentially private, hypernetwork-driven framework for synthesizing embedding-level data across decentralized clients. Building on a conditional VAE backbone, we replace the single global decoder and fixed latent prior with client-aware decoders and class-conditional priors generated by a shared hypernetwork from private, trainable client codes. This bi-level design personalizes the generative layer, rather than the downstream model, while decoupling local data from communicated parameters. The shared hypernetwork is optimized under differential privacy, ensuring that only noise-perturbed, clipped gradients are aggregated across clients. A local MMD alignment between real and synthetic embeddings and a Lipschitz regularizer on hypernetwork outputs further enhance stability and distributional coherence under non-IID conditions. After training, a neutral meta-code enables domain-agnostic synthesis, while mixtures of meta-codes provide controllable multi-domain coverage. FedHypeVAE unifies personalization, privacy, and distribution alignment at the generator level, establishing a principled foundation for privacy-preserving data synthesis in federated settings. Code: github.com/sunnyinAI/FedHypeVAE
中文标题/摘要
标题:FedHypeVAE:联邦学习中的超网络生成条件VAE差分隐私嵌入共享
联邦数据共享可以在不集中原始数据的情况下提供实用性,但现有的嵌入级生成器在面对非IID客户端异构性时表现不佳,并且提供的正式保护有限,以防止梯度泄漏。我们提出了一种名为FedHypeVAE的差分隐私框架,该框架由超网络驱动,用于在分散的客户端之间合成嵌入级数据。基于条件VAE架构,我们用客户端感知的解码器和由共享超网络从私有的可训练客户端代码生成的类条件先验替换单一的全局解码器和固定的先验。这种两层设计个性化了生成层而不是下游模型,同时将本地数据与通信参数解耦。共享的超网络在差分隐私下进行优化,确保只有噪声扰动和裁剪后的梯度在客户端之间聚合。真实嵌入和合成嵌入之间的局部MMD对齐以及超网络输出的Lipschitz正则化进一步增强了在非IID条件下的一致性和分布稳定性。训练完成后,中立的元代码实现领域无关的合成,而元代码的混合提供可控的多领域覆盖。FedHypeVAE在生成器级别统一了个性化、隐私和分布对齐,为联邦设置中的隐私保护数据合成奠定了原则性的基础。代码:github.com/sunnyinAI/FedHypeVAE
Summary / 总结
FedHypeVAE is a federated learning framework that uses a hypernetwork to generate conditional VAEs for differentially private embedding sharing. It addresses the challenges of non-IID client heterogeneity and gradient leakage by personalizing the generative layers with client-aware decoders and class-conditional priors, while ensuring differential privacy through noise-perturbed gradient aggregation. The framework enhances stability and distributional coherence with local MMD alignment and a Lipschitz regularizer, allowing for domain-agnostic and controllable multi-domain synthesis after training.
FedHypeVAE 是一种联邦学习框架,通过超网络生成客户端特定的解码器和类别条件先验,实现差分隐私下的嵌入共享。它用客户端特定的解码器和先验替代了全局解码器和固定先验,在不集中数据的情况下实现个性化。该框架包括在差分隐私下优化的共享超网络,以及 MMD 对齐和 Lipschitz 正则化等技术,以增强在非IID条件下的稳定性和分布一致性。训练完成后,中立的元代码支持领域无关的合成,元代码的混合则提供可控的多领域覆盖。
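The phrase "noise-perturbed, clipped gradients are aggregated across clients" corresponds to the familiar clip-then-add-Gaussian-noise pattern. The sketch below is a generic illustration of that pattern, not FedHypeVAE's implementation: it assumes each client privatizes its own gradient before the server averages, and the function names, the `noise_multiplier` parameter, and the flat-list gradient representation are all hypothetical.

```python
import math
import random

def clip_to_norm(grad, max_norm):
    """Scale the gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def privatize(grad, max_norm, noise_multiplier, rng):
    """Clip, then add Gaussian noise with std = noise_multiplier * max_norm."""
    sigma = noise_multiplier * max_norm
    return [g + rng.gauss(0.0, sigma) for g in clip_to_norm(grad, max_norm)]

def average_updates(updates):
    """Server-side mean of the already-privatized client gradients."""
    n = len(updates)
    return [sum(u[i] for u in updates) / n for i in range(len(updates[0]))]
```

Clipping bounds how much any single client can influence the aggregate, and the noise scale is tied to that bound; the resulting privacy guarantee depends on accounting details the abstract does not give.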
Benchmark Success, Clinical Failure: When Reinforcement Learning Optimizes for Benchmarks, Not Patients
Authors: Armin Berger, Manuela Bergau, Helen Schneider, Saad Ahmad, Tom Anglim Lagones, Gianluca Brugnara, Martha Foltyn-Dumitru, Kai Schlamp, Philipp Vollmuth, Rafet Sifa
First: 2025-12-28T21:57:42+00:00 · Latest: 2026-01-02T18:25:09+00:00
Abstract
Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved reasoning tasks, yet their resource-constrained application to medical imaging remains underexplored. We introduce ChexReason, a vision-language model trained via R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors high-resource models like NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than scale. We identify a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization, indicating teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployment requiring robustness across diverse populations.
中文标题/摘要
标题:基准成功,临床失败:当强化学习优化基准而非患者
近期针对大型语言模型(LLMs)的强化学习(RL)进展在推理任务上取得了改进,但其在医疗成像领域的资源受限应用仍被严重忽视。我们引入了ChexReason,这是一种通过R1风格方法(SFT后接GRPO)训练的视觉-语言模型,仅使用了2,000个SFT样本、1,000个RL样本和一个A100 GPU。在CheXpert和NIH基准上的评估揭示了一个根本性的矛盾:GRPO恢复了分布内性能(在CheXpert上提高了23%,宏F1分数为0.346),但降低了跨数据集的迁移性(在NIH上下降了19%)。这与高资源模型如NV-Reason-CXR-3B的表现相似,表明问题可能源自RL范式而非规模。我们发现了一种泛化悖论,即SFT检查点在优化前对NIH的性能有所提升,表明教师引导的推理捕捉到了更多机构无关的特征。此外,跨模型比较显示,结构化推理框架对通用视觉语言模型有益,但对医学预训练模型的增益有限。因此,精心策划的监督微调可能在需要跨多样人群稳健性的临床部署中优于激进的RL方法。
Summary / 总结
The paper explores the application of Reinforcement Learning (RL) to medical imaging using ChexReason, a vision-language model trained with limited resources. GRPO improves in-distribution performance on CheXpert but degrades cross-dataset transferability to NIH, highlighting a fundamental tension. The study suggests that the RL paradigm itself may be the issue, rather than model scale, and identifies a generalization paradox where the SFT checkpoint uniquely improves on NIH before optimization. The research indicates that curated supervised fine-tuning might be more effective for clinical deployment requiring robustness across diverse populations.
研究探讨了在医疗影像中应用强化学习(RL)的方法,使用了仅用少量资源训练的视觉-语言模型ChexReason。GRPO提高了该模型在CheXpert上的分布内性能,但降低了其向NIH数据集的跨数据集迁移性,表明基准成功与临床应用之间存在根本矛盾。研究指出,这一问题可能是由RL范式本身引起的,而不是规模不足。关键发现包括SFT检查点在优化前对NIH的独特改进以及对医学预训练模型而言,结构化推理带来的增益有限,这表明精心策划的监督微调可能比激进的RL更适合需要跨不同人群稳健性的临床部署。
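The headline metric in the abstract is macro-F1, which averages per-label F1 scores without weighting by label frequency, so rare findings count as much as common ones. A minimal sketch for multi-label predictions follows; the 0/1 vector layout and the convention of scoring an empty label as 0.0 are assumptions, not details from the paper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined here as 0.0 when the label never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of equal-length 0/1 label vectors, one per image."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        scores.append(f1_from_counts(tp, fp, fn))
    return sum(scores) / n_labels  # unweighted mean over labels
```

Comparing this score on an in-distribution test set against a held-out dataset is the kind of check that exposes the transfer gap the abstract reports.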
Semantic Anchor Transport: Robust Test-Time Adaptation for Vision-Language Models
Authors: Shambhavi Mishra, Julio Silva-Rodriguez, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz
First: 2024-11-26T00:15:37+00:00 · Latest: 2026-01-02T18:18:27+00:00
Comments: Added additional figures to communicate the algorithm
Abstract
Large pre-trained vision-language models (VLMs), such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance is significantly degraded. In this work, we investigate how to efficiently utilize class text information to mitigate distribution drifts encountered by VLMs during inference. In particular, we propose generating pseudo-labels for the noisy test-time samples by aligning visual embeddings with reliable, text-based semantic anchors. Specifically, to maintain the regular structure of the dataset properly, we formulate the problem as a batch-wise label assignment, which is efficiently solved using Optimal Transport. Our method, Semantic Anchor Transport (SAT), utilizes such pseudo-labels as supervisory signals for test-time adaptation, yielding a principled cross-modal alignment solution. Moreover, SAT further leverages heterogeneous textual clues, with a multi-template distillation approach that replicates multi-view contrastive learning strategies in unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of SAT, achieving consistent performance gains over recent state-of-the-art methods, yet being computationally efficient.
中文标题/摘要
标题:语义锚点传输:视觉语言模型的鲁棒测试时适应
大型预训练视觉语言模型(VLMs),如CLIP,在广泛的任务中展示了前所未有的零样本性能。然而,这些模型在分布变化下可能不可靠,其性能会显著下降。在本文中,我们研究了如何高效利用类别文本信息来缓解VLMs在推理过程中遇到的分布漂移。特别是,我们提出通过将视觉嵌入与可靠的、基于文本的语义锚点对齐来生成噪声测试样本的伪标签。具体而言,为了保持数据集的正常结构,我们将问题形式化为批量标签分配问题,该问题可以使用最优传输高效求解。我们的方法,语义锚点传输(SAT),利用这些伪标签作为测试时适应的监督信号,提供了一种原理性的跨模态对齐解决方案。此外,SAT进一步利用了异构文本线索,通过多模板蒸馏方法复制无监督表示学习中的多视图对比学习策略,而不增加额外的计算复杂度。在多个流行的测试时适应基准上的广泛实验中,SAT在多种复杂性上表现出优越性,相对于最近的先进方法实现了持续的性能提升,同时计算效率高。
Summary / 总结
This work addresses the issue of distributional shifts in large pre-trained vision-language models (VLMs) like CLIP, which can degrade their performance. The authors propose Semantic Anchor Transport (SAT), a method that generates pseudo-labels for test-time samples by aligning visual embeddings with reliable text-based semantic anchors using Optimal Transport. SAT then uses these pseudo-labels for test-time adaptation, achieving consistent performance gains over recent state-of-the-art methods while maintaining computational efficiency. Extensive experiments on various benchmarks demonstrate SAT's effectiveness in cross-modal alignment and test-time adaptation.
该研究针对大型预训练视觉-语言模型(如CLIP)在分布变化下性能下降的问题,提出了一种名为Semantic Anchor Transport (SAT)的方法。该方法通过使用最优传输将视觉嵌入与可靠的文本语义锚点对齐来生成测试样本的伪标签,并利用这些伪标签进行测试时的自适应,实现了与最新最优方法相比的一致性能提升,同时保持了计算效率。在多种基准测试上的广泛实验表明了SAT的有效性。
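Batch-wise label assignment via Optimal Transport is commonly solved with entropic regularization and Sinkhorn iterations. The sketch below illustrates that general technique, not SAT's actual solver: it assumes uniform marginals (every sample carries equal mass and every class receives an equal share of the batch), and the name `sinkhorn_assign` and the parameters `epsilon` and `n_iters` are hypothetical.

```python
import numpy as np

def sinkhorn_assign(sim, epsilon=0.05, n_iters=50):
    """sim: (batch, classes) image-text similarities.

    Returns soft pseudo-labels whose rows sum to 1, with class usage
    pushed towards an even split over the batch (assumed uniform prior).
    """
    b, k = sim.shape
    q = np.exp((sim - sim.max()) / epsilon)  # Gibbs kernel, shifted for stability
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)    # normalize columns (classes)
        q /= k
        q /= q.sum(axis=1, keepdims=True)    # normalize rows (samples)
        q /= b
    return q * b                             # rescale so each row sums to 1
```

Unlike a per-sample argmax over similarities, the column constraint stops one dominant class from absorbing the whole batch, which is one way to read the abstract's remark about maintaining the structure of the dataset.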
Investigating the Viability of Employing Multi-modal Large Language Models in the Context of Audio Deepfake Detection
Authors: Akanksha Chuchra, Shukesh Reddy, Sudeepta Mishra, Abhijit Das, Abhinav Dhall
First: 2026-01-02T18:17:22+00:00 · Latest: 2026-01-02T18:17:22+00:00
Comments: Accepted at IJCB 2025
Abstract
While Vision-Language Models (VLMs) and Multimodal Large Language Models (MLLMs) have shown strong generalisation in detecting image and video deepfakes, their use for audio deepfake detection remains largely unexplored. In this work, we explore the potential of MLLMs for audio deepfake detection by combining audio inputs with a range of text prompts as queries, to assess whether MLLMs can learn robust representations across modalities for this task. To that end, we explore text-aware, context-rich, question-answer based prompts with binary decisions. We hypothesise that such feature-guided reasoning will facilitate deeper multimodal understanding and enable robust feature learning for audio deepfake detection. We evaluate the performance of two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in two evaluation modes: (a) zero-shot and (b) fine-tuned. Our experiments demonstrate that combining audio with a multi-prompt approach could be a viable way forward for audio deepfake detection. The models perform poorly without task-specific training and struggle to generalise to out-of-domain data. However, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
中文标题/摘要
标题:探究多模态大型语言模型在音频深度伪造检测中的可行性
尽管视觉-语言模型(VLMs)和多模态大型语言模型(MLLMs)在检测图像和视频深度伪造方面表现出强大的泛化能力,但它们在音频深度伪造检测中的应用仍鲜有探索。本研究旨在探索MLLMs在音频深度伪造检测中的潜力。我们将音频输入与一系列文本提示作为查询相结合,以考察MLLMs能否为音频深度伪造检测学习跨模态的鲁棒表示。为此,我们尝试探索文本感知和语境丰富的问答式提示,并采用二元决策。我们假设这种特征引导的推理将有助于促进更深层次的多模态理解,并使音频深度伪造检测中的特征学习更加稳健。我们评估了两种MLLMs,Qwen2-Audio-7B-Instruct和SALMONN,在两种评估模式下的性能:(a)零样本和(b)微调。我们的实验表明,结合音频与多提示方法可能是音频深度伪造检测的一个可行方向。这些模型在缺乏任务特定训练的情况下表现不佳,并且难以泛化到域外数据。然而,它们在少量监督下对域内数据表现出良好的性能,表明其在音频深度伪造检测中具有良好的潜力。
Summary / 总结
This study investigates the use of Multi-modal Large Language Models (MLLMs) for detecting audio deepfakes. By combining audio inputs with various text prompts, the research explores the potential of MLLMs to learn robust representations across modalities. The study evaluates two MLLMs, Qwen2-Audio-7B-Instruct and SALMONN, in zero-shot and fine-tuned modes. The experiments show that while these models perform poorly without task-specific training, they achieve good performance on in-domain data with minimal supervision, indicating promising potential for audio deepfake detection.
本研究探讨了多模态大型语言模型(MLLMs)在音频深伪检测中的应用,重点在于结合音频输入和文本提示。研究假设文本感知和语境丰富的提示可以增强跨模态的理解和鲁棒特征学习。实验结果显示,Qwen2-Audio-7B-Instruct和SALMONN在零样本设置下表现较差,但在少量监督下对领域内数据表现出良好的性能,表明这些模型在适当训练下有潜在的应用价值。
Brain network science modelling of sparse neural networks enables Transformers and LLMs to perform as fully connected
Authors: Yingtao Zhang, Diego Cerretti, Jialin Zhao, Wenjing Wu, Ziheng Liao, Umberto Michieli, Carlo Vittorio Cannistraci
First: 2025-01-31T13:04:37+00:00 · Latest: 2026-01-02T18:15:12+00:00
Abstract
Dynamic sparse training (DST) can reduce the computational demands in ANNs, but faces difficulties in keeping peak performance at high sparsity levels. The Cannistraci-Hebb training (CHT) is a brain-inspired method for growing connectivity in DST. CHT leverages a gradient-free, topology-driven link regrowth, which has shown ultra-sparse (less than 1% connectivity) advantage across various tasks compared to fully connected networks. Yet, CHT suffers from two main drawbacks: (i) its time complexity is $O(Nd^3)$, where $N$ is the network size in nodes and $d$ is the node degree, restricting it to ultra-sparse regimes; (ii) it selects top link prediction scores, which is inappropriate for the early training epochs, when the network presents unreliable connections. Here, we design the first brain-inspired network model, termed bipartite receptive field (BRF), to initialize the connectivity of sparse artificial neural networks. We further introduce a GPU-friendly matrix-based approximation of CH link prediction, reducing complexity to $O(N^3)$. We introduce the Cannistraci-Hebb training soft rule (CHTs), which adopts a flexible strategy for sampling connections in both link removal and regrowth, balancing the exploration and exploitation of network topology. Additionally, we integrate CHTs with a sigmoid gradual density decay (CHTss). Empirical results show that BRF offers performance advantages over previous network science models. Using 1% of connections, CHTs outperforms fully connected networks in MLP architectures on image classification tasks, compressing some networks to less than 30% of the nodes. Using 5% of the connections, CHTss outperforms fully connected networks in two Transformer-based machine translation tasks. Finally, at 30% connectivity, both CHTs and CHTss outperform other DST methods in language modeling tasks.
中文标题/摘要
标题:大脑网络科学建模的稀疏神经网络使Transformer和大语言模型能够表现得如同全连接网络
动态稀疏训练(DST)可以在减少ANNs的计算需求的同时,但在高稀疏度水平下保持峰值性能方面面临困难。Cannistraci-Hebb训练(CHT)是一种受大脑启发的方法,用于在DST中增加连接性。CHT利用无梯度、拓扑驱动的链接再生长,显示出与全连接网络相比,在各种任务中具有超稀疏(连接率低于1%)的优势。然而,CHT存在两个主要缺点:(i)其时间复杂度为$O(Nd^3)$ - N个节点网络大小,d个节点度 - 限制其仅适用于超稀疏区域。(ii)它选择顶级链接预测得分,在网络呈现不可靠连接的早期训练阶段是不合适的。在这里,我们设计了第一个受大脑启发的网络模型——称为双部分感受野(BRF)——以初始化稀疏人工神经网络的连接性。我们进一步引入了CH链接预测的GPU友好矩阵近似,将复杂度降低到$O(N^3)$。我们引入了Cannistraci-Hebb训练软规则(CHTs),它采用灵活的策略在链接删除和再生长中采样连接,平衡网络拓扑的探索和利用。此外,我们将CHTs与Sigmoid渐进密度衰减(CHTss)结合使用。实验证明,BRF在与之前的大脑网络科学模型相比提供了性能优势。使用1%的连接,CHTs在MLP架构上的图像分类任务中优于全连接网络,压缩某些网络到节点的不到30%。使用5%的连接,CHTss在两个基于Transformer的机器翻译任务中优于全连接网络。最后,在30%的连接性下,CHTs和CHTss在语言建模任务中均优于其他DST方法。
Summary / 总结
The research aims to improve the performance of dynamic sparse training (DST) in artificial neural networks (ANNs) by addressing the limitations of Cannistraci-Hebb training (CHT). The study introduces a bipartite receptive field (BRF) model to initialize connectivity and a GPU-friendly matrix-based approximation of CH link prediction, reducing computational complexity. Additionally, it proposes Cannistraci-Hebb training soft rule (CHTs) and integrates it with sigmoid gradual density decay (CHTss) to balance exploration and exploitation. The experiments show that using 1% of connections, CHTs outperforms fully connected networks in image classification tasks, and using 5% of connections, CHTss outperforms fully connected networks in machine translation tasks. At 30% connectivity, both CHTs and CHTss outperform other DST methods in language modeling tasks.
研究旨在通过借鉴脑网络科学的方法来提升动态稀疏训练神经网络(ANNs)的性能。通过引入脑启发的网络模型——双部分感受野(BRF)模型和灵活的连接采样策略(CHTs),改进了拓扑驱动的链接再生长方法(CHT)。这种方法降低了时间复杂度并在早期训练阶段提高了性能。实验结果表明,CHTs和CHTss在图像分类、机器翻译和语言建模等任务中,使用远少于全连接网络的连接数,仍能实现相当甚至更好的性能。
LLM Agents for Combinatorial Efficient Frontiers: Investment Portfolio Optimization
Authors: Simon Paquette-Greenbaum, Jiangbo Yu
First: 2026-01-02T18:02:13+00:00 · Latest: 2026-01-02T18:02:13+00:00
Abstract
Investment portfolio optimization is a task conducted in all major financial institutions. The Cardinality Constrained Mean-Variance Portfolio Optimization (CCPO) problem formulation is ubiquitous for portfolio optimization. The challenge of this type of portfolio optimization, a mixed-integer quadratic programming (MIQP) problem, arises from the intractability of solutions from exact solvers, where heuristic algorithms are used to find approximate portfolio solutions. CCPO entails many laborious and complex workflows and also requires extensive effort pertaining to heuristic algorithm development, where the combination of pooled heuristic solutions results in improved efficient frontiers. Hence, common approaches are to develop many heuristic algorithms. Agentic frameworks emerge as a promising candidate for many problems within combinatorial optimization, as they have been shown to be equally efficient with regard to automating large workflows and have been shown to be excellent in terms of algorithm development, sometimes surpassing human-level performance. This study implements a novel agentic framework for the CCPO and explores several concrete architectures. In benchmark problems, the implemented agentic framework matches state-of-the-art algorithms. Furthermore, complex workflows and algorithm development efforts are alleviated, while in the worst case, lower but acceptable error is reported.
中文标题/摘要
标题:组合有效前沿的LLM代理:投资组合优化
投资组合优化是所有主要金融机构中的一项任务。基数约束均值-方差投资组合优化(CCPO)问题表述是投资组合优化中普遍存在的形式。这种类型的投资组合优化是一个混合整数二次规划(MIQP)问题,其挑战在于精确求解器难以求解,因此通常使用启发式算法来寻找近似投资组合解决方案。CCPO 包含许多繁琐且复杂的流程,还需要大量的努力来开发启发式算法,其中将多个启发式解汇集组合可以改善有效前沿。因此,常见的方法是开发许多启发式算法。代理框架是组合优化中许多问题的有前途的候选者,因为它们已被证明在自动化大规模工作流方面同样高效,并且在算法开发方面表现出色,有时甚至超过人类水平。本研究实现了一个新颖的代理框架来解决CCPO,并探索了几种具体的架构。在基准问题中,实现的代理框架与最先进的算法相当。此外,复杂的流程和算法开发努力得到了缓解,最坏情况下报告了较低但可接受的误差。
Summary / 总结
This study addresses the challenge of cardinality constrained mean-variance portfolio optimization (CCPO), a mixed-integer quadratic programming problem, by developing an agentic framework. The method involves implementing several concrete architectures to automate complex workflows and improve efficient frontiers. The experimental results show that the agentic framework matches state-of-the-art algorithms in benchmark problems, while reducing the need for extensive heuristic algorithm development efforts and alleviating complex workflows, though with slightly lower but acceptable error in the worst case.
该研究针对投资组合优化中的约束均值-方差投资组合优化(CCPO)问题,这是一个混合整数二次规划问题。作者实现了一个代理框架来自动化复杂的流程和算法开发,基准问题中与最先进的算法匹配。代理框架减少了对大量启发式算法开发的需求,并简化了复杂的工作流程,尽管在最坏情况下报告了略低但可接受的误差。
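For orientation, a commonly used statement of the cardinality constrained mean-variance problem is given below. The abstract does not spell out the paper's formulation, so the symbols are assumptions: $w_i$ is the weight of asset $i$, $z_i$ a binary inclusion indicator, $\mu$ the expected returns, $\Sigma$ the covariance matrix, $K$ the number of assets held, $\varepsilon_i, \delta_i$ the minimum and maximum holding sizes, and $\lambda \in [0, 1]$ the risk-return trade-off swept to trace the efficient frontier.

```latex
\begin{aligned}
\min_{w,\,z}\quad & \lambda\, w^{\top} \Sigma\, w \;-\; (1-\lambda)\, \mu^{\top} w \\
\text{s.t.}\quad  & \sum_{i=1}^{N} w_i = 1, \\
                  & \sum_{i=1}^{N} z_i = K, \\
                  & \varepsilon_i z_i \le w_i \le \delta_i z_i, \qquad i = 1,\dots,N, \\
                  & z_i \in \{0, 1\}, \qquad i = 1,\dots,N.
\end{aligned}
```

The binary variables $z_i$ are what turn an ordinary quadratic program into an MIQP, which is why heuristics are typically used once $N$ grows.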
C-VARC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models
Authors: Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
First: 2025-06-02T09:56:59+00:00 · Latest: 2026-01-02T17:58:14+00:00
Abstract
Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Value Rule Corpus (C-VARC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results demonstrate that scenarios guided by C-VARC exhibit clearer value boundaries and greater content diversity compared to those produced through direct generation. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred C-VARC generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with C-VARC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics.
中文标题/摘要
标题:C-VARC:一种大规模中文价值规则语料库,用于大型语言模型的价值对齐
确保大型语言模型(LLMs)与主流人类价值观和伦理规范一致,对于人工智能的安全和可持续发展至关重要。当前的价值评估和对齐受到西方文化偏见的限制,依赖于不完整的国内框架和非本土规则;此外,缺乏可扩展的基于规则的场景生成方法使得评估成本高昂且在不同文化背景下不够全面。为应对这些挑战,我们提出了一种基于核心中文价值观的分层价值框架,涵盖三个主要维度、12个核心价值观和50个衍生价值观。基于此框架,我们构建了一个包含超过250,000条价值规则的大型中文价值规则语料库(C-VARC),并通过人工注释进行增强和扩展。实验结果表明,由C-VARC指导的场景在价值边界清晰度和内容多样性方面优于直接生成的场景。在对六个敏感主题(例如代孕、自杀)的评估中,七种主流LLM中有超过70.5%的情况偏好C-VARC生成的选项,而五名中国人工注释者中有87.5%与C-VARC保持一致,证实了其普适性、文化相关性和与中国价值观的强烈一致性。此外,我们构建了400,000个基于规则的道德困境场景,客观地捕捉了17种LLM在冲突价值优先级上的细微差异。我们的工作建立了一个适应文化的基准框架,用于全面的价值评估和对齐,体现了中国特色。
uGMM-NN: Univariate Gaussian Mixture Model Neural Network
Authors: Zakeria Sharif Ali
First: 2025-09-09T10:13:37+00:00 · Latest: 2026-01-02T17:57:38+00:00
Comments: 12 pages, 3 figures
Abstract
This paper introduces the Univariate Gaussian Mixture Model Neural Network (uGMM-NN), a novel neural architecture that embeds probabilistic reasoning directly into the computational units of deep networks. Unlike traditional neurons, which apply weighted sums followed by fixed non-linearities, each uGMM-NN node parameterizes its activations as a univariate Gaussian mixture, with learnable means, variances, and mixing coefficients. This design enables richer representations by capturing multimodality and uncertainty at the level of individual neurons, while retaining the scalability of standard feed-forward networks. We demonstrate that uGMM-NN can achieve competitive discriminative performance compared to conventional multilayer perceptrons, while additionally offering a probabilistic interpretation of activations. The proposed framework provides a foundation for integrating uncertainty-aware components into modern neural architectures, opening new directions for both discriminative and generative modeling.
中文标题/摘要
标题:uGMM-NN:单变量高斯混合模型神经网络
本文介绍了单变量高斯混合模型神经网络(uGMM-NN),这是一种新颖的神经架构,直接将概率推理嵌入到深度网络的计算单元中。与传统的神经元不同,后者应用加权和后跟固定非线性操作,每个uGMM-NN节点将其激活参数化为单变量高斯混合,具有可学习的均值、方差和混合系数。这种设计通过在神经元级别捕获多模态性和不确定性,实现了更丰富的表示,同时保持了标准前馈网络的可扩展性。我们证明,uGMM-NN在判别性能方面可以与传统的多层感知机竞争,同时还可以提供激活的概率解释。所提出的框架为将不确定性感知组件集成到现代神经架构中提供了基础,为判别和生成建模开辟了新的方向。
Summary / 总结
The paper presents uGMM-NN, a neural network architecture that integrates probabilistic reasoning into its nodes. Unlike conventional neurons, uGMM-NN nodes parameterize activations using univariate Gaussian mixtures, allowing for multimodal and uncertain representations. Experiments show that uGMM-NN achieves comparable performance to traditional multilayer perceptrons while providing a probabilistic interpretation of activations, making it suitable for uncertainty-aware modeling in both discriminative and generative tasks.
该研究引入了uGMM-NN,一种将概率推理集成到神经元中的网络,能够实现更丰富的表示和不确定性捕捉。与传统神经元不同,uGMM-NN节点使用单变量高斯混合来参数化激活。实验表明,uGMM-NN在性能上与传统的多层感知机相当,同时提供了激活的概率解释,使其成为不确定性感知神经架构的基础。
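To make the idea of a mixture-valued unit concrete, the sketch below evaluates the log-density of a univariate Gaussian mixture with learnable means, variances, and mixing coefficients, using log-sum-exp for numerical stability. How uGMM-NN feeds layer inputs into each node is not stated in the abstract, so this is only the density computation under assumed parameterizations (log-variances and softmax mixing logits); the function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log of softmax-normalized mixing weights."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def ugmm_log_density(x, means, log_vars, mix_logits):
    """log sum_k pi_k * N(x; mu_k, sigma_k^2) for a scalar input x."""
    log_pi = log_softmax(mix_logits)
    terms = []
    for mu, lv, lp in zip(means, log_vars, log_pi):
        # log N(x; mu, sigma^2) with sigma^2 = exp(lv)
        log_normal = -0.5 * (math.log(2 * math.pi) + lv + (x - mu) ** 2 / math.exp(lv))
        terms.append(lp + log_normal)
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

Parameterizing variances through their logarithm and weights through logits keeps both valid under unconstrained gradient updates, which is a common choice rather than a detail confirmed by the source.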
RadarPLM: Adapting Pre-trained Language Models for Marine Radar Target Detection by Selective Fine-tuning
Authors: Qiying Hu, Yaowen Li, Xueqian Wang, Linping Zhang, Junlong Ke, Gang Li, Yu Liu, You He
First: 2025-09-15T16:16:57+00:00 · Latest: 2026-01-02T17:57:13+00:00
Abstract
Recent advances in pre-trained language models (PLMs) have demonstrated their capabilities in capturing universal knowledge, making them promising for radar signal processing applications. Nevertheless, directly fine-tuning PLMs on radar signals is both computationally expensive and prone to overfitting, particularly in low signal-to-clutter ratio (SCR) environments. In this paper, we propose a novel fine-tuning framework for PLM-based marine radar target detection. First, we design a lightweight adaptation module, enabling computationally efficient fine-tuning while preserving the pre-trained model's general knowledge. Second, a novel preference-aware loss is developed to selectively optimize different feature patches based on their online-evaluated learning values, guiding the model to concentrate on generalizable feature patterns during optimization. Finally, a binary classification head is retrained based on an autoencoder network to further enhance detection performance. Experiments on real-world radar data show that the proposed RadarPLM framework yields at least a 6.35% improvement in detection performance over existing networks under low SCR conditions. In particular, when training samples are scarce, RadarPLM also achieves a significant advantage over existing networks owing to the incorporation of the PLM.
中文标题/摘要
标题:RadarPLM:通过选择性微调适应预训练语言模型进行海洋雷达目标检测
预训练语言模型(PLMs)的最新进展表明,它们在捕捉通用知识方面的能力使其在雷达信号处理应用中具有前景。然而,直接在雷达信号上微调PLMs既计算成本高昂又容易过拟合,特别是在低信号与杂波比(SCR)环境中。本文提出了一种基于PLM的海洋雷达目标检测的新微调框架。首先,我们设计了一个轻量级的适应模块,使微调计算高效的同时保留预训练模型的通用知识。其次,开发了一种新颖的偏好感知损失,根据在线评估的学习值选择性地优化不同的特征片段,引导模型在优化过程中集中于那些可泛化的特征模式。最后,基于自编码网络重新训练二元分类头以进一步提高检测性能。实验表明,在低SCR条件下,所提出的RadarPLM框架在检测性能上至少比现有网络提高了6.35%。特别是在小训练样本情况下,由于引入了PLM,所提出的RadarPLM也比现有网络具有显著优势。
Summary / 总结
This paper introduces RadarPLM, a fine-tuning framework for pre-trained language models (PLMs) in marine radar target detection. It combines a lightweight adaptation module for efficient fine-tuning with a preference-aware loss that selectively optimizes feature patches, focusing on generalizable patterns. The framework also retrains a binary classification head based on an autoencoder network. Experiments on real-world radar data show at least a 6.35% improvement in detection performance over existing networks under low signal-to-clutter ratio conditions, and a significant advantage when training samples are scarce.
论文提出了一种名为RadarPLM的预训练语言模型(PLM)在海洋雷达目标检测中的微调框架。该框架引入了一个轻量级的适应模块以实现高效的微调,并开发了一种偏好感知损失来选择性地优化特征片段。此外,还基于自编码网络重新训练了一个二分类头。实验结果显示,RadarPLM在低信号-杂波比条件下将检测性能提高了至少6.35%,并且在小训练样本情况下表现更优。
Clustering by Denoising: Latent plug-and-play diffusion for single-cell data
Authors: Dominik Meier, Shixing Yu, Sagnik Nandy, Promit Ghosal, Kyra Gan
First: 2025-10-26T21:03:56+00:00 · Latest: 2026-01-02T17:32:29+00:00
Abstract
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique "input-space steering" ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.
中文标题/摘要
标题:去噪聚类:潜在空间插件式扩散方法在单细胞数据中的应用
单细胞RNA测序(scRNA-seq)使细胞异质性的研究成为可能。然而,由于测量噪声和生物变异性,聚类准确性和基于细胞标签的下游分析仍然具有挑战性。在标准潜在空间(例如通过PCA获得)中,不同细胞类型的数据可以被投影得非常接近,这使得准确的聚类变得困难。我们提出了一种潜在空间插件式扩散框架,将观测空间和去噪空间分离。这种分离通过一种新颖的吉布斯采样程序实现:学习到的扩散先验在低维潜在空间中应用于去噪,同时通过将噪声重新引入原始高维观测空间来引导这一过程。这种独特的“输入空间引导”确保了去噪轨迹忠实于原始数据结构。我们的方法具有三个关键优势:(1) 通过可调的先验和观测数据之间的平衡来适应噪声处理;(2) 通过为下游分析提供合理的不确定性估计来进行不确定性量化;(3) 通过利用干净的参考数据去噪更嘈杂的数据集,并通过平均化提高质量,超越训练集。我们在合成数据和真实单细胞基因组数据上评估了鲁棒性。我们的方法在不同噪声水平和数据集变化下提高了合成数据的聚类准确性。在真实世界的单细胞数据中,我们的方法展示了在结果细胞簇中更好的生物学一致性,簇边界更好地与已知细胞类型标记和发育轨迹对齐。
Summary / 总结
The paper addresses the challenge of accurate clustering in single-cell RNA sequencing data due to measurement noise and biological variability. It introduces a latent plug-and-play diffusion framework that separates the observation and denoising space using a novel Gibbs sampling procedure. This method enhances clustering accuracy by reintroducing noise into the original high-dimensional space to steer the denoising process in the latent space, offering adaptive noise handling, uncertainty quantification, and generalizable denoising. Experiments on synthetic and real data show improved clustering accuracy and better alignment with known cell type markers and developmental trajectories.
论文旨在解决单细胞RNA测序数据中由于测量噪声和生物变异性导致的准确聚类难题。它提出了一种潜空间插件式扩散框架,将观测空间和去噪空间分离。通过在低维潜空间中应用学习到的扩散先验,并在高维观测空间中重新引入噪声,该方法确保去噪过程保持对原始数据结构的忠实性。关键实验结果表明,该方法在不同噪声水平和数据集变化的合成数据上提高了聚类准确性,并在真实单细胞数据的聚类中展示了更好的生物学一致性,聚类边界更好地与已知的细胞类型标记和发育轨迹对齐。
Adaptive Learning Guided by Bias-Noise-Alignment Diagnostics
Authors: Akash Samanta, Sheldon Williamson
First: 2025-12-30T19:57:52+00:00 · Latest: 2026-01-02T17:32:09+00:00
Comments: This preprint focuses on the theoretical framework and diagnostic behavior. Comprehensive experimental validation in application-specific settings is deferred to a companion experimental study
Abstract
Learning systems deployed in nonstationary and safety-critical environments often suffer from instability, slow convergence, or brittle adaptation when learning dynamics evolve over time. While modern optimization, reinforcement learning, and meta-learning methods adapt to gradient statistics, they largely ignore the temporal structure of the error signal itself. This paper proposes a diagnostic-driven adaptive learning framework that explicitly models error evolution through a principled decomposition into bias, capturing persistent drift; noise, capturing stochastic variability; and alignment, capturing repeated directional excitation leading to overshoot. These diagnostics are computed online from lightweight statistics of loss or temporal-difference (TD) error trajectories and are independent of model architecture or task domain. We show that the proposed bias-noise-alignment decomposition provides a unifying control backbone for supervised optimization, actor-critic reinforcement learning, and learned optimizers. Within this framework, we introduce three diagnostic-driven instantiations: the Human-inspired Supervised Adaptive Optimizer (HSAO), Hybrid Error-Diagnostic Reinforcement Learning (HED-RL) for actor-critic methods, and the Meta-Learned Learning Policy (MLLP). Under standard smoothness assumptions, we establish bounded effective updates and stability properties for all cases. Representative diagnostic illustrations in actor-critic learning highlight how the proposed signals modulate adaptation in response to TD error structure. Overall, this work elevates error evolution to a first-class object in adaptive learning and provides an interpretable, lightweight foundation for reliable learning in dynamic environments.
中文标题/摘要
标题:基于偏差-噪声-对齐诊断的自适应学习
部署在非平稳和安全关键环境中的学习系统往往在动态变化的动力学学习过程中遭受不稳定性、收敛缓慢或脆弱适应的问题。尽管现代优化、强化学习和元学习方法适应梯度统计,但它们很大程度上忽略了误差信号本身的时间结构。本文提出了一种诊断驱动的自适应学习框架,通过原理性的分解将误差演化显式建模为偏差、捕获持久漂移;噪声、捕获随机变异性;以及对齐、捕获重复的方向性激励导致的超调。这些诊断在线从损失或时差(TD)误差轨迹的轻量级统计中计算得出,并且与模型架构或任务领域无关。我们展示了所提出的偏差-噪声-对齐分解为监督优化、演员-评论家强化学习和学习优化器提供了一个统一的控制框架。在此框架内,我们介绍了三种诊断驱动的实例:人类启发的监督自适应优化器(HSAO)、混合误差-诊断强化学习(HED-RL)用于演员-评论家方法以及元学习学习策略(MLLP)。在标准平滑性假设下,我们为所有情况建立了有界有效更新和稳定性属性。代表性的诊断示例在演员-评论家学习中突出显示了所提出信号如何根据TD误差结构调节适应。总体而言,这项工作将误差演化提升为自适应学习中的一等对象,并为动态环境中的可靠学习提供了一个可解释的、轻量级的基础。
Summary / 总结
This paper introduces an adaptive learning framework that addresses the instability and slow convergence of learning systems in nonstationary environments. It proposes a diagnostic-driven approach that decomposes error evolution into bias, noise, and alignment, which are computed online from loss or temporal-difference error trajectories. The framework includes three diagnostic-driven instantiations: HSAO for supervised optimization, HED-RL for actor-critic methods, and MLLP for learned optimizers. Theoretical analysis shows bounded effective updates and stability properties under smoothness assumptions. The work provides an interpretable and lightweight foundation for reliable learning in dynamic environments.
该论文提出了一种诊断驱动的自适应学习框架,通过将误差演变分解为偏差、噪声和对齐来解决非平稳环境中的不稳定性问题。该框架在线计算这些诊断指标并应用于监督优化、强化学习和学习优化器。主要发现包括引入了HSAO、HED-RL和MLLP,这些方法在光滑性假设下显示出了有界的有效更新和稳定性,并且展示了根据TD误差结构如何调节适应性。
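The abstract states that the bias, noise, and alignment diagnostics are computed online from lightweight statistics of loss or TD-error trajectories, but it gives no formulas. The following is a minimal sketch of one possible reading, not the authors' definitions: bias is taken as an exponential moving average (EMA) of the error, noise as the EMA variance around that average, and alignment as an EMA of sign agreement between consecutive errors. The class name `ErrorDiagnostics`, the `decay` parameter, and all three estimators are illustrative assumptions.

```python
class ErrorDiagnostics:
    """Hypothetical online bias/noise/alignment tracker (not the paper's formulas)."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.bias = 0.0     # EMA of the error: stands in for persistent drift
        self.second = 0.0   # EMA of the squared error, used for the variance
        self.align = 0.0    # EMA of sign agreement between consecutive errors
        self.prev = None    # previous error value, None before the first update

    def update(self, err):
        """Consume one error value and return (bias, noise, alignment)."""
        d = self.decay
        self.bias = d * self.bias + (1 - d) * err
        self.second = d * self.second + (1 - d) * err * err
        if self.prev is not None:
            product = err * self.prev
            # +1 if the error kept its sign, -1 if it flipped, 0 if either is zero
            agree = 1.0 if product > 0 else (-1.0 if product < 0 else 0.0)
            self.align = d * self.align + (1 - d) * agree
        self.prev = err
        noise = max(self.second - self.bias ** 2, 0.0)  # EMA variance estimate
        return self.bias, noise, self.align


# Usage: a constant error looks like pure drift, an alternating one like noise.
drift = ErrorDiagnostics()
for _ in range(200):
    drift_stats = drift.update(1.0)

flip = ErrorDiagnostics()
for step in range(200):
    flip_stats = flip.update(1.0 if step % 2 == 0 else -1.0)

print(drift_stats, flip_stats)
```

Under these assumed estimators, a steady error drives the bias term up while the noise term stays near zero, and a sign-flipping error does the opposite. A controller could use that contrast to decide whether to speed up or damp adaptation.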
Memory Bank Compression for Continual Adaptation of Large Language Models
Authors: Thomas Katraouras, Dimitrios Rafailidis
First: 2026-01-02T17:22:34+00:00 · Latest: 2026-01-02T17:22:34+00:00
Comments: Accepted to the 41st ACM/SIGAPP Symposium on Applied Computing (SAC '26)
Abstract
Large Language Models (LLMs) have become a mainstay for many everyday applications. However, as data evolve their knowledge quickly becomes outdated. Continual learning aims to update LLMs with new information without erasing previously acquired knowledge. Although methods such as full fine-tuning can incorporate new data, they are computationally expensive and prone to catastrophic forgetting, where prior knowledge is overwritten. Memory-augmented approaches address this by equipping LLMs with a memory bank, that is an external memory module which stores information for future use. However, these methods face a critical limitation, in particular, the memory bank constantly grows in the real-world scenario when large-scale data streams arrive. In this paper, we propose MBC, a model that compresses the memory bank through a codebook optimization strategy during online adaptation learning. To ensure stable learning, we also introduce an online resetting mechanism that prevents codebook collapse. In addition, we employ Key-Value Low-Rank Adaptation in the attention layers of the LLM, enabling efficient utilization of the compressed memory representations. Experiments with benchmark question-answering datasets demonstrate that MBC reduces the memory bank size to 0.3% when compared against the most competitive baseline, while maintaining high retention accuracy during online adaptation learning. Our code is publicly available at https://github.com/Thomkat/MBC.
中文标题/摘要
标题:大型语言模型持续适应的内存银行压缩
大型语言模型(LLMs)已成为许多日常应用的支柱。然而,随着数据的演变,其知识迅速变得过时。持续学习旨在更新LLMs以获取新信息,而不抹去之前获得的知识。尽管全微调等方法可以纳入新数据,但它们计算成本高昂且容易发生灾难性遗忘,即先前的知识被覆盖。通过为LLMs配备一个内存银行,即一个外部内存模块来存储未来使用的数据,增强型内存方法解决了这一问题。然而,这些方法面临一个关键限制,特别是在大规模数据流到达的现实场景中,内存银行不断增长。在本文中,我们提出了一种MBC模型,在在线适应学习过程中通过代码本优化策略压缩内存银行。为了确保稳定学习,我们还引入了一种在线重置机制,防止代码本崩溃。此外,我们还在LLM的注意力层中采用键-值低秩适应,使压缩的内存表示能够高效利用。基准问答数据集的实验表明,与最竞争的基线相比,MBC将内存银行的大小压缩到0.3%,同时在在线适应学习过程中保持高保留准确性。我们的代码已公开发布在https://github.com/Thomkat/MBC。
Summary / 总结
This paper addresses the challenge of continual adaptation of Large Language Models (LLMs) by proposing MBC, which compresses the memory bank through codebook optimization and introduces an online resetting mechanism to prevent codebook collapse. MBC also employs Key-Value Low-Rank Adaptation in attention layers to efficiently utilize compressed memory representations. Experiments show that MBC reduces the memory bank size to 0.3% compared to the most competitive baseline while maintaining high retention accuracy during online adaptation learning.
本文提出了一种名为MBC的方法,通过在线适应学习期间的代码本优化来压缩记忆银行,解决了大型语言模型(LLMs)的持续学习挑战。该方法引入了在线重置机制以防止代码本崩溃,并在注意力层中采用了键值低秩适应。实验结果显示,MBC将记忆银行的大小压缩到0.3%,同时在在线适应学习过程中保持了高保留准确性。
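To make the idea of codebook compression with online resetting concrete, here is a minimal sketch in plain Python. It is a generic vector-quantization step and not the MBC implementation. Each memory vector is replaced by the index of its nearest codebook entry, entries drift toward the vectors assigned to them, and entries that were never used are re-seeded from the current batch so the codebook does not collapse onto a few codes. The function names, the learning rate `lr`, and the `min_usage` reset rule are all assumptions made for illustration.

```python
import random


def nearest(codebook, v):
    """Index of the codebook entry closest to v (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], v)),
    )


def compress_step(codebook, usage, batch, lr=0.5, min_usage=1, rng=random):
    """Quantize a batch, update the codes in place, and reset unused codes."""
    # Assignments are computed against the codebook as it was before the update.
    ids = [nearest(codebook, v) for v in batch]
    for i, v in zip(ids, batch):
        usage[i] += 1
        # Move the selected code a fraction lr of the way toward the vector.
        codebook[i] = [c + lr * (x - c) for c, x in zip(codebook[i], v)]
    for i in range(len(codebook)):
        if usage[i] < min_usage:
            # Hypothetical reset rule: re-seed a dead code from the batch.
            codebook[i] = list(rng.choice(batch))
    return ids  # the memory bank would store these indices, not the vectors


codebook = [[0.0, 0.0], [100.0, 100.0]]
usage = [0, 0]
batch = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
ids = compress_step(codebook, usage, batch, rng=random.Random(0))
print(ids, codebook)
```

In this toy run the second code starts far from all the data, is never selected, and is therefore re-seeded from the batch. This is the failure mode a resetting mechanism is meant to catch.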
The Reasoning-Creativity Trade-off: Toward Creativity-Driven Problem Solving
Authors: Max Ruiz Luyten, Mihaela van der Schaar
First: 2026-01-02T17:10:31+00:00 · Latest: 2026-01-02T17:10:31+00:00
Comments: 56 pages, 9 figures, submitted to Twenty-Ninth Annual Conference on Artificial Intelligence and Statistics
Abstract
State-of-the-art large language model (LLM) pipelines rely on bootstrapped reasoning loops: sampling diverse chains of thought and reinforcing the highest-scoring ones, mainly optimizing correctness. We analyze how this design choice is sensitive to the collapse of the model's distribution over reasoning paths, slashing semantic entropy and undermining creative problem-solving. To analyze this failure, we introduce Distributional Creative Reasoning (DCR), a unified variational objective that casts training as gradient flow through probability measures on solution traces. STaR, GRPO, and DPO, as well as entropy bonuses, and other methods, all constitute special cases of the same loss. The framework delivers three core results: (i) the diversity decay theorem, describing how correctness-based objectives lead to distinct modes of diversity decay for STaR, GRPO, and DPO; (ii) designs that ensure convergence to a stable and diverse policy, effectively preventing collapse; and (iii) simple, actionable recipes to achieve this in practice. DCR thus offers the first principled recipe for LLMs that remain both correct and creative.
中文标题/摘要
标题:推理-创造力权衡:朝向创造力驱动的问题解决
当前最先进的大型语言模型(LLM)流水线依赖于自举推理循环:采样多样性的思维链并强化得分最高的,主要优化正确性。我们分析了这种设计选择对模型推理路径分布坍缩的敏感性,削减了语义熵并削弱了创造性问题解决。为了分析这种失败,我们引入了分布性创造性推理(DCR),这是一种统一的变分目标,将训练视为通过解空间概率测度的梯度流。STaR、GRPO和DPO,以及熵奖励,以及其他方法,都是相同损失的特殊情况。该框架提供了三个核心结果:(i)正确性目标导致STaR、GRPO和DPO不同模式的多样性衰减的多样性衰减定理;(ii)确保收敛到稳定且多样策略的设计,有效防止坍缩;以及(iii)实现这一点的简单、可操作的食谱。因此,DCR提供了第一个原理性的食谱,使LLM保持正确性和创造性。
Summary / 总结
The paper addresses the trade-off between reasoning and creativity in large language models (LLMs), focusing on how correctness-driven optimization can limit creativity. It introduces Distributional Creative Reasoning (DCR) as a unified variational objective that enhances diversity in reasoning paths. Key findings include the diversity decay theorem, which explains how correctness-focused methods reduce diversity, and designs that prevent this collapse, ensuring both correctness and creativity in LLMs.
论文探讨了大型语言模型(LLM)在推理与创造力之间的权衡,关注如何以正确性为导向的优化会限制创造性问题解决。引入了分布性创造性推理(DCR),一种统一的变分目标,来分析和缓解这一问题。关键发现包括多样性衰减定理,解释了正确性导向的目标如何减少多样性,并提出了确保稳定和多样化策略的设计,以防止模型坍缩。提供了保持正确性和创造性的实用方法。
Modeling the One-to-Many Property in Open-Domain Dialogue with LLMs
Authors: Jing Yang Lee, Kong-Aik Lee, Woon-Seng Gan
First: 2025-06-18T04:19:33+00:00 · Latest: 2026-01-02T17:03:31+00:00
Abstract
Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models.
中文标题/摘要
标题:使用大语言模型在开放域对话中建模一到多属性
开放域对话(OD)表现出一到多(o2m)属性,即对于一个对话背景,存在多个合适的回应。尽管先前的研究表明建模这种属性可以提升回应的多样性,但大多数基于大语言模型的对话代理并未明确这样做。在本文中,我们通过将OD生成分解为两个关键任务——多回应生成(MRG)和基于偏好的选择(PS)来在大语言模型中建模OD的o2m属性:为给定的对话背景生成一组n个在语义和词汇上多样化且高质量的回应,然后基于人类偏好选择一个回应。为了促进MRG和PS,我们引入了o2mDial,这是一个明确设计用于捕捉o2m属性的对话语料库,每个背景都有多个合理的回应。利用o2mDial,我们提出了新的上下文学习和指令调优策略,以及用于MRG的新评价指标,并提出了一种基于模型的方法来实现PS。实验证明,将提出的两阶段框架应用于较小的大语言模型进行OD生成,可以提高整体回应的多样性,同时保持上下文连贯性,将回应质量提高高达90%,使其更接近大型模型的性能。
Summary / 总结
This study addresses the one-to-many (o2m) property of open-domain dialogue, where multiple appropriate responses exist for a single context. To model this property, the authors decompose dialogue generation into Multi-Response Generation (MRG) and Preference-based Selection (PS). They introduce o2mDial, a dialogue corpus with multiple plausible responses for each context, and propose new in-context learning and instruction-tuning strategies, evaluation metrics for MRG, and a model-based approach for PS. Empirical results show that applying the two-stage framework to smaller LLMs enhances response diversity while maintaining contextual coherence, improving response quality by up to 90% and bringing them closer to the performance of larger models.
该研究通过将生成过程分解为多响应生成(MRG)和偏好选择(PS)来解决开放领域对话的一对多(o2m)特性。作者引入了o2mDial对话数据集,每个上下文包含多个合理的响应,并提出了新的上下文学习和指令调优策略。实验证明,这种两阶段框架可以提高响应多样性,增强上下文连贯性,并将响应质量提高高达90%,使其接近更大模型的表现。
An Agentic Framework for Neuro-Symbolic Programming
Authors: Aliakbar Nafar, Chetan Chigurupati, Danial Kamali, Hamid Karimian, Parisa Kordjamshidi
First: 2026-01-02T16:59:39+00:00 · Latest: 2026-01-02T16:59:39+00:00
Abstract
Integrating symbolic constraints into deep learning models could make them more robust, interpretable, and data-efficient. Still, it remains a time-consuming and challenging task. Existing frameworks like DomiKnowS help this integration by providing a high-level declarative programming interface, but they still assume the user is proficient with the library's specific syntax. We propose AgenticDomiKnowS (ADS) to eliminate this dependency. ADS translates free-form task descriptions into a complete DomiKnowS program using an agentic workflow that creates and tests each DomiKnowS component separately. The workflow supports optional human-in-the-loop intervention, enabling users familiar with DomiKnowS to refine intermediate outputs. We show how ADS enables experienced DomiKnowS users and non-users to rapidly construct neuro-symbolic programs, reducing development time from hours to 10-15 minutes.
中文标题/摘要
标题:一种代理框架下的神经符号编程
将符号约束整合到深度学习模型中可以使模型更加稳健、可解释和数据高效。然而,这一过程仍然耗时且具有挑战性。现有的框架如DomiKnowS通过提供高层次的声明式编程接口来帮助这一整合,但它们仍然假设用户熟悉该库的特定语法。我们提出了一种代理DomiKnowS(ADS)来消除这种依赖。ADS通过代理工作流将自由形式的任务描述翻译成完整的DomiKnowS程序,该工作流分别创建并测试每个DomiKnowS组件。工作流支持可选的人机交互干预,使熟悉DomiKnowS的用户能够细化中间输出。我们展示了ADS如何使有经验的DomiKnowS用户和非用户能够快速构建神经符号程序,将开发时间从数小时缩短到10-15分钟。
Summary / 总结
The paper targets the time-consuming task of integrating symbolic constraints into deep learning models, which could make them more robust, interpretable, and data-efficient. It introduces AgenticDomiKnowS (ADS), an agentic workflow that translates free-form task descriptions into complete DomiKnowS programs, creating and testing each component separately, so that users no longer need to know the library's specific syntax. Optional human-in-the-loop intervention lets users familiar with DomiKnowS refine intermediate outputs. ADS enables both experienced DomiKnowS users and non-users to construct neuro-symbolic programs rapidly, reducing development time from hours to 10-15 minutes.
研究旨在将符号约束集成到深度学习模型中,以提高其鲁棒性、可解释性和数据效率。提出的AgenticDomiKnowS (ADS)框架能够从自由形式的任务描述自动生成DomiKnowS程序,并允许用户通过人机交互干预来细化中间输出。关键发现表明,ADS将开发时间从数小时缩短到10-15分钟,适用于有经验的用户和无经验的用户。
QUITE: A Query Rewrite System Beyond Rules with LLM Agents
Authors: Yuyang Song, Hanxu Yan, Jiale Lao, Yibo Wang, Yufei Li, Yuanchun Zhou, Jianguo Wang, Mingjie Tang
First: 2025-06-09T11:51:27+00:00 · Latest: 2026-01-02T16:51:25+00:00
Abstract
Query rewrite transforms SQL queries into semantically equivalent forms that run more efficiently. Existing approaches mainly rely on predefined rewrite rules, but they handle a limited subset of queries and can cause performance regressions. This limitation stems from three challenges of rule-based query rewrite: (1) it is hard to discover and verify new rules, (2) fixed rewrite rules do not generalize to new query patterns, and (3) some rewrite techniques cannot be expressed as fixed rules. Motivated by the fact that human experts exhibit significantly better rewrite ability but suffer from scalability, and Large Language Models (LLMs) have demonstrated nearly human-level semantic and reasoning abilities, we propose a new approach of using LLMs to rewrite SQL queries beyond rules. Due to the hallucination problems in LLMs, directly applying LLMs often leads to nonequivalent and suboptimal queries. To address this issue, we propose QUITE (query rewrite), a training-free and feedback-aware system based on LLM agents that rewrites SQL queries into semantically equivalent forms with significantly better performance, covering a broader range of query patterns and rewrite strategies compared to rule-based methods. Firstly, we design a multi-agent framework controlled by a finite state machine (FSM) to equip LLMs with the ability to use external tools and enhance the rewrite process with real-time database feedback. Secondly, we develop a rewrite middleware to enhance the ability of LLMs to generate optimized query equivalents. Finally, we employ a novel hint injection technique to improve execution plans for rewritten queries. Extensive experiments show that QUITE reduces query execution time by up to 35.8% over state-of-the-art approaches and produces 24.1% more rewrites than prior methods, covering query cases that earlier systems did not handle.
中文标题/摘要
标题:QUITE:超越规则的LLM代理查询重写系统
查询重写将SQL查询转换为语义等效形式,以更高效地运行。现有方法主要依赖预定义的重写规则,但只能处理查询的有限子集,并可能导致性能倒退。这种限制源于基于规则的查询重写三个挑战:(1)发现和验证新规则困难,(2)固定的重写规则不能泛化到新的查询模式,(3)一些重写技术无法用固定规则表达。鉴于人类专家在重写方面表现出显著的能力,但面临可扩展性问题,以及大型语言模型(LLMs)在语义和推理能力方面几乎达到人类水平,我们提出了一种新的方法,利用LLMs超越规则重写SQL查询。由于LLMs存在幻觉问题,直接应用LLMs往往会导致非等效和次优查询。为解决这一问题,我们提出了QUITE(查询重写),一种基于LLM代理的无需训练且反馈感知的系统,能够将SQL查询转换为语义等效形式,性能显著提高,涵盖比基于规则方法更广泛的查询模式和重写策略。首先,我们设计了一个由有限状态机(FSM)控制的多代理框架,使LLMs能够使用外部工具,并通过实时数据库反馈增强重写过程。其次,我们开发了一种重写中间件,以增强LLMs生成优化查询等效物的能力。最后,我们采用了一种新颖的提示注入技术,以改进重写查询的执行计划。广泛实验表明,QUITE将查询执行时间减少了高达35.8%,并产生了比先前方法多24.1%的重写,涵盖了早期系统未处理的查询案例。
Summary / 总结
The paper proposes QUITE, a system that uses LLM agents to rewrite SQL queries beyond predefined rules, addressing the limitations of rule-based methods. It introduces a multi-agent framework with real-time database feedback and a rewrite middleware to enhance query optimization. QUITE significantly reduces query execution time by up to 35.8% and generates 24.1% more rewrites than previous methods, covering a broader range of query patterns and strategies.
论文提出QUITE系统,利用LLM代理超越预定义规则重写SQL查询,解决基于规则的方法的局限性。QUITE采用具有实时数据库反馈的多代理框架和重写中间件来增强查询优化能力。实验结果显示,QUITE将查询执行时间最多减少35.8%,并生成比先前方法多24.1%的重写,涵盖早期系统无法处理的查询案例。
Exploring the Performance of Large Language Models on Subjective Span Identification Tasks
Authors: Alphaeus Dmonte, Roland Oruche, Tharindu Ranasinghe, Marcos Zampieri, Prasad Calyam
First: 2026-01-02T16:30:14+00:00 · Latest: 2026-01-02T16:30:14+00:00
Abstract
Identifying relevant text spans is important for several downstream tasks in NLP, as it contributes to model explainability. While most span identification approaches rely on relatively smaller pre-trained language models like BERT, a few recent approaches have leveraged the latest generation of Large Language Models (LLMs) for the task. Current work has focused on explicit span identification like Named Entity Recognition (NER), while more subjective span identification with LLMs in tasks like Aspect-based Sentiment Analysis (ABSA) has been underexplored. In this paper, we fill this important gap by presenting an evaluation of the performance of various LLMs on text span identification in three popular tasks, namely sentiment analysis, offensive language identification, and claim verification. We explore several LLM strategies like instruction tuning, in-context learning, and chain of thought. Our results indicate underlying relationships within text aid LLMs in identifying precise text spans.
中文标题/摘要
标题:大型语言模型在主观文本片段识别任务中的性能探索
识别相关的文本片段对于NLP中的多个下游任务至关重要,因为它有助于模型的可解释性。虽然大多数片段识别方法依赖于相对较小的预训练语言模型(如BERT),但最近有一些方法利用了最新的大型语言模型(LLMs)进行片段识别任务。当前的工作主要集中在显式的片段识别,如命名实体识别(NER),而使用LLMs进行主观片段识别的任务,如方面基于情感分析(ABSA)则被探索不足。在本文中,我们通过评估各种LLMs在情感分析、冒犯语言识别和声明验证这三个流行任务中的文本片段识别性能,填补了这一重要空白。我们探索了几种LLM策略,如指令调优、上下文学习和思维链。我们的结果表明,文本中的潜在关系有助于LLMs识别精确的文本片段。
Summary / 总结
This paper evaluates the performance of large language models (LLMs) on subjective text span identification in three tasks: sentiment analysis, offensive language identification, and claim verification. The study explores LLM strategies including instruction tuning, in-context learning, and chain of thought. The results indicate that underlying relationships within the text help LLMs identify precise text spans.
本文评估了大型语言模型(LLMs)在情感分析、网络用语识别和论断验证等主观文本片段识别任务中的表现。研究探索了包括指令调优、上下文学习和推理链在内的不同LLM策略。结果显示,文本中的上下文有助于LLMs更准确地识别精确的文本片段。
The Curse of Depth in Large Language Models
Authors: Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
Venue: NeurIPS 2025
First: 2025-02-09T07:03:36+00:00 · Latest: 2026-01-02T16:15:39+00:00
Comments: Accepted by NeurIPS 2025
Abstract
In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at https://github.com/lmsdss/LayerNorm-Scaling.
中文标题/摘要
标题:大型语言模型中的深度诅咒
在本文中,我们引入了深度诅咒这一概念,该概念强调并解释了现代大型语言模型(LLMs)中一个最近的观察现象,即近一半的层比预期效果差。我们首先确认了这种现象在Llama、Mistral、DeepSeek和Qwen等最受欢迎的LLM家族中普遍存在。我们的分析,从理论和实证两个方面,确定了LLM中深层层无效的根本原因是广泛使用了预层归一化(Pre-LN)。虽然Pre-LN稳定了Transformer LLM的训练,但其输出方差随着模型深度的增加呈指数增长,这不恰当地导致了深层Transformer块的导数为单位矩阵,因此几乎不参与训练。为了解决这一训练问题,我们提出了层归一化缩放(LNS),该方法通过深度的平方根逆向缩放层归一化的输出方差。这一简单的修改缓解了深层Transformer层的输出方差爆炸问题,提高了它们的贡献。在从130M到7B的各种模型规模下,我们的实验表明,LNS在增强LLM预训练性能方面始终优于之前的归一化和缩放技术。此外,这种改进无缝地延续到了监督微调。所有这些收益都归因于层归一化缩放使深层层在训练期间能够更有效地贡献。我们的代码可在https://github.com/lmsdss/LayerNorm-Scaling找到。
Summary / 总结
This paper introduces the Curse of Depth, a phenomenon observed in Large Language Models (LLMs) where nearly half of the layers are less effective. The authors confirm this issue across various LLMs and identify Pre-Layer Normalization (Pre-LN) as the cause, as it leads to an exponential increase in output variance, making deep layers ineffective. To address this, they propose LayerNorm Scaling (LNS), which scales the variance of the layer normalization inversely by the square root of the depth. Experiments show that LNS improves pre-training performance and fine-tuning across different model sizes, making deeper layers more effective during training.
研究引入了深度诅咒(Curse of Depth)的概念,指出在大型语言模型(LLMs)中,大约一半的层效果不佳。研究确认了这一现象在多种LLM中的普遍存在,并指出预层归一化(Pre-LN)是导致这一问题的原因,因为它会导致输出方差的指数增长,使深层层变得无效。为了解决这个问题,作者提出了层归一化缩放(LNS),通过将层归一化输出的方差逆向缩放为深度的平方根的倒数来缓解这一问题。实验表明,LNS在不同规模的模型中提高了预训练性能,并且也对监督微调有益。
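The LayerNorm Scaling idea can be illustrated with a few lines of plain Python. This is a sketch under a stated assumption: the layer-normalization output at layer index l (counted from 1) is multiplied by 1/sqrt(l), which shrinks its variance by a factor of l. The abstract's wording leaves the exact factor open, so this reading may not match the authors' code at the repository linked above. The helper names and the omission of learned gain and bias are illustrative choices.

```python
import math


def variance(x):
    """Population variance of a list of floats."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def layer_norm(x, eps=1e-5):
    """Plain layer normalization over a 1-D list (no learned gain or bias)."""
    mean = sum(x) / len(x)
    std = math.sqrt(variance(x) + eps)
    return [(v - mean) / std for v in x]


def layer_norm_scaled(x, layer_index, eps=1e-5):
    """LayerNorm followed by an assumed 1/sqrt(layer_index) scale."""
    if layer_index < 1:
        raise ValueError("layer_index is assumed to start at 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]


# Deeper layers receive a smaller-variance normalized signal.
x = [0.5, -1.2, 3.0, 0.7]
for depth in (1, 4, 16):
    print(depth, variance(layer_norm_scaled(x, depth)))
```

Because the scale depends only on the layer index, it adds no trainable parameters. In a Pre-LN Transformer block it would be applied wherever the normalized activations feed the attention or feed-forward sublayer.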
Grading Handwritten Engineering Exams with Multimodal Large Language Models
Authors: Janez Perš, Jon Muhovič, Andrej Košir, Boštjan Murovec
First: 2026-01-02T16:10:08+00:00 · Latest: 2026-01-02T16:10:08+00:00
Comments: 10 pages, 5 figures, 2 tables. Supplementary material available at https://lmi.fe.uni-lj.si/en/janez-pers-2/supplementary-material/
Abstract
Handwritten STEM exams capture open-ended reasoning and diagrams, but manual grading is slow and difficult to scale. We present an end-to-end workflow for grading scanned handwritten engineering quizzes with multimodal large language models (LLMs) that preserves the standard exam process (A4 paper, unconstrained student handwriting). The lecturer provides only a handwritten reference solution (100%) and a short set of grading rules; the reference is converted into a text-only summary that conditions grading without exposing the reference scan. Reliability is achieved through a multi-stage design with a format/presence check to prevent grading blank answers, an ensemble of independent graders, supervisor aggregation, and rigid templates with deterministic validation to produce auditable, machine-parseable reports. We evaluate the frozen pipeline in a clean-room protocol on a held-out real course quiz in Slovenian, including hand-drawn circuit schematics. With state-of-the-art backends (GPT-5.2 and Gemini-3 Pro), the full pipeline achieves $\approx$8-point mean absolute difference to lecturer grades with low bias and an estimated manual-review trigger rate of $\approx$17% at $D_{\max}=40$. Ablations show that trivial prompting and removing the reference solution substantially degrade accuracy and introduce systematic over-grading, confirming that structured prompting and reference grounding are essential.
中文标题/摘要
标题:使用多模态大型语言模型批改工程考试手写试卷
手写STEM考试能够捕捉开放性推理和图表,但人工批改速度慢且难以扩展。我们提出了一种端到端的工作流,使用多模态大型语言模型(LLMs)批改扫描的手写工程测验,同时保留标准考试流程(A4纸,不受约束的学生手写)。讲师仅提供一份手写参考答案(100%)和一组简短的评分规则;参考答案被转换为仅包含文本的摘要,用于条件评分而不暴露参考扫描。通过多阶段设计实现可靠性,包括格式/存在检查以防止批改空白答案,独立评分员的集成,监督员汇总,以及严格的模板和确定性验证以生成可审计、机器可解析的报告。我们使用洁净室协议在斯洛文尼亚的一门课程测验上评估冻结的工作流,包括手绘电路图。使用最先进的后端(GPT-5.2和Gemini-3 Pro),完整的工作流在讲师评分上的平均绝对差异约为8分,偏差低,估计的手动复查触发率为约17%(在Dmax=40时)。消融实验表明,简单的提示和移除参考答案会显著降低准确度并引入系统性高估,确认结构化提示和参考定位是必不可少的。
Summary / 总结
The paper addresses the challenge of grading handwritten engineering exams, which are difficult to scale due to manual grading. It proposes an end-to-end workflow using multimodal large language models to automate the process while maintaining the standard exam format. The system relies on a handwritten reference solution and grading rules to condition the model, ensuring reliability through multi-stage design and ensemble grading. Evaluation on a real Slovenian course quiz shows that the pipeline achieves a mean absolute difference of about 8 points from lecturer grades with low bias and a 17% manual-review trigger rate. Ablations demonstrate the importance of structured prompting and reference grounding for accuracy.
该研究通过开发使用多模态大型语言模型的端到端工作流,解决了批改手写工程考试的挑战。方法侧重于手写参考答案和评分规则,系统通过将参考答案转换为文本摘要并使用多阶段检查来确保可靠性。实验结果显示,该流程在与讲师评分的绝对差异约为8分,偏差较低,并且在最大差异为40时,需要手动复查的比例约为17%。消融研究显示,结构化提示和参考答案的定位对于提高准确性至关重要。
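The ensemble-plus-review-trigger step can be sketched as follows. The abstract does not define how the supervisor aggregates grades or what exactly D_max measures, so this example makes two assumptions: the aggregate is the median of the independent grader scores, and a manual review is triggered when the spread between the highest and lowest score exceeds `d_max`. The function name and the returned fields are hypothetical.

```python
from statistics import median


def aggregate_grades(grades, d_max=40.0):
    """Combine independent grader scores (0-100) and flag large disagreement."""
    if not grades:
        raise ValueError("at least one grade is required")
    spread = max(grades) - min(grades)
    return {
        "grade": median(grades),          # assumed aggregation rule
        "spread": spread,
        "needs_review": spread > d_max,   # assumed meaning of D_max
    }


print(aggregate_grades([70, 75, 80]))
print(aggregate_grades([20, 75, 80]))
```

A deterministic rule like this is easy to audit, which fits the abstract's emphasis on machine-parseable reports. The actual pipeline may well use an LLM supervisor instead.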
JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation
Authors: Kai Liu, Jungang Li, Yuchong Sun, Shengqiong Wu, Jianzhang Gao, Daoan Zhang, Wei Zhang, Sheng Jin, Sicheng Yu, Geng Zhan, Jiayi Ji, Fan Zhou, Liang Zheng, Shuicheng Yan, Hao Fei, Tat-Seng Chua
Venue: NeurIPS Spotlight
First: 2025-12-28T12:25:43+00:00 · Latest: 2026-01-02T15:48:28+00:00
Comments: Accepted by NeurIPS as a Spotlight paper. Code: https://github.com/JavisVerse/JavisGPT
Abstract
This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
中文标题/摘要
标题:JavisGPT:统一多模态大语言模型用于音视频理解和生成
本文介绍了JavisGPT,这是首个用于联合音视频(JAV)理解和生成的统一多模态大语言模型(MLLM)。JavisGPT具有简洁的编码器-大语言模型-解码器架构,包含一个同步融合模块(SyncFusion)用于时空音视频融合和同步感知可学习查询,以连接预训练的JAV-DiT生成器。这种设计使得能够从多模态指令中实现时间上一致的音视频理解和生成。我们设计了一个有效的三阶段训练管道,包括多模态预训练、音视频微调和大规模指令调优,逐步从现有的视觉语言模型中构建多模态理解和生成。在指令调优方面,我们构建了JavisInst-Omni,这是一个高质量的指令数据集,包含超过20万GPT-4o筛选的音视频文本对话,涵盖了多样性和多层次的理解和生成场景。在音视频理解和生成基准测试中,我们的实验表明JavisGPT在复杂和时间同步的设置中优于现有MLLM。
Summary / 总结
JavisGPT is presented as the first unified multimodal large language model for joint audio-video (JAV) comprehension and generation. It uses an encoder-LLM-decoder architecture with a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries that bridge a pretrained JAV-DiT generator. The model is trained in three stages: multimodal pretraining, audio-video fine-tuning, and large-scale instruction tuning on JavisInst-Omni, a dataset of over 200K GPT-4o-curated audio-video-text dialogues. Experiments show that JavisGPT outperforms existing MLLMs on JAV comprehension and generation benchmarks, particularly in complex and temporally synchronized settings.
JavisGPT 是一种统一的多模态大型语言模型,用于联合音频-视频理解和生成。它采用编码器-LLM-解码器架构,并包含一个 SyncFusion 模块进行时空融合和同步感知查询。该模型通过包括多模态预训练、音频-视频微调和指令调优的三阶段训练管道进行训练。关键实验结果表明,JavisGPT 在复杂和时间同步的设置中优于现有模型。
Detecting Performance Degradation under Data Shift in Pathology Vision-Language Model
Authors: Hao Guan, Li Zhou
First: 2026-01-02T15:12:06+00:00 · Latest: 2026-01-02T15:12:06+00:00
Comments: 8 pages, 6 figures
Abstract
Vision-Language Models have demonstrated strong potential in medical image analysis and disease diagnosis. However, after deployment, their performance may deteriorate when the input data distribution shifts from that observed during development. Detecting such performance degradation is essential for clinical reliability, yet remains challenging for large pre-trained VLMs operating without labeled data. In this study, we investigate performance degradation detection under data shift in a state-of-the-art pathology VLM. We examine both input-level data shift and output-level prediction behavior to understand their respective roles in monitoring model reliability. To facilitate systematic analysis of input data shift, we develop DomainSAT, a lightweight toolbox with a graphical interface that integrates representative shift detection algorithms and enables intuitive exploration of data shift. Our analysis shows that while input data shift detection is effective at identifying distributional changes and providing early diagnostic signals, it does not always correspond to actual performance degradation. Motivated by this observation, we further study output-based monitoring and introduce a label-free, confidence-based degradation indicator that directly captures changes in model prediction confidence. We find that this indicator exhibits a close relationship with performance degradation and serves as an effective complement to input shift detection. Experiments on a large-scale pathology dataset for tumor classification demonstrate that combining input data shift detection and output confidence-based indicators enables more reliable detection and interpretation of performance degradation in VLMs under data shift. These findings provide a practical and complementary framework for monitoring the reliability of foundation models in digital pathology.
中文标题/摘要
标题:病理视觉语言模型在数据偏移下性能退化的检测
视觉语言模型在医学图像分析和疾病诊断中展现了强大的潜力。然而,在部署后,当输入数据分布从开发期间的变化时,它们的性能可能会下降。检测这种性能退化对于临床可靠性至关重要,但对大型预训练VLMs来说,它们在没有标注数据的情况下运行,这使得检测变得具有挑战性。在本研究中,我们探讨了在先进病理VLM中数据偏移下性能退化的检测。我们研究了输入级数据偏移和输出级预测行为,以了解它们在监控模型可靠性中的各自作用。为了便于系统分析输入数据偏移,我们开发了DomainSAT,一个轻量级的图形界面工具箱,集成了代表性偏移检测算法,使数据偏移的直观探索成为可能。我们的分析表明,虽然输入数据偏移检测在识别分布变化和提供早期诊断信号方面是有效的,但它并不总是与实际性能退化相对应。受此观察的启发,我们进一步研究了基于输出的监控,并引入了一个无标签、基于置信度的退化指标,直接捕捉模型预测置信度的变化。我们发现,该指标与性能退化之间存在密切关系,并且可以作为输入偏移检测的有效补充。在大规模病理数据集上的肿瘤分类实验表明,结合输入数据偏移检测和基于输出置信度的指标,可以更可靠地检测和解释VLMs在数据偏移下的性能退化。这些发现为监测数字病理学中基础模型的可靠性提供了一个实用且互补的框架。
Summary / 总结
This study investigates how to detect performance degradation under data shift in a state-of-the-art pathology vision-language model. It develops DomainSAT, a lightweight toolbox with a graphical interface for detecting input-level data shift, and introduces a label-free, confidence-based degradation indicator for output-level monitoring. Input shift detection identifies distributional changes but does not always correspond to actual performance degradation, whereas the confidence-based indicator tracks degradation closely. On a large-scale pathology dataset for tumor classification, combining the two signals gives more reliable detection and interpretation of performance degradation in VLMs.
该研究探讨了病理视觉-语言模型在数据偏移下的性能退化问题,开发了DomainSAT轻量级工具箱来分析输入级数据偏移,并引入了无标签的、基于模型预测置信度变化的退化指标来监控输出级行为。研究发现,结合输入数据偏移检测和基于输出置信度的指标,可以更可靠地检测和解释VLMs在数据偏移下的性能退化,为数字病理学中监控模型可靠性提供了实用框架。
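The abstract describes a label-free, confidence-based degradation indicator but gives no formula. The following is a minimal sketch, assuming the indicator is the drop in mean top-class softmax probability between a reference batch and a deployment batch. The function names and this particular definition are illustrative assumptions, not the paper's.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mean_confidence(batch_logits):
    """Average top-class probability over a batch; uses no labels."""
    return sum(max(softmax(row)) for row in batch_logits) / len(batch_logits)

def confidence_drop(reference_logits, deployed_logits):
    """Hypothetical indicator: a positive value means the model is less
    confident on deployment data than on the reference data."""
    return mean_confidence(reference_logits) - mean_confidence(deployed_logits)

# Toy logits, invented for illustration only.
reference = [[4.0, 0.0], [0.0, 5.0]]   # confident predictions
shifted = [[0.5, 0.0], [0.0, 0.2]]     # near-uniform predictions
drop = confidence_drop(reference, shifted)
```

A monitoring loop could compare `drop` against a threshold chosen on held-out data. How the paper calibrates its indicator is not stated in the abstract.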
SpiderGen: Towards Procedure Generation For Carbon Life Cycle Assessments with Generative AI
Authors: Anupama Sitaraman, Bharathan Balaji, Yuvraj Agarwal
First: 2025-11-11T17:43:37+00:00 · Latest: 2026-01-02T14:43:37+00:00
Abstract
Investigating the effects of climate change and global warming caused by GHG emissions has been a key concern worldwide. These emissions are largely contributed to by the production, use and disposal of consumer products. Thus, it is important to build tools to estimate the environmental impact of consumer goods, an essential part of which is conducting Life Cycle Assessments (LCAs). LCAs specify and account for the appropriate processes involved with the production, use, and disposal of the products. We present SpiderGen, an LLM-based workflow which integrates the taxonomy and methodology of traditional LCA with the reasoning capabilities and world knowledge of LLMs to generate graphical representations of the key procedural information used for LCA, known as Product Category Rules Process Flow Graphs (PCR PFGs). We additionally evaluate the output of SpiderGen by comparing it with 65 real-world LCA documents. We find that SpiderGen provides accurate LCA process information that is either fully correct or has minor errors, achieving an F1-Score of 65% across 10 sample data points, as compared to 53% using a one-shot prompting method. We observe that the remaining errors occur primarily due to differences in detail between LCA documents, as well as differences in the "scope" of which auxiliary processes must also be included. We also demonstrate that SpiderGen performs better than several baseline techniques, such as chain-of-thought prompting and one-shot prompting. Finally, we highlight SpiderGen's potential to reduce the human effort and costs for estimating carbon impact, as it is able to produce LCA process information for less than \$1 USD in under 10 minutes as compared to the status quo LCA, which can cost over \$25000 USD and take up to 21 person-days.
中文标题/摘要
标题:SpiderGen:利用生成式AI进行碳生命周期评估程序生成的研究
全球对气候变化和温室气体排放导致的全球变暖影响的调查已成为关键关注点。这些排放主要来源于消费者产品的生产、使用和处置。因此,建立工具以估算消费品的环境影响至关重要,其中重要的一部分是进行生命周期评估(LCAs)。LCAs明确规定并核算了产品生产、使用和处置过程中涉及的适当过程。我们提出了基于LLM的工作流SpiderGen,该工作流将传统LCAs的分类和方法与LLM的推理能力和世界知识相结合,生成用于LCAs的关键程序信息的图形表示,即产品类别规则过程流程图(PCR PFGs)。我们还通过将SpiderGen的输出与65份实际的LCAs文档进行比较来评估其输出。我们发现,SpiderGen提供的LCAs过程信息要么完全正确,要么只有小错误,其F1分数在10个样本数据点中达到65%,而使用一次提示方法则为53%。我们观察到,剩余的错误主要由于LCAs文档之间的细节差异以及必须包括的辅助过程范围不同所致。我们还证明了SpiderGen在多个基准技术(如链式思考提示和一次提示)中表现更优。最后,我们强调了SpiderGen减少估算碳影响所需的人力和成本的潜力,因为它可以在不到10分钟内以不到1美元的成本生成LCAs过程信息,而现状的LCAs成本可能超过25000美元,需要21个人日。
Summary / 总结
SpiderGen is an LLM-based workflow that integrates the taxonomy and methodology of traditional Life Cycle Assessment (LCA) with the reasoning capabilities and world knowledge of large language models (LLMs) to generate Product Category Rules Process Flow Graphs (PCR PFGs), graphical representations of the key procedural information used in LCA. Evaluated against 65 real-world LCA documents, SpiderGen achieves an F1-Score of 65% across 10 sample data points, compared to 53% for one-shot prompting, with the remaining errors mainly due to differences in detail and scope between LCA documents. It produces LCA process information for less than $1 USD in under 10 minutes, whereas a status quo LCA can cost over $25,000 USD and take up to 21 person-days.
SpiderGen 是一种结合传统生命周期评估 (LCA) 的分类和方法与大型语言模型 (LLM) 的推理能力的工作流,生成准确的产品类别规则过程流程图 (PCR PFG) 的图形表示。评估显示,SpiderGen 的 F1-Score 达到 65%,而单次提示方法仅为 53%,主要错误源于 LCA 文档之间细节和范围的差异。SpiderGen 显著降低了生命周期评估的成本和时间,只需不到 10 分钟和不到 1 美元即可生成 LCA 过程信息,而传统 LCA 则需要 25,000 美元和多达 21 人天。
Bayesian Inverse Games with High-Dimensional Multi-Modal Observations
Authors: Yash Jain, Xinjie Liu, Lasse Peters, David Fridovich-Keil, Ufuk Topcu
First: 2026-01-02T14:23:38+00:00 · Latest: 2026-01-02T14:23:38+00:00
Abstract
Many multi-agent interaction scenarios can be naturally modeled as noncooperative games, where each agent's decisions depend on others' future actions. However, deploying game-theoretic planners for autonomous decision-making requires a specification of all agents' objectives. To circumvent this practical difficulty, recent work develops maximum likelihood techniques for solving inverse games that can identify unknown agent objectives from interaction data. Unfortunately, these methods only infer point estimates and do not quantify estimator uncertainty; correspondingly, downstream planning decisions can overconfidently commit to unsafe actions. We present an approximate Bayesian inference approach for solving the inverse game problem, which can incorporate observation data from multiple modalities and be used to generate samples from the Bayesian posterior over the hidden agent objectives given limited sensor observations in real time. Concretely, the proposed Bayesian inverse game framework trains a structured variational autoencoder with an embedded differentiable Nash game solver on interaction datasets and does not require labels of agents' true objectives. Extensive experiments show that our framework successfully learns prior and posterior distributions, improves inference quality over maximum likelihood estimation-based inverse game approaches, and enables safer downstream decision-making without sacrificing efficiency. When trajectory information is uninformative or unavailable, multimodal inference further reduces uncertainty by exploiting additional observation modalities.
中文标题/摘要
标题:高维多模态观测的贝叶斯逆博弈
许多多智能体交互场景可以自然地建模为非合作博弈,其中每个智能体的决策依赖于其他智能体的未来行动。然而,为自主决策部署博弈论规划器需要明确所有智能体的目标。为克服这一实际困难,近期的工作开发了最大似然技术来解决逆博弈问题,可以从交互数据中识别出未知智能体的目标。不幸的是,这些方法只能推断出点估计,而不量化估计器的不确定性;相应地,下游规划决策可能会过于自信地采取不安全的行动。我们提出了一种近似贝叶斯推理方法来解决逆博弈问题,可以结合多模态观测数据,并在有限的传感器观测下实时生成给定隐藏智能体目标的贝叶斯后验分布的样本。具体而言,所提出的贝叶斯逆博弈框架在交互数据集上训练了一个嵌入可微纳什博弈求解器的结构化变分自编码器,并不需要智能体真实目标的标签。广泛的实验表明,我们的框架成功地学习了先验和后验分布,相比基于最大似然估计的逆博弈方法提高了推理质量,并在不牺牲效率的情况下使下游决策更加安全。当轨迹信息不具信息性或不可用时,多模态推理进一步通过利用其他观测模态来减少不确定性。
Summary / 总结
The paper develops an approximate Bayesian inference framework for inverse games, in which unknown agent objectives are inferred from interaction data. The framework trains a structured variational autoencoder with an embedded differentiable Nash game solver and requires no labels of agents' true objectives. Unlike maximum likelihood approaches, which yield only point estimates, it generates samples from the posterior over hidden objectives in real time. Experiments show improved inference quality over maximum likelihood baselines and safer downstream decision-making without sacrificing efficiency; when trajectory information is uninformative or unavailable, multimodal inference further reduces uncertainty.
论文通过开发贝叶斯逆博弈框架来解决多智能体场景中的自主决策问题。该框架使用近似贝叶斯推理方法从多模态观测中推断隐藏的智能体目标,并提供不确定性量化,从而实现更安全的规划决策。实验表明,所提出的方法在推理质量上优于基于最大似然估计的逆博弈方法,并能够在不牺牲效率的情况下实现更安全的下游决策。
PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective
Authors: Tim Tsz-Kit Lau, Qi Long, Weijie Su
First: 2025-05-27T22:11:21+00:00 · Latest: 2026-01-02T14:20:32+00:00
Abstract
The ever-growing scale of deep learning models and training data underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training neural networks and large language models, structure-aware preconditioned optimizers like Shampoo and Muon, which utilize the matrix structure of gradients, have demonstrated promising evidence of faster convergence. In this paper, we introduce a unifying framework for analyzing "matrix-aware" preconditioned methods, which not only sheds light on the effectiveness of Muon and related optimizers but also leads to a class of new structure-aware preconditioned methods. A key contribution of this framework is its precise distinction between preconditioning strategies that treat neural network weights as vectors (addressing curvature anisotropy) versus those that consider their matrix structure (addressing gradient anisotropy). This perspective provides new insights into several empirical phenomena in language model pre-training, including Adam's training instabilities, Muon's accelerated convergence, and the necessity of learning rate warmup for Adam. Building upon this framework, we introduce PolarGrad, a new class of preconditioned optimization methods based on the polar decomposition of matrix-valued gradients. As a special instance, PolarGrad includes Muon with updates scaled by the nuclear norm of the gradients. We provide numerical implementations of these methods, leveraging efficient numerical polar decomposition algorithms for enhanced convergence. Our extensive evaluations across diverse matrix optimization problems and language model pre-training tasks demonstrate that PolarGrad outperforms both Adam and Muon.
中文标题/摘要
标题:PolarGrad:一种从统一预条件化视角出发的矩阵梯度优化器
深度学习模型和训练数据的不断增长凸显了高效优化方法的重要性。尽管Adam和AdamW等预条件化梯度方法是训练神经网络和大型语言模型的事实上的优化器,但利用梯度矩阵结构的结构感知预条件化优化器,如Shampoo和Muon,已经展示了更快收敛的有希望的证据。在本文中,我们提出了一种统一的框架来分析“矩阵感知”预条件化方法,不仅阐明了Muon及其相关优化器的有效性,还导致了一类新的结构感知预条件化方法。该框架的一个关键贡献是它精确区分了将神经网络权重视为向量(解决曲率各向异性)与考虑其矩阵结构(解决梯度各向异性)的预条件化策略。这种视角为语言模型预训练中的几个经验现象提供了新的见解,包括Adam的训练不稳定性、Muon的加速收敛以及Adam的学习率预热的必要性。基于该框架,我们引入了PolarGrad,这是一种基于矩阵值梯度的极分解的新类预条件化优化方法。作为特殊情况,PolarGrad 包括用梯度核范数缩放的Muon。我们利用高效的数值极分解算法提供了这些方法的数值实现,以增强收敛性。我们在各种矩阵优化问题和语言模型预训练任务中的广泛评估表明,PolarGrad 在性能上优于Adam和Muon。
Summary / 总结
This paper introduces a unifying framework for analyzing matrix-aware preconditioned optimization methods and, building on it, PolarGrad, a new class of optimizers based on the polar decomposition of matrix-valued gradients. The framework distinguishes preconditioning strategies that treat weights as vectors (addressing curvature anisotropy) from those that use their matrix structure (addressing gradient anisotropy), which offers insight into Adam's training instabilities, Muon's accelerated convergence, and the need for learning rate warmup with Adam. PolarGrad includes, as a special instance, Muon with updates scaled by the nuclear norm of the gradients, and it outperforms both Adam and Muon across matrix optimization problems and language model pre-training tasks.
本文提出了一个用于分析矩阵感知预条件优化方法的统一框架,区分了针对曲率各向异性与针对梯度各向异性的预条件化策略。基于此框架,提出了基于矩阵值梯度极分解的一类新优化器PolarGrad,其特例为按梯度核范数缩放更新的Muon。广泛评估表明,PolarGrad在各种矩阵优化问题和语言模型预训练任务中均优于Adam和Muon。
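The abstract states that PolarGrad builds on the polar decomposition of the gradient matrix and scales the update by the gradient's nuclear norm. A minimal sketch of one such step follows, assuming a full-rank gradient and the update rule W ← W − lr · ‖G‖_* · U, where G = UH is the polar decomposition. The cubic Newton-Schulz iteration used here is a stand-in chosen for brevity; the paper's own polar decomposition algorithm, momentum handling, and hyperparameters are not given in the abstract, and all function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def polar_factor(g, steps=30):
    """Approximate the orthogonal polar factor U of a full-rank matrix G
    with the Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X."""
    fro = sum(v * v for row in g for v in row) ** 0.5
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # which is inside the iteration's convergence region.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

def nuclear_norm(g, u):
    """With G = U H, the nuclear norm equals trace(U^T G) = <U, G>."""
    return sum(a * b for r1, r2 in zip(u, g) for a, b in zip(r1, r2))

def polargrad_step(w, g, lr=0.1):
    """One illustrative update: W <- W - lr * ||G||_* * U (assumed form)."""
    u = polar_factor(g)
    scale = nuclear_norm(g, u)
    return [[wv - lr * scale * uv for wv, uv in zip(rw, ru)] for rw, ru in zip(w, u)]

# Toy gradient with singular values 2 and 1, invented for illustration.
grad = [[0.0, 2.0], [-1.0, 0.0]]
u = polar_factor(grad)
weights = polargrad_step([[1.0, 0.0], [0.0, 1.0]], grad)
```

The polar factor discards the singular values, so every direction of the gradient moves by the same amount, and the nuclear norm restores an overall scale tied to the gradient's magnitude.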
A Vision-and-Knowledge Enhanced Large Language Model for Generalizable Pedestrian Crossing Behavior Inference
Authors: Qingwen Pu, Kun Xie, Hong Yang, Guocong Zhai
First: 2026-01-02T14:13:28+00:00 · Latest: 2026-01-02T14:13:28+00:00
Abstract
Existing paradigms for inferring pedestrian crossing behavior, ranging from statistical models to supervised learning methods, demonstrate limited generalizability and perform inadequately on new sites. Recent advances in Large Language Models (LLMs) offer a shift from numerical pattern fitting to semantic, context-aware behavioral reasoning, yet existing LLM applications lack domain-specific adaptation and visual context. This study introduces Pedestrian Crossing LLM (PedX-LLM), a vision-and-knowledge enhanced framework designed to transform pedestrian crossing inference from site-specific pattern recognition to generalizable behavioral reasoning. By integrating LLaVA-extracted visual features with textual data and transportation domain knowledge, PedX-LLM fine-tunes a LLaMA-2-7B foundation model via Low-Rank Adaptation (LoRA) to infer crossing decisions. PedX-LLM achieves 82.0% balanced accuracy, outperforming the best statistical and supervised learning methods. Results demonstrate that the vision-augmented module contributes a 2.9% performance gain by capturing the built environment and integrating domain knowledge yields an additional 4.1% improvement. To evaluate generalizability across unseen environments, cross-site validation was conducted using site-based partitioning. The zero-shot PedX-LLM configuration achieves 66.9% balanced accuracy on five unseen test sites, outperforming the baseline data-driven methods by at least 18 percentage points. Incorporating just five validation examples via few-shot learning to PedX-LLM further elevates the balanced accuracy to 72.2%. PedX-LLM demonstrates strong generalizability to unseen scenarios, confirming that vision-and-knowledge-enhanced reasoning enables the model to mimic human-like decision logic and overcome the limitations of purely data-driven methods.
中文标题/摘要
标题:一种增强视觉与知识的大语言模型以实现行人过街行为的泛化推理
现有的行人过街行为推理范式,从统计模型到监督学习方法,表现出有限的泛化能力,并且在新地点上表现不佳。最近在大语言模型(LLMs)方面的进展提供了一种从数值模式拟合到语义、上下文感知的行为推理的转变,但现有的LLM应用缺乏领域特定的适应性和视觉上下文。本研究引入了行人过街大语言模型(PedX-LLM),这是一种增强视觉与知识的框架,旨在将行人过街推理从特定地点的模式识别转变为可泛化的行为推理。通过结合LLaVA提取的视觉特征、文本数据和交通领域知识,PedX-LLM通过低秩适应(LoRA)微调LLaMA-2-7B基础模型以推断过街决策。PedX-LLM实现了82.0%的平衡准确率,优于最佳统计和监督学习方法。结果表明,增强视觉模块通过捕获环境特征贡献了2.9%的性能提升,结合领域知识进一步提高了4.1%。为了评估在未见过的环境中的泛化能力,使用基于地点的划分进行了跨地点验证。零样本PedX-LLM配置在五个未见过的测试地点上实现了66.9%的平衡准确率,优于基线数据驱动方法至少18个百分点。通过少量样本学习引入五个验证示例进一步将平衡准确率提升至72.2%。PedX-LLM展示了强大的未见过场景的泛化能力,证实了增强视觉与知识推理使模型能够模仿人类的决策逻辑并克服纯数据驱动方法的局限性。
Summary / 总结
This study introduces Pedestrian Crossing LLM (PedX-LLM), a vision-and-knowledge enhanced framework that integrates LLaVA-extracted visual features, textual data, and transportation domain knowledge to improve the generalizability of pedestrian crossing behavior inference. PedX-LLM fine-tunes a LLaMA-2-7B foundation model via Low-Rank Adaptation (LoRA) and achieves 82.0% balanced accuracy, outperforming the best statistical and supervised learning methods. The vision-augmented module contributes a 2.9% performance gain and domain knowledge an additional 4.1%. In cross-site validation, the zero-shot configuration achieves 66.9% balanced accuracy on five unseen test sites, outperforming baseline data-driven methods by at least 18 percentage points; adding just five validation examples via few-shot learning raises this to 72.2%.
本研究通过引入结合视觉特征和领域知识的Pedestrian Crossing LLM(PedX-LLM),解决了现有方法在行人过街行为推断方面的局限性。PedX-LLM 使用 Low-Rank Adaptation (LoRA) 对 LLaMA-2-7B 模型进行微调,实现了 82.0% 的平衡准确率,优于统计和监督学习方法。视觉增强模块和领域知识分别贡献了 2.9% 和 4.1% 的性能提升。跨站点验证显示,PedX-LLM 在未见过的站点上可以达到 66.9% 的平衡准确率,优于基线数据驱动方法至少 18 个百分点,并且通过少量样本学习进一步提高了平衡准确率。
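The results above are reported as balanced accuracy, which is the mean of per-class recall and so does not reward always predicting the majority class. A minimal sketch of the standard definition follows. The labels are toy values invented for illustration and are not data from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy labels (assumed encoding): 1 = pedestrian crosses, 0 = pedestrian waits.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
```

On these toy labels plain accuracy is 0.9 while balanced accuracy is 0.75, because half of the minority class is misclassified.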
ARISE: Adaptive Reinforcement Integrated with Swarm Exploration
Authors: Rajiv Chaitanya M, D R Ramesh Babu
First: 2026-01-02T14:09:22+00:00 · Latest: 2026-01-02T14:09:22+00:00
Comments: 12 pages. Accepted for presentation at WCSC 2026
Abstract
Effective exploration remains a key challenge in RL, especially with non-stationary rewards or high-dimensional policies. We introduce ARISE, a lightweight framework that enhances reinforcement learning by augmenting standard policy-gradient methods with a compact swarm-based exploration layer. ARISE blends policy actions with particle-driven proposals, where each particle represents a candidate policy trajectory sampled in the action space, and modulates exploration adaptively using reward-variance cues. While easy benchmarks exhibit only slight improvements (e.g., +0.7% on CartPole-v1), ARISE yields substantial gains on more challenging tasks, including +46% on LunarLander-v3 and +22% on Hopper-v4, while preserving stability on Walker2d and Ant. Under non-stationary reward shifts, ARISE provides marked robustness advantages, outperforming PPO by +75 points on CartPole and improving LunarLander accordingly. Ablation studies confirm that both the swarm component and the adaptive mechanism contribute to the performance. Overall, ARISE offers a simple, architecture-agnostic route to more exploratory and resilient RL agents without altering core algorithmic structures.
中文标题/摘要
标题:ARISE:自适应强化学习与群集探索集成
有效的探索仍然是强化学习(RL)中的一个关键挑战,尤其是在非平稳奖励或高维策略的情况下。我们引入了ARISE,这是一种轻量级框架,通过将紧凑的群集探索层与标准策略梯度方法相结合来增强强化学习。ARISE将策略动作与粒子驱动的提案相结合,其中每个粒子代表在动作空间中采样的候选策略轨迹,并使用奖励方差提示自适应地调节探索。尽管在简单的基准测试中仅表现出轻微的改进(例如,在CartPole-v1上提高了0.7%),但在更具挑战性的任务中,ARISE却取得了显著的提升,包括在LunarLander-v3上提高了46%,在Hopper-v4上提高了22%,同时在Walker2d和Ant上保持了稳定性。在非平稳奖励变化下,ARISE提供了显著的鲁棒性优势,在CartPole上比PPO提高了75分,在LunarLander上也相应地提高了表现。消融研究证实,群集组件和自适应机制都对性能有所贡献。总体而言,ARISE提供了一种简单且架构无关的方法,以实现更具探索性和鲁棒性的RL代理,而不改变核心算法结构。
Summary / 总结
ARISE is a lightweight framework that improves reinforcement learning by integrating a swarm-based exploration layer with standard policy-gradient methods. It enhances exploration by blending policy actions with particle-driven proposals, where each particle represents a candidate policy trajectory. ARISE shows significant improvements on challenging tasks, such as +46% on LunarLander-v3 and +22% on Hopper-v4, while maintaining stability on Walker2d and Ant. It also demonstrates robustness under non-stationary reward shifts, outperforming PPO by +75 points on CartPole. Ablation studies confirm the contributions of both the swarm component and the adaptive mechanism to performance.
ARISE 是一种轻量级的强化学习框架,通过引入基于粒子的探索层来增强标准的策略梯度方法。该方法使用粒子在动作空间中采样候选策略轨迹,并根据奖励方差进行探索调节。ARISE 在 LunarLander-v3 (+46%) 和 Hopper-v4 (+22%) 等具有挑战性的任务上表现出显著改进,同时在 Walker2d 和 Ant 上保持稳定性。此外,它在非平稳奖励变化下表现出色,CartPole 上优于 PPO 75 分。消融研究证实了粒子组件和自适应机制的重要性。
Beyond Accuracy: What Matters in Designing Well-Behaved Image Classification Models?
Authors: Robin Hesse, Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Stefan Roth
First: 2025-03-21T12:54:18+00:00 · Latest: 2026-01-02T14:05:54+00:00
Comments: Published in TMLR (12/2025) | OpenReview: https://openreview.net/forum?id=E7HDtLCoT6 | Project page: https://visinf.github.io/beyond-accuracy/
Abstract
Deep learning has become an essential part of computer vision, with deep neural networks (DNNs) excelling in predictive performance. However, they often fall short in other critical quality dimensions, such as robustness, calibration, or fairness. While existing studies have focused on a subset of these quality dimensions, none have explored a more general form of "well-behavedness" of DNNs. With this work, we address this gap by simultaneously studying nine different quality dimensions for image classification. Through a large-scale study, we provide a bird's-eye view by analyzing 326 backbone models and how different training paradigms and model architectures affect these quality dimensions. We reveal various new insights such that (i) vision-language models exhibit high class balance on ImageNet-1k classification and strong robustness against domain changes; (ii) training models initialized with weights obtained through self-supervised learning is an effective strategy to improve most considered quality dimensions; and (iii) the training dataset size is a major driver for most of the quality dimensions. We conclude our study by introducing the QUBA score (Quality Understanding Beyond Accuracy), a novel metric that ranks models across multiple dimensions of quality, enabling tailored recommendations based on specific user needs.
中文标题/摘要
标题:超越准确性:设计良好行为的图像分类模型需要考虑什么?
深度学习已成为计算机视觉不可或缺的一部分,深度神经网络(DNNs)在预测性能方面表现出色。然而,它们在其他关键质量维度,如鲁棒性、校准或公平性方面往往表现不佳。虽然现有研究集中在这些质量维度的一部分,但没有研究探索DNNs的更一般形式的“良好行为”。通过这项工作,我们填补了这一空白,同时研究了图像分类中的九种不同质量维度。通过大规模研究,我们分析了326个骨干模型以及不同训练范式和模型架构如何影响这些质量维度。我们揭示了各种新的见解,例如:(i) 视觉语言模型在ImageNet-1k分类中表现出高类别平衡,并且对领域变化具有很强的鲁棒性;(ii) 使用自监督学习获得的权重初始化模型是一种有效策略,可以提高大多数考虑的质量维度;(iii) 数据集大小是大多数质量维度的主要驱动因素。我们通过引入QUBA评分(超越准确性理解的质量),一种多维度质量的新型度量标准,总结了我们的研究,该度量标准可以根据特定用户需求提供定制化建议。
Summary / 总结
This study addresses the gap in understanding the well-behavedness of deep neural networks in image classification by examining nine quality dimensions. Through a large-scale analysis of 326 backbone models, the authors find that vision-language models have high class balance and robustness, self-supervised learning initialization improves most quality dimensions, and dataset size is a major driver for quality. They introduce the QUBA score, a metric that ranks models across multiple quality dimensions for tailored recommendations.
研究通过评估326个骨干模型在九个质量维度上的表现,解决了深度神经网络在准确度之外的局限性,如鲁棒性和公平性。研究发现,视觉语言模型在ImageNet-1k分类中具有高类平衡和强鲁棒性,并且使用自监督学习初始化权重可以提高大多数质量维度。研究还发现,训练数据集的大小对大多数质量维度有显著影响。研究引入了QUBA评分(质量理解超越准确度),这是一种新的指标,用于在多个质量维度上对模型进行排名,从而根据用户需求提供定制化建议。
PrivTune: Efficient and Privacy-Preserving Fine-Tuning of Large Language Models via Device-Cloud Collaboration
Authors: Yi Liu, Weixiang Han, Chengjun Cai, Xingliang Yuan, Cong Wang
First: 2025-12-09T17:03:59+00:00 · Latest: 2026-01-02T14:03:16+00:00
Comments: Accepted at IEEE INFOCOM 2026 (full version)
Abstract
With the rise of large language models, service providers offer language models as a service, enabling users to fine-tune customized models via uploaded private datasets. However, this raises concerns about sensitive data leakage. Prior methods, relying on differential privacy within device-cloud collaboration frameworks, struggle to balance privacy and utility, exposing users to inference attacks or degrading fine-tuning performance. To address this, we propose PrivTune, an efficient and privacy-preserving fine-tuning framework via Split Learning (SL). The key idea of PrivTune is to inject crafted noise into token representations from the SL bottom model, making each token resemble the $n$-hop indirect neighbors. PrivTune formulates this as an optimization problem to compute the optimal noise vector, aligning with defense-utility goals. On this basis, it then adjusts the parameters (i.e., mean) of the $d_χ$-Privacy noise distribution to align with the optimization direction and scales the noise according to token importance to minimize distortion. Experiments on five datasets (covering both classification and generation tasks) against three embedding inversion and three attribute inference attacks show that, using RoBERTa on the Stanford Sentiment Treebank dataset, PrivTune reduces the attack success rate to 10% with only a 3.33% drop in utility performance, outperforming state-of-the-art baselines.
中文标题/摘要
标题:PrivTune:通过设备-云协作的高效且保护隐私的大语言模型微调框架
随着大语言模型的兴起,服务提供商提供语言模型作为服务,使用户能够通过上传的私人数据集微调定制模型。然而,这引发了敏感数据泄露的担忧。先前的方法依赖于设备-云协作框架中的差分隐私,难以在隐私和实用性之间取得平衡,使用户面临推理攻击或微调性能下降的风险。为了解决这一问题,我们提出了PrivTune,一种通过拆分学习(SL)实现的高效且保护隐私的微调框架。PrivTune的核心思想是在SL底层模型的标记表示中注入精心设计的噪声,使每个标记看起来像是$n$跳间接邻居。PrivTune将这一过程表述为一个优化问题,以计算最优噪声向量,与防御-实用性目标相一致。在此基础上,它调整$d_χ$隐私噪声分布的参数(即均值),使其与优化方向一致,并根据标记的重要性缩放噪声以最小化失真。在五个数据集(涵盖分类和生成任务)上与三种嵌入反向工程和三种属性推理攻击的实验表明,使用RoBERTa在斯坦福情感树库数据集上,PrivTune将攻击成功率降低到10%,同时仅使实用性性能下降3.33%,优于最先进的基线。
Summary / 总结
PrivTune is an efficient and privacy-preserving fine-tuning framework for large language models that uses Split Learning (SL). It injects crafted noise into the token representations produced by the SL bottom model so that each token resembles its $n$-hop indirect neighbors, reducing the risk of sensitive data leakage. Using RoBERTa on the Stanford Sentiment Treebank dataset, PrivTune reduces the attack success rate to 10% with only a 3.33% drop in utility performance, outperforming state-of-the-art baselines.
PrivTune 是一种通过 Split Learning 实现高效且保护隐私的大型语言模型微调框架。它向 token 表示中注入精心设计的噪声以抵御推理攻击,同时尽量保持性能。实验表明,在斯坦福情感树库数据集上使用 RoBERTa 时,PrivTune 将攻击成功率降低到 10%,实用性性能仅下降 3.33%,优于现有方法。
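The abstract says PrivTune adjusts the mean of a $d_χ$-Privacy noise distribution and scales the noise by token importance. A minimal sketch follows, assuming the common Euclidean $d_χ$-privacy mechanism (noise density proportional to exp(-ε‖z‖), sampled as a uniform direction times a Gamma-distributed radius). The mean shift, the `1 - importance` scaling, and all function names are hypothetical placeholders; the paper's optimization for the noise vector is not reproduced here.

```python
import math
import random

def sample_dchi_noise(dim, epsilon, rng):
    """Draw z with density proportional to exp(-epsilon * ||z||):
    uniform direction on the sphere, radius ~ Gamma(shape=dim, scale=1/epsilon)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in direction))
    radius = rng.gammavariate(dim, 1.0 / epsilon)
    return [radius * v / norm for v in direction]

def perturb_token(embedding, epsilon, importance, mean_shift, rng):
    """Hypothetical perturbation: shift the noise mean, then shrink the
    total perturbation for important tokens (importance in [0, 1])."""
    noise = sample_dchi_noise(len(embedding), epsilon, rng)
    scale = 1.0 - importance  # assumed rule: important tokens are distorted less
    return [e + scale * (m + n) for e, m, n in zip(embedding, mean_shift, noise)]

rng = random.Random(0)
embedding = [0.2, -0.1, 0.4, 0.0]
shift = [0.05, 0.0, 0.0, 0.0]  # stand-in for an optimized direction
noisy = perturb_token(embedding, epsilon=2.0, importance=0.3, mean_shift=shift, rng=rng)
```

Rescaling the noise changes the privacy level actually delivered, so a real implementation has to account for that in its privacy analysis; this sketch only shows the mechanics.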