arXiv Paper Digest

Heterogeneous Low-Bandwidth Pre-Training of LLMs

Authors: Yazan Obeidi, Amir Sarfi, Joel Lidin, Paul Janson, Eugene Belilovsky

First: 2026-01-05T18:59:57+00:00 · Latest: 2026-01-05T18:59:57+00:00

Abstract

Pre-training large language models (LLMs) increasingly requires distributed compute, yet bandwidth constraints make it difficult to scale beyond well-provisioned datacenters, especially when model parallelism forces frequent, large inter-device communications. We study whether SparseLoCo, a low-communication data parallel method based on infrequent synchronization and sparse pseudo-gradient exchange, can be combined with low-bandwidth pipeline model parallelism via activation and activation-gradient compression. We introduce a heterogeneous distributed training framework where some participants host full replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly instantiate a replica using pipeline parallelism with subspace-projected inter-stage communication. To make the recently introduced subspace pipeline compression compatible with SparseLoCo, we study a number of adaptations. Across large-scale language modeling experiments (178M-1B parameters) on standard pretraining corpora, we find that activation compression composes with SparseLoCo at modest cost, while selective (heterogeneous) compression consistently improves the loss-communication tradeoff relative to compressing all replicas, especially at aggressive compression ratios. These results suggest a practical path to incorporating low-bandwidth model parallelism and heterogeneous participants into LLM pre-training.

Summary

This work asks whether SparseLoCo, a low-communication data-parallel method, can be combined with low-bandwidth pipeline model parallelism that compresses activations and activation gradients. It introduces a heterogeneous distributed training framework in which some participants host full model replicas on high-bandwidth interconnects, while resource-limited participants are grouped to jointly form one replica via pipeline parallelism with subspace-projected inter-stage communication. In language-modeling experiments at 178M-1B parameters, activation compression composes with SparseLoCo at modest cost, and selective (heterogeneous) compression improves the loss-communication tradeoff relative to compressing all replicas, especially at aggressive compression ratios.

VINO: A Unified Visual Generator with Interleaved OmniModal Context

Authors: Junyi Chen, Tong He, Zhoujie Fu, Pengfei Wan, Kun Gai, Weicai Ye

First: 2026-01-05T18:56:34+00:00 · Latest: 2026-01-05T18:56:34+00:00

Comments: Project page: https://sotamak1r.github.io/VINO-web/

Abstract

We present VINO, a unified visual generator that performs image and video generation and editing within a single framework. Instead of relying on task-specific models or independent modules for each modality, VINO uses a shared diffusion backbone that conditions on text, images and videos, enabling a broad range of visual creation and editing tasks under one model. Specifically, VINO couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT), where multimodal inputs are encoded as interleaved conditioning tokens, and then used to guide the diffusion process. This design supports multi-reference grounding, long-form instruction following, and coherent identity preservation across static and dynamic content, while avoiding modality-specific architectural components. To train such a unified system, we introduce a multi-stage training pipeline that progressively expands a video generation base model into a unified, multi-task generator capable of both image and video input and output. Across diverse generation and editing benchmarks, VINO demonstrates strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits. Our results highlight a practical path toward scalable unified visual generation, and the promise of interleaved, in-context computation as a foundation for general-purpose visual creation.

Summary

VINO is a unified visual generator that performs image and video generation and editing within a single framework. It couples a vision-language model (VLM) with a Multimodal Diffusion Transformer (MMDiT): text, image, and video inputs are encoded as interleaved conditioning tokens that guide a shared diffusion backbone. A multi-stage training pipeline progressively expands a video generation base model into a multi-task generator. Across diverse generation and editing benchmarks, VINO shows strong visual quality, faithful instruction following, improved reference and attribute preservation, and more controllable multi-identity edits.

SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?

Authors: Kenny Workman, Zhen Yang, Harihara Muralidharan, Hannah Le

Venue: NeurIPS 2024

First: 2025-12-26T07:40:11+00:00 · Latest: 2026-01-05T18:55:51+00:00

Comments: 10 pages, 9 figures, 4 tables; NeurIPS 2024 format

Abstract

Spatial transcriptomics assays are rapidly increasing in scale and complexity, making computational analysis a major bottleneck in biological discovery. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world spatial datasets. We introduce SpatialBench, a benchmark of 146 verifiable problems derived from practical spatial analysis workflows spanning five spatial technologies and seven task categories. Each problem provides a snapshot of experimental data immediately prior to an analysis step and a deterministic grader that evaluates recovery of a key biological result. Benchmark data on frontier models shows that base model accuracy remains low (20-38% across model families), with strong model-task and model-platform interactions. Harness design has a large empirical effect on performance, indicating that tools, prompts, control flow, and execution environment should be evaluated and improved as first-class objects. SpatialBench serves both as a measurement tool and a diagnostic lens for developing agents that can interact with real spatial datasets faithfully, transparently, and reproducibly.

Summary

SpatialBench evaluates whether AI agents can analyze real-world spatial biology data, using 146 verifiable problems drawn from practical analysis workflows across five spatial technologies and seven task categories. Base model accuracy on frontier models remains low (20-38% across model families), with strong model-task and model-platform interactions. Harness design has a large empirical effect on performance, which suggests that tools, prompts, control flow, and execution environment should be evaluated and improved as first-class objects.

DARC: Drum accompaniment generation with fine-grained rhythm control

Authors: Trey Brosnan

First: 2026-01-05T18:55:43+00:00 · Latest: 2026-01-05T18:55:43+00:00

Abstract

In music creation, rapid prototyping is essential for exploring and refining ideas, yet existing generative tools often fall short when users require both structural control and stylistic flexibility. Prior approaches in stem-to-stem generation can condition on other musical stems but offer limited control over rhythm, and timbre-transfer methods allow users to specify specific rhythms, but cannot condition on musical context. We introduce DARC, a generative drum accompaniment model that conditions both on musical context from other stems and explicit rhythm prompts such as beatboxing or tapping tracks. Using parameter-efficient fine-tuning, we augment STAGE, a state-of-the-art drum stem generator, with fine-grained rhythm control while maintaining musical context awareness.

Summary

DARC is a generative drum accompaniment model that conditions on both musical context from other stems and explicit rhythm prompts such as beatboxing or tapping tracks. Using parameter-efficient fine-tuning, it augments STAGE, a state-of-the-art drum stem generator, with fine-grained rhythm control while maintaining musical context awareness. The goal is to give users structural control and stylistic flexibility together, in support of rapid prototyping during music creation.

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Authors: Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto

First: 2026-01-05T18:55:32+00:00 · Latest: 2026-01-05T18:55:32+00:00

Comments: Project page: https://sparkstj.github.io/talk2move

Abstract

We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations (such as translating, rotating, or resizing objects) due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.

Summary

Talk2Move is a reinforcement-learning-based diffusion framework for text-instructed, object-level geometric transformations (translation, rotation, resizing) in scenes. It applies Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, which removes the need for costly paired data. Object-centric spatial rewards score displacement, rotation, and scaling directly, while off-policy step evaluation and active step sampling focus learning on informative transformation stages. On curated benchmarks, Talk2Move outperforms existing text-guided editing approaches in spatial accuracy and scene coherence.
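
The group-relative idea behind GRPO can be illustrated with a short sketch: each rollout's reward is compared against the other rollouts sampled for the same input, so no learned value function is needed. The version below assumes the commonly used mean and standard-deviation normalization; the reward values are made-up stand-ins for the paper's spatial rewards, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical spatial rewards for four rollouts of the same edit instruction.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive a positive advantage and are reinforced; those below the mean are discouraged, and a rollout exactly at the mean contributes no gradient signal.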

Explainable AI Technique in Lung Cancer Detection Using Convolutional Neural Networks

Authors: Nishan Rai, Sujan Khatri, Devendra Risal

First: 2025-08-13T21:02:38+00:00 · Latest: 2026-01-05T18:51:53+00:00

Comments: 11 pages, 9 figures, 4 tables. Undergraduate research project report

Abstract

Early detection of lung cancer is critical to improving survival outcomes. We present a deep learning framework for automated lung cancer screening from chest computed tomography (CT) images with integrated explainability. Using the IQ-OTH/NCCD dataset (1,197 scans across Normal, Benign, and Malignant classes), we evaluate a custom convolutional neural network (CNN) and three fine-tuned transfer learning backbones: DenseNet121, ResNet152, and VGG19. Models are trained with cost-sensitive learning to mitigate class imbalance and evaluated via accuracy, precision, recall, F1-score, and ROC-AUC. While ResNet152 achieved the highest accuracy (97.3%), DenseNet121 provided the best overall balance in precision, recall, and F1 (up to 92%, 90%, 91%, respectively). We further apply Shapley Additive Explanations (SHAP) to visualize evidence contributing to predictions, improving clinical transparency. Results indicate that CNN-based approaches augmented with explainability can provide fast, accurate, and interpretable support for lung cancer screening, particularly in resource-limited settings.

Summary

This work presents a deep learning framework with integrated explainability for lung cancer screening from chest CT images. A custom CNN and three fine-tuned transfer learning backbones (DenseNet121, ResNet152, and VGG19) are trained with cost-sensitive learning and evaluated on the IQ-OTH/NCCD dataset (1,197 scans). ResNet152 achieved the highest accuracy (97.3%), while DenseNet121 provided the best overall balance of precision, recall, and F1-score (up to 92%, 90%, and 91%, respectively). SHAP is used to visualize the evidence contributing to predictions, improving clinical transparency. The authors conclude that explainable CNN-based approaches can offer fast, accurate, and interpretable support for lung cancer screening, particularly in resource-limited settings.

Scaling Open-Ended Reasoning to Predict the Future

Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping

First: 2025-12-31T18:59:51+00:00 · Latest: 2026-01-05T18:45:47+00:00

Comments: 45 pages

Abstract

High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval, and an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.

Summary

This work trains language models to answer open-ended forecasting questions. The authors synthesize questions from global events reported in daily news with a fully automated curation recipe, and train Qwen3 thinking models on the resulting dataset, OpenForesight. An offline news corpus is used for both data generation and retrieval to prevent leakage of future information. In held-out testing from May to August 2025, the specialized OpenForecaster 8B matches much larger proprietary models, and the training improves the accuracy, calibration, and consistency of its predictions. The calibration improvements generalize across popular benchmarks, and all models, code, and data are open-sourced.

Falcon-H1R: Pushing the Reasoning Frontiers with a Hybrid Model for Efficient Test-Time Scaling

Authors: Falcon LLM Team, Iheb Chaabane, Puneesh Khanna, Suhail Mohmad, Slim Frikha, Shi Hu, Abdalgader Abubaker, Reda Alami, Mikhail Lubinets, Mohamed El Amine Seddik, Hakim Hacid

First: 2026-01-05T18:44:27+00:00 · Latest: 2026-01-05T18:44:27+00:00

Abstract

This work introduces Falcon-H1R, a 7B-parameter reasoning-optimized model that establishes the feasibility of achieving competitive reasoning performance with small language models (SLMs). Falcon-H1R stands out for its parameter efficiency, consistently matching or outperforming SOTA reasoning models that are 2x to 7x larger across a variety of reasoning-intensive benchmarks. These results underscore the importance of careful data curation and targeted training strategies (via both efficient SFT and RL scaling) in delivering significant performance gains without increasing model size. Furthermore, Falcon-H1R advances the 3D limits of reasoning efficiency by combining faster inference (through its hybrid-parallel architecture design), token efficiency, and higher accuracy. This unique blend makes Falcon-H1R-7B a practical backbone for scaling advanced reasoning systems, particularly in scenarios requiring extensive chain-of-thought generation and parallel test-time scaling. Leveraging the recently introduced DeepConf approach, Falcon-H1R achieves state-of-the-art test-time scaling efficiency, offering substantial improvements in both accuracy and computational cost. As a result, Falcon-H1R demonstrates that compact models, through targeted model training and architectural choices, can deliver robust and scalable reasoning performance.

Summary

Falcon-H1R is a 7B-parameter reasoning-optimized model that matches or outperforms SOTA reasoning models 2x to 7x larger across a variety of reasoning-intensive benchmarks. The authors attribute these gains to careful data curation and targeted training strategies (efficient SFT and RL scaling) rather than increased model size. The model combines faster inference from its hybrid-parallel architecture, token efficiency, and higher accuracy, which suits scenarios requiring extensive chain-of-thought generation and parallel test-time scaling. With the DeepConf approach, it achieves state-of-the-art test-time scaling efficiency, improving accuracy while reducing computational cost.

Robust Persona-Aware Toxicity Detection with Prompt Optimization and Learned Ensembling

Authors: Berk Atil, Rebecca J. Passonneau, Ninareh Mehrabi

First: 2026-01-05T18:32:45+00:00 · Latest: 2026-01-05T18:32:45+00:00

Abstract

Toxicity detection is inherently subjective, shaped by the diverse perspectives and social priors of different demographic groups. While "pluralistic" modeling as used in economics and the social sciences aims to capture perspective differences across contexts, current Large Language Model (LLM) prompting techniques have different results across different personas and base models. In this work, we conduct a systematic evaluation of persona-aware toxicity detection, showing that no single prompting method, including our proposed automated prompt optimization strategy, uniformly dominates across all model-persona pairs. To exploit complementary errors, we explore ensembling four prompting variants and propose a lightweight meta-ensemble: an SVM over the 4-bit vector of prompt predictions. Our results demonstrate that the proposed SVM ensemble consistently outperforms individual prompting methods and traditional majority-voting techniques, achieving the strongest overall performance across diverse personas. This work provides one of the first systematic comparisons of persona-conditioned prompting for toxicity detection and offers a robust method for pluralistic evaluation in subjective NLP tasks.

Summary

This work treats toxicity detection as inherently subjective, shaped by the perspectives of different demographic groups. A systematic evaluation of persona-aware prompting shows that no single prompting method, including the authors' automated prompt optimization strategy, uniformly dominates across all model-persona pairs. To exploit complementary errors, the authors ensemble four prompting variants with a lightweight meta-ensemble: an SVM over the 4-bit vector of prompt predictions. This SVM ensemble consistently outperforms individual prompting methods and traditional majority voting, achieving the strongest overall performance across diverse personas.

Vision-Language Reasoning for Geolocalization: A Reinforcement Learning Approach

Authors: Biao Wu, Meng Fang, Ling Chen, Ke Xu, Tao Cheng, Jun Wang

Venue: AAAI 2026

First: 2026-01-01T16:51:41+00:00 · Latest: 2026-01-05T18:27:19+00:00

Comments: Accepted to AAAI 2026. Project Page: https://github.com/aialt/geo-r

Abstract

Recent advances in vision-language models have opened up new possibilities for reasoning-driven image geolocalization. However, existing approaches often rely on synthetic reasoning annotations or external image retrieval, which can limit interpretability and generalizability. In this paper, we present Geo-R, a retrieval-free framework that uncovers structured reasoning paths from existing ground-truth coordinates and optimizes geolocation accuracy via reinforcement learning. We propose the Chain of Region, a rule-based hierarchical reasoning paradigm that generates precise, interpretable supervision by mapping GPS coordinates to geographic entities (e.g., country, province, city) without relying on model-generated or synthetic labels. Building on this, we introduce a lightweight reinforcement learning strategy with coordinate-aligned rewards based on Haversine distance, enabling the model to refine predictions through spatially meaningful feedback. Our approach bridges structured geographic reasoning with direct spatial supervision, yielding improved localization accuracy, stronger generalization, and more transparent inference. Experimental results across multiple benchmarks confirm the effectiveness of Geo-R, establishing a new retrieval-free paradigm for scalable and interpretable image geolocalization. To facilitate further research and ensure reproducibility, both the model and code will be made publicly available.

Summary

This paper introduces Geo-R, a retrieval-free framework for image geolocalization that optimizes geolocation accuracy via reinforcement learning. It proposes the Chain of Region, a rule-based hierarchical reasoning paradigm that maps GPS coordinates to geographic entities (e.g., country, province, city) to generate precise, interpretable supervision without model-generated or synthetic labels. A lightweight reinforcement learning strategy then uses coordinate-aligned rewards based on Haversine distance to refine predictions. Experiments across multiple benchmarks show improved localization accuracy, stronger generalization, and more transparent inference.
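
A distance-based reward of the kind described can be sketched as follows. The Haversine formula itself is standard; the exponential shaping, the `scale_km` parameter, and the function names are assumptions made for illustration, since the abstract does not give the exact reward definition.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_reward(pred, target, scale_km=1000.0):
    """Hypothetical shaping: 1.0 at zero error, decaying smoothly with distance."""
    return math.exp(-haversine_km(*pred, *target) / scale_km)


paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
reward_exact = distance_reward(paris, paris)
reward_near = distance_reward(london, paris)
```

A reward that shrinks smoothly with geographic error gives the policy a useful signal even when the predicted city is wrong but the region is right, which is presumably the point of calling the feedback "spatially meaningful".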

Diminishing Returns in Self-Supervised Learning

Authors: Oli Bridge, Huey Sun, Botond Branyicskai-Nagy, Charles D'Ornano, Shomit Basu

First: 2025-12-03T15:11:44+00:00 · Latest: 2026-01-05T18:17:53+00:00

Abstract

Transformer-based architectures have become a dominant paradigm in vision and language, but their success is often attributed to large model capacity and massive training data. In this work, we examine how self-supervised pre-training, intermediate fine-tuning, and downstream fine-tuning interact in a low-capacity regime, using a 5M-parameter Vision Transformer for semantic segmentation. Across multiple data scales, we find that masked image modeling pre-training and downstream fine-tuning reliably improve performance, but with clear diminishing returns as supervision increases. In contrast, inserting an intermediate classification fine-tuning stage consistently degrades downstream performance, with the largest drops occurring precisely where pre-training is most effective. Through an analysis of patch-level representation geometry, we show that classification-based intermediate supervision actively interferes with representations learned during pre-training by collapsing spatial structure critical for dense prediction. These results indicate that, in small models, the geometry of supervision matters more than the number of training stages: misaligned intermediate objectives can negate the benefits of pre-training rather than amplify them.

Summary

This study examines how self-supervised pre-training, intermediate fine-tuning, and downstream fine-tuning interact in a low-capacity regime, using a 5M-parameter Vision Transformer for semantic segmentation. Masked image modeling pre-training and downstream fine-tuning reliably improve performance, but with clear diminishing returns as supervision increases. Inserting an intermediate classification fine-tuning stage consistently degrades downstream performance, with the largest drops where pre-training is most effective. An analysis of patch-level representation geometry indicates that classification-based intermediate supervision collapses spatial structure that dense prediction depends on.

Estimating Text Temperature

Authors: Nikolay Mikhaylovskiy

First: 2026-01-05T18:09:41+00:00 · Latest: 2026-01-05T18:09:41+00:00

Abstract

Autoregressive language models typically use a temperature parameter at inference to shape the probability distribution and control the randomness of the generated text. After the text has been generated, this parameter can be estimated using a maximum likelihood approach. Building on this, we propose a procedure to estimate the temperature of any text, including text written by humans, with respect to a given language model. We evaluate the temperature estimation capability of a wide selection of small-to-medium LLMs. We then use the best-performing model, Qwen3 14B, to estimate temperatures of popular corpora.

Summary / 总结

The study aims to estimate the temperature parameter of generated text to control its randomness, which is typically used in autoregressive language models. The authors propose a method to estimate the temperature of any text, including human-written ones, relative to a given language model. They evaluate this capability using various small-to-medium language models and find that Qwen3 14B performs the best, successfully estimating the temperatures of popular corpora.

自回归语言模型在推理时使用温度参数来控制生成文本的随机性，而该参数可以在文本生成后通过最大似然方法估计。作者据此提出了一种方法，可以估计任何文本（包括人类撰写的文本）相对于给定语言模型的温度。作者评估了多种小型到中型语言模型的温度估计能力，并使用其中表现最佳的Qwen3 14B来估计流行语料库的温度。
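
The maximum-likelihood idea in this entry can be illustrated with a small sketch. Assuming per-position logits from the reference model are available, the estimate is the temperature T that maximizes the summed log-probability of the observed tokens under softmax(logits / T). The function names, the synthetic logits, and the golden-section search below are illustrative assumptions, not the paper's actual procedure.

```python
import math
import random

def log_likelihood(logits, tokens, beta):
    """Sum over positions of log softmax(beta * logits_t)[token_t], with beta = 1/T."""
    total = 0.0
    for row, tok in zip(logits, tokens):
        peak = max(row)  # subtract the max for numerical stability
        log_z = math.log(sum(math.exp(beta * (z - peak)) for z in row))
        total += beta * (row[tok] - peak) - log_z
    return total

def estimate_temperature(logits, tokens, t_min=0.05, t_max=5.0, iters=40):
    """Maximum-likelihood temperature via golden-section search over beta = 1/T.

    The log-likelihood is concave in beta, so the search converges to the
    maximizer inside [1/t_max, 1/t_min].
    """
    a, b = 1.0 / t_max, 1.0 / t_min
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if log_likelihood(logits, tokens, c) < log_likelihood(logits, tokens, d):
            a = c
        else:
            b = d
    return 2.0 / (a + b)  # midpoint in beta, converted back to T

def sample_tokens(logits, temperature, rng):
    """Draw one token per position from softmax(logits_t / temperature)."""
    tokens = []
    for row in logits:
        peak = max(row)
        weights = [math.exp((z - peak) / temperature) for z in row]
        tokens.append(rng.choices(range(len(row)), weights=weights)[0])
    return tokens

# Synthetic stand-in for model logits (hypothetical data): 1000 positions, vocabulary of 10.
rng = random.Random(0)
logits = [[rng.gauss(0.0, 2.0) for _ in range(10)] for _ in range(1000)]
tokens = sample_tokens(logits, 0.7, rng)
print(round(estimate_temperature(logits, tokens), 2))
```

The search runs over beta = 1/T because the log-likelihood is concave in beta, which makes a simple bracketing search sufficient. With a real model, the logits would come from a forward pass over the text being scored.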

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache

Authors: Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, Mao Yang

First: 2025-03-24T15:22:41+00:00 · Latest: 2026-01-05T18:08:27+00:00

Abstract

The growth of long-context Large Language Models (LLMs) significantly increases memory and bandwidth pressure during autoregressive decoding due to the expanding Key-Value (KV) cache. While accuracy-preserving KV-cache quantization (e.g., 4-bit or 2-bit) reduces memory footprint, existing systems decode inefficiently by relying solely on CUDA cores, underutilizing Tensor Cores-the dominant compute resource on GPUs. We present BitDecoding, the first inference system to efficiently decode low-bit KV caches by cooperatively leveraging CUDA cores and Tensor Cores. BitDecoding smartly induces Tensor-Core-friendly layouts, introduces warp-level dequantization parallelism, and provides unified system support through query transformation, high-performance tensor- and channel-wise quantization, and a software-pipelined dequantization kernel enabling mixed-precision execution. Architecture-aware optimizations further leverage Hopper's warpgroup tensor instructions and Blackwell's NVFP4 (MXFP4) tensor formats. Evaluated on Blackwell, Hopper, and Ampere GPUs, BitDecoding achieves an average 7.5x decoding speedup over FP16 FlashDecoding-v2, up to 8.6x on Blackwell with NVFP4, and up to 4.3x over state-of-the-art approaches. On LLaMA-3.1-8B with a 128K context, BitDecoding reduces single-batch decoding latency by 3x. BitDecoding is open-sourced at https://github.com/OpenBitSys/BitDecoding.

中文标题/摘要

标题：BitDecoding：利用张量核心解锁低比特KV缓存以支持长上下文LLM

长上下文大型语言模型（LLM）的增长在自回归解码过程中显著增加了内存和带宽压力，由于KV缓存的扩大。虽然保留准确性的KV缓存量化（例如4比特或2比特）可以减少内存占用，但现有系统通过仅依赖CUDA内核进行解码，未能充分利用张量核心——GPU上最主要的计算资源。我们提出了BitDecoding，这是首个通过协同利用CUDA内核和张量核心高效解码低比特KV缓存的推理系统。BitDecoding智能地诱导张量核心友好的布局，引入了波级去量化并行性，并通过查询转换、高性能张量和通道量化以及软件流水线去量化内核提供统一的系统支持，实现混合精度执行。针对Hopper架构的优化进一步利用了Hopper的波组张量指令和Blackwell的NVFP4（MXFP4）张量格式。在Blackwell、Hopper和Ampere GPU上评估，BitDecoding相对于FP16 FlashDecoding-v2平均实现了7.5倍的解码加速，在Blackwell上使用NVFP4时最多可达到8.6倍，相对于最先进的方法最多可达到4.3倍。在LLaMA-3.1-8B模型中，128K上下文的单批次解码延迟减少了3倍。BitDecoding已在https://github.com/OpenBitSys/BitDecoding开源。

Summary / 总结

BitDecoding addresses the memory and bandwidth pressure of decoding long-context LLMs with low-bit KV caches by cooperatively using CUDA cores and Tensor Cores. It introduces Tensor-Core-friendly layouts, warp-level dequantization parallelism, and a software-pipelined dequantization kernel for mixed-precision execution. Across Blackwell, Hopper, and Ampere GPUs, BitDecoding achieves an average 7.5x decoding speedup over FP16 FlashDecoding-v2, up to 8.6x on Blackwell with NVFP4, and up to 4.3x over state-of-the-art approaches. On LLaMA-3.1-8B with a 128K context, it reduces single-batch decoding latency by 3x.

BitDecoding通过协同利用CUDA核心和张量核心来解决长上下文LLM解码中的内存和带宽问题。它引入了张量核心友好的布局、warp级反量化并行和高性能量化技术。BitDecoding在解码速度上平均比FP16 FlashDecoding-v2快7.5倍，在Blackwell上使用NVFP4时最高达到8.6倍，并且可以将LLaMA-3.1-8B在128K上下文下的单批次解码延迟降低3倍。

DatBench: Discriminative, Faithful, and Efficient VLM Evaluations

Authors: Siddharth Joshi, Haoli Yin, Rishabh Adiga, Ricardo Monti, Aldo Carranza, Alex Fang, Alvin Deng, Amro Abbas, Brett Larsen, Cody Blakeney, Darren Teh, David Schwab, Fan Pan, Haakon Mongstad, Jack Urbanek, Jason Lee, Jason Telanoff, Josh Wills, Kaleigh Mentzer, Luke Merrick, Parth Doshi, Paul Burstein, Pratyush Maini, Scott Loftin, Spandan Das, Tony Jiang, Vineeth Dorna, Zhengping Wang, Bogdan Gaza, Ari Morcos, Matthew Leavitt

First: 2026-01-05T18:07:51+00:00 · Latest: 2026-01-05T18:07:51+00:00

Abstract

Empirical evaluation serves as the primary compass guiding research progress in foundation models. Despite a large body of work focused on training frontier vision-language models (VLMs), approaches to their evaluation remain nascent. To guide their maturation, we propose three desiderata that evaluations should satisfy: (1) faithfulness to the modality and application, (2) discriminability between models of varying quality, and (3) efficiency in compute. Through this lens, we identify critical failure modes that violate faithfulness and discriminability, misrepresenting model capabilities: (i) multiple-choice formats reward guessing, poorly reflect downstream use cases, and saturate early as models improve; (ii) blindly solvable questions, which can be answered without images, constitute up to 70% of some evaluations; and (iii) mislabeled or ambiguous samples compromise up to 42% of examples in certain datasets. Regarding efficiency, the computational burden of evaluating frontier models has become prohibitive: by some accounts, nearly 20% of development compute is devoted to evaluation alone. Rather than discarding existing benchmarks, we curate them via transformation and filtering to maximize fidelity and discriminability. We find that converting multiple-choice questions to generative tasks reveals sharp capability drops of up to 35%. In addition, filtering blindly solvable and mislabeled samples improves discriminative power while simultaneously reducing computational cost. We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets. Our work outlines a path toward evaluation practices that are both rigorous and sustainable as VLMs continue to scale.

中文标题/摘要

标题：DatBench：区分性、忠实性和高效性的VLM评估

经验性评估是指导基础模型研究进展的主要指南。尽管有大量的工作集中在训练前沿的视觉-语言模型（VLMs）上，但对其评估的方法仍处于初级阶段。为了促进其成熟，我们提出了评估应满足的三个标准：（1）忠实于模态和应用，（2）能够区分不同质量的模型，（3）计算效率。通过这一视角，我们识别出一些关键的失败模式，这些模式违反了忠实性和区分性，歪曲了模型的能力：（i）多项选择题奖励猜测，不能很好地反映下游使用场景，并且随着模型的改进而过早饱和；（ii）无需图像即可回答的“盲解”问题在某些评估中占比高达70%；（iii）错误标记或模棱两可的样本在某些数据集中影响了多达42%的样例。关于效率，评估前沿模型的计算负担已经变得难以承受：据一些说法，近20%的开发计算资源被用于评估。我们没有抛弃现有的基准，而是通过转换和筛选来优化它们，以最大化忠实性和区分性。我们发现，将多项选择题转换为生成任务可以揭示出高达35%的能力下降。此外，过滤掉可以不使用图像直接回答的问题和错误标记的样本可以提高区分能力，同时降低计算成本。我们发布了DatBench-Full，这是一个包含33个数据集的清理评估套件，涵盖了九种VLM能力，以及DatBench，这是一个区分性子集，实现了13倍的平均加速（最高可达50倍），同时与原始数据集的区分能力非常接近。我们的工作勾勒出一条路径，使评估实践在VLMs持续扩展的过程中既严谨又可持续。

Summary / 总结

The paper introduces DatBench, an evaluation suite for vision-language models (VLMs) built around three desiderata: faithfulness, discriminability, and efficiency. It identifies failure modes in existing evaluations: multiple-choice formats that reward guessing, blindly solvable questions that make up as much as 70% of some evaluations, and mislabeled or ambiguous samples that affect up to 42% of examples in certain datasets. The authors convert multiple-choice questions into generative tasks, which reveals capability drops of up to 35%, and filter out blindly solvable and mislabeled samples. The resulting DatBench-Full suite covers 33 datasets spanning nine VLM capabilities, while DatBench is a discriminative subset that achieves a 13x average speedup (up to 50x) while closely matching the discriminative power of the original datasets.

论文提出了DatBench，这是一个新的视觉-语言模型（VLM）评估套件，旨在满足三个关键要求：忠实性、可区分性和效率。作者指出了诸如选择题格式奖励猜测和标记错误样本等问题，这些都可能歪曲模型的能力。作者将选择题转换为生成任务，并过滤掉可直接解答和标记错误的样本，从而得到一个包含33个数据集的清理套件DatBench-Full，以及一个可区分子集DatBench，其平均加速13倍（最高可达50倍），同时保持与原始数据集相当的可区分能力。这项工作旨在指导VLM评估实践的成熟，并使其更加严谨和可持续。

Prithvi-Complimentary Adaptive Fusion Encoder (CAFE): unlocking full-potential for flood inundation mapping

Authors: Saurabh Kaushik, Lalit Maurya, Beth Tellman

Venue: WACV 2026

First: 2026-01-05T18:07:21+00:00 · Latest: 2026-01-05T18:07:21+00:00

Comments: Accepted at CV4EO Workshop @ WACV 2026

Abstract

Geo-Foundation Models (GFMs) have proven effective in diverse downstream applications, including semantic segmentation, classification, and regression tasks. However, when flood mapping on the Sen1Flood11 dataset is the downstream task, GFMs struggle to outperform the baseline U-Net, highlighting the models' limitation in capturing critical local nuances. To address this, we present the Prithvi-Complementary Adaptive Fusion Encoder (CAFE), which integrates the Prithvi GFM pretrained encoder with a parallel CNN residual branch enhanced by Convolutional Attention Modules (CAM). Prithvi-CAFE enables fast and efficient fine-tuning through adapters in Prithvi and performs multi-scale, multi-level fusion with CNN features, capturing critical local details while preserving long-range dependencies. We achieve state-of-the-art results on two comprehensive flood mapping datasets: Sen1Flood11 and FloodPlanet. On Sen1Flood11 test data, Prithvi-CAFE (IoU 83.41) outperforms the original Prithvi (IoU 82.50) and other major GFMs (TerraMind 82.90, DOFA 81.54, spectralGPT 81.02). The improvement is even more pronounced on the hold-out test site, where Prithvi-CAFE achieves an IoU of 81.37 compared to the baseline U-Net (70.57) and original Prithvi (72.42). On FloodPlanet, Prithvi-CAFE also surpasses the baseline U-Net and other GFMs, achieving an IoU of 64.70 compared to U-Net (60.14), TerraMind (62.33), DOFA (59.15) and Prithvi 2.0 (61.91). Our proposed simple yet effective Prithvi-CAFE demonstrates strong potential for improving segmentation tasks where multi-channel and multi-modal data provide complementary information and local details are critical. The code is released at https://github.com/Sk-2103/Prithvi-CAFE.

中文标题/摘要

标题：普瑞希维-互补自适应融合编码器（CAFE）：解锁全潜力洪水淹没制图

地理基础模型（GFMs）在多种下游应用中表现出色，包括语义分割、分类和回归任务。然而，在使用Sen1Flood11数据集进行洪水制图时，GFMs难以超越基线U-Net，显示出模型在捕捉关键局部细节方面的局限性。为解决这一问题，我们提出了普瑞希维-互补自适应融合编码器（CAFE），它将普瑞希维GFMs预训练编码器与通过卷积注意力模块（CAM）增强的并行CNN残差分支相结合。普瑞希维-CAFE通过普瑞希维中的适配器实现快速高效微调，并在多尺度、多级融合CNN特征的同时捕捉关键局部细节，保持长程依赖性。我们在两个全面的洪水制图数据集：Sen1Flood11和FloodPlanet上取得了最先进的结果。在Sen1Flood11测试数据上，普瑞希维-CAFE（IoU 83.41）优于原始普瑞希维（IoU 82.50）和其他主要GFMs（TerraMind 82.90，DOFA 81.54，spectralGPT：81.02）。在保留测试站点上，普瑞希维-CAFE的IoU为81.37，而基线U-Net为70.57，原始普瑞希维为72.42。在FloodPlanet上，普瑞希维-CAFE也超越了基线U-Net和其他GFMs，IoU为64.70，而U-Net为60.14，Terramind为62.33，DOFA为59.15，普瑞希维2.0为61.91。我们提出的简单而有效的普瑞希维-CAFE展示了在多通道和多模态数据提供互补信息且局部细节至关重要的分割任务中提高性能的强大潜力。代码发布在<https://github.com/Sk-2103/Prithvi-CAFE>。

Summary / 总结

The paper addresses the limitation of Geo-Foundation Models (GFMs) in flood mapping tasks by introducing the Prithvi-Complementary Adaptive Fusion Encoder (CAFE). Prithvi-CAFE integrates a pretrained Prithvi GFM encoder with a parallel CNN residual branch enhanced by Convolutional Attention Modules (CAM), enabling efficient adapter-based fine-tuning and multi-scale feature fusion. It outperforms existing GFMs and the U-Net baseline on both datasets (IoU 83.41 on Sen1Flood11 test data and 64.70 on FloodPlanet), with the largest gain on the Sen1Flood11 hold-out test site, where it reaches an IoU of 81.37 versus 70.57 for U-Net.

论文通过引入Prithvi-Complementary Adaptive Fusion Encoder (CAFE)来解决Geo-Foundation Models (GFMs)在洪水映射任务中的局限性。CAFE将预训练的Prithvi GFM编码器与增强的Convolutional Attention Modules (CAM)的CNN残差分支相结合，实现高效的微调和多尺度特征融合。在Sen1Flood11和FloodPlanet数据集上，CAFE在IoU指标上超越了现有模型和基线，特别是在保留的测试站点上，其性能显著优于U-Net。

Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

Authors: Sourena Khanzadeh

First: 2026-01-05T18:05:29+00:00 · Latest: 2026-01-05T18:05:29+00:00

Abstract

As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While Chain-of-Thought (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are faithful generative drivers of the model's output or merely post-hoc rationalizations. We introduce Project Ariadne, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs hard interventions ($do$-calculus) on intermediate reasoning nodes -- systematically inverting logic, negating premises, and reversing factual claims -- to measure the Causal Sensitivity ($φ$) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent Faithfulness Gap. We define and detect a widespread failure mode termed Causal Decoupling, where agents exhibit a violation density ($ρ$) of up to $0.77$ in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as "Reasoning Theater" while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.

中文标题/摘要

标题：项目阿里阿德涅：一种结构因果框架，用于审计大型语言模型代理的忠实性

随着大型语言模型（LLM）代理在高风险自主决策任务中的应用日益增多，其推理过程的透明度已成为一个关键的安全问题。虽然“思维链”（CoT）提示允许代理生成可读的推理轨迹，但尚不清楚这些轨迹是否是模型输出的忠实生成驱动因素，还是仅仅事后合理化。我们提出了“项目阿里阿德涅”，这是一种新颖的可解释性人工智能（XAI）框架，利用结构因果模型（SCMs）和反事实逻辑来审计代理推理的因果完整性。与依赖表面文本相似性的现有可解释性方法不同，项目阿里阿德涅通过对中间推理节点进行硬干预（$do$-计算）——系统地反转逻辑、否定前提和逆转事实声明——来测量终端答案的因果敏感性（$φ$）。我们对最先进的模型的实证评估揭示了一个持续存在的“忠实性差距”。我们定义并检测了一种普遍存在的失败模式，称为“因果脱耦”，其中代理在事实和科学领域中的因果脱耦密度（$ρ$）高达0.77。在这些情况下，尽管内部逻辑矛盾，代理仍得出相同的结论，证明其推理轨迹充当“推理剧场”，而决策则由潜在参数先验控制。我们的研究结果表明，当前的代理架构本质上容易产生不忠实的解释，并提出了阿里阿德涅分数作为新的基准，以使声明的逻辑与模型行为保持一致。

Summary / 总结

Project Ariadne introduces a framework for auditing whether the reasoning traces of Large Language Model (LLM) agents faithfully drive their outputs. Using Structural Causal Models and counterfactual logic, it applies hard interventions to intermediate reasoning nodes and measures the causal sensitivity of the final answer. The evaluation reveals a persistent Faithfulness Gap and a failure mode termed Causal Decoupling, with a violation density of up to 0.77 in factual and scientific domains: agents reach identical conclusions despite contradictory internal logic, which indicates post-hoc rationalization rather than faithful reasoning. The authors conclude that current agentic architectures are prone to unfaithful explanation and propose the Ariadne Score as a benchmark for aligning stated logic with model action.

Project Ariadne 提出了一种新的框架来审计大型语言模型（LLM）代理推理过程的忠实性。通过使用结构因果模型和反事实逻辑，它衡量模型输出对硬干预的因果敏感性。评估显示存在显著的‘忠实性差距’，并识别了一种广泛的‘因果脱耦’故障模式，即代理在内部逻辑矛盾的情况下仍能达到相同的结论，表明它们的推理痕迹往往是事后合理化而非忠实驱动模型输出的原因。

Placement Semantics for Distributed Deep Learning: A Systematic Framework for Analyzing Parallelism Strategies

Authors: Deep Pankajbhai Mehta

First: 2026-01-05T18:01:38+00:00 · Latest: 2026-01-05T18:01:38+00:00

Comments: 8 pages, 3 tables

Abstract

Training large language models requires distributing computation across many accelerators, yet practitioners select parallelism strategies (data, tensor, pipeline, ZeRO) through trial and error because no unified systematic framework predicts their behavior. We introduce placement semantics: each strategy is specified by how it places four training states (parameters, optimizer, gradients, activations) across devices using five modes (replicated, sharded, sharded-with-gather, materialized, offloaded). From placement alone, without implementation details, we derive memory consumption and communication volume. Our predictions match published results exactly: ZeRO-3 uses 8x less memory than data parallelism at 1.5x communication cost, as reported in the original paper. We prove two conditions (gradient integrity, state consistency) are necessary and sufficient for distributed training to match single-device results, and provide composition rules for combining strategies safely. The framework unifies ZeRO Stages 1-3, Fully Sharded Data Parallel (FSDP), tensor parallelism, and pipeline parallelism as instances with different placement choices.

Summary / 总结

This paper addresses the trial-and-error selection of parallelism strategies for distributed deep learning by introducing placement semantics. Each strategy is specified by how it places four training states (parameters, optimizer, gradients, activations) across devices using five modes, and memory consumption and communication volume are derived from placement alone. The predictions reproduce published results, for example that ZeRO-3 uses 8x less memory than data parallelism at 1.5x the communication cost. The paper also proves that gradient integrity and state consistency are necessary and sufficient for distributed training to match single-device results, and it unifies ZeRO Stages 1-3, FSDP, tensor parallelism, and pipeline parallelism as instances with different placement choices.

论文通过引入放置语义来解决分布式深度学习中选择并行策略的挑战。该框架基于训练状态在设备上的放置方式来指定每个策略，并在不涉及实现细节的情况下推导出内存消耗和通信量。关键发现包括ZeRO-3的内存使用量仅为数据并行的八分之一，而通信成本为其1.5倍。该框架还统一了各种并行策略，并证明了两个必要且充分的条件，以确保分布式训练能够与单设备的结果一致。
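
As a rough illustration of deriving memory from placement alone, the sketch below uses the per-parameter byte counts commonly quoted for mixed-precision Adam training (2 bytes for fp16 parameters, 2 for fp16 gradients, 12 for optimizer state) and divides a state by the device count when it is sharded. The dictionary names, the two-mode simplification, and the choice of 8 devices are assumptions for illustration. Activations are left out, and the paper's own five-mode formalism is not reproduced here.

```python
# Bytes per parameter for mixed-precision Adam (assumed accounting:
# fp16 params, fp16 grads, fp32 master params + momentum + variance).
BYTES_PER_PARAM = {"parameters": 2, "gradients": 2, "optimizer": 12}

# Hypothetical placement tables: each training state is either replicated
# on every device or sharded evenly across devices.
DATA_PARALLEL = {"parameters": "replicated", "gradients": "replicated", "optimizer": "replicated"}
ZERO_1 = {"parameters": "replicated", "gradients": "replicated", "optimizer": "sharded"}
ZERO_2 = {"parameters": "replicated", "gradients": "sharded", "optimizer": "sharded"}
ZERO_3 = {"parameters": "sharded", "gradients": "sharded", "optimizer": "sharded"}

def per_device_bytes(num_params, num_devices, placement):
    """Model-state memory per device implied by a placement (activations excluded)."""
    total = 0.0
    for state, bytes_per_param in BYTES_PER_PARAM.items():
        full = num_params * bytes_per_param
        total += full / num_devices if placement[state] == "sharded" else full
    return total

num_params, num_devices = 7_500_000_000, 8
strategies = [("data parallel", DATA_PARALLEL), ("ZeRO-1", ZERO_1),
              ("ZeRO-2", ZERO_2), ("ZeRO-3", ZERO_3)]
for name, placement in strategies:
    gib = per_device_bytes(num_params, num_devices, placement) / 2**30
    print(f"{name}: {gib:.1f} GiB per device")
```

Under this accounting, full sharding reduces model-state memory by a factor equal to the device count, so an 8x reduction corresponds to 8 devices in this sketch. The 1.5x communication figure is not modeled here.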

Tales of the 2025 Los Angeles Fire: Hotwash for Public Health Concerns in Reddit via LLM-Enhanced Topic Modeling

Authors: Sulong Zhou, Qunying Huang, Shaoheng Zhou, Yun Hang, Xinyue Ye, Aodong Mei, Kathryn Phung, Yuning Ye, Uma Govindswamy, Zehan Li

First: 2025-05-14T16:31:08+00:00 · Latest: 2026-01-05T18:01:24+00:00

Comments: Fix typos in Method Section. Add data/code availability

Abstract

Wildfires have become increasingly frequent, irregular, and severe in recent years. Understanding how affected populations perceive and respond during wildfire crises is critical for timely and empathetic disaster response. Social media platforms offer a crowd-sourced channel to capture evolving public discourse, providing hyperlocal information and insight into public sentiment. This study analyzes Reddit discourse during the 2025 Los Angeles wildfires, spanning from the onset of the disaster to full containment. We collect 385 posts and 114,879 comments related to the Palisades and Eaton fires. We adopt topic modeling methods to identify the latent topics, enhanced by large language models (LLMs) and human-in-the-loop (HITL) refinement. Furthermore, we develop a hierarchical framework to categorize latent topics, consisting of two main categories, Situational Awareness (SA) and Crisis Narratives (CN). The volume of the SA category closely aligns with real-world fire progression, peaking within the first 2-5 days as the fires reach their maximum extent. The most frequent co-occurring category set (public health and safety, loss and damage, and emergency resources) covers a wide range of health-related latent topics, including environmental health, occupational health, and one health. Grief signals and mental health risks consistently accounted for 60 percent and 40 percent of CN instances, respectively, with the highest total volume occurring at night. This study contributes the first annotated social media dataset on the 2025 LA fires, and introduces a scalable multi-layer framework that leverages topic modeling for crisis discourse analysis. By identifying persistent public health concerns, our results can inform more empathetic and adaptive strategies for disaster response, public health communication, and future research in comparable climate-related disaster events.

中文标题/摘要

标题：2025年洛杉矶火灾纪事：基于LLM增强主题建模的Reddit公共健康关切事后复盘

近年来，野火的发生越来越频繁、不规则且严重。理解受灾人群在野火危机期间的感知和应对方式对于及时和富有同情心的灾害响应至关重要。社交媒体平台提供了众包渠道，用于捕捉不断演变的公众讨论，提供超本地信息和公众情绪的洞察。本研究分析了2025年洛杉矶野火期间的Reddit讨论，从灾难开始到完全扑灭。我们收集了与帕利塞德斯和伊顿火灾相关的385个帖子和114,879条评论。我们采用主题建模方法来识别潜在主题，这些方法通过大型语言模型（LLMs）和人工在环（HITL）改进。此外，我们开发了一个分层框架来分类潜在主题，包括两个主要类别：情况意识（SA）和危机叙事（CN）。SA类别的规模与实际火灾进程高度一致，在火灾达到最大范围的前2-5天内达到峰值。公共健康和安全、损失和损害以及应急资源的最频繁共现类别集扩展了广泛范围的健康相关潜在主题，包括环境健康、职业健康和人与健康。悲伤信号和心理健康风险分别占CN实例的60%和40%，总规模最高出现在夜间。本研究提供了第一个关于2025年洛杉矶火灾的标注社交媒体数据集，并引入了一种利用主题建模的可扩展多层框架，用于危机讨论分析。通过识别持续存在的公共健康关切，我们的结果可以为灾害响应、公共卫生沟通和未来类似气候相关灾害事件的研究提供更具同情心和适应性的策略。

Grounded Test-Time Adaptation for LLM Agents

Authors: Arthur Chen, Zuxin Liu, Jianguo Zhang, Akshara Prabhakar, Zhiwei Liu, Shelby Heinecke, Silvio Savarese, Victor Zhong, Caiming Xiong

First: 2025-11-06T22:24:35+00:00 · Latest: 2026-01-05T17:43:48+00:00

Comments: Our code is available here: https://github.com/r2llab/GTTA

Abstract

Large language model (LLM)-based agents struggle to generalize to novel and complex environments, such as unseen websites or new sets of functions, due to a fundamental mismatch between their pre-training and test-time conditions. This challenge stems from two distinct failure modes: a syntactic misunderstanding of environment-specific components like observation formats, and a semantic misunderstanding of state-transition dynamics, which are only revealed at test time. To address these issues, we propose two distinct and complementary strategies for adapting LLM agents by leveraging environment-specific information available during deployment. First, an online distributional adaptation method parameterizes environmental nuances by learning a lightweight adaptation vector that biases the model's output distribution, enabling rapid alignment with an environment response format. Second, a deployment-time dynamics grounding method employs a persona-driven exploration phase to systematically probe and learn the environment's causal dynamics before task execution, equipping the agent with a nonparametric world model. We evaluate these strategies across diverse agentic benchmarks, including function calling and web navigation. Our empirical results show the effectiveness of both strategies across all benchmarks with minimal computational cost. We find that dynamics grounding is particularly effective in complex environments where unpredictable dynamics pose a major obstacle, demonstrating a robust path toward more generalizable and capable LLM-based agents. For example, on the WebArena multi-site split, this method increases the agent's success rate from 2% to 23%.

中文标题/摘要

标题：面向大语言模型（LLM）代理的接地式测试时适应

基于大语言模型（LLM）的代理在处理新颖和复杂的环境时难以泛化，例如未见过的网站或新的功能集，这是因为它们的预训练条件与测试条件之间存在根本性的不匹配。这一挑战源自两种不同的失败模式：对环境特定组件如观察格式的句法误解，以及对状态转换动力学的语义误解，这些动力学仅在测试时才显现。为了解决这些问题，我们提出了两种不同的且互补的策略，通过利用部署期间可用的环境特定信息来适应LLM代理。首先，一种在线分布适应方法通过学习一个轻量级的适应向量来参数化环境的细微差别，该向量偏置模型的输出分布，从而实现快速与环境响应格式的对齐。其次，一种部署时动力学接地方法采用基于人设的探索阶段系统地探查和学习环境的动力学，为执行任务前提供一个非参数化的世界模型。我们在多种代理基准测试中评估了这些策略，包括函数调用和网页导航。我们的实验证明了这两种策略在所有基准测试中的有效性，且计算成本较低。我们发现，动力学接地在复杂环境中特别有效，这些环境中不可预测的动力学构成了重大障碍，展示了通向更泛化和能力更强的LLM代理的稳健路径。例如，在WebArena多站点分割中，这种方法将代理的成功率从2%提高到23%。

Summary / 总结

This paper addresses the failure of large language model (LLM) agents to generalize to novel environments by proposing two complementary adaptation strategies. The first, online distributional adaptation, learns a lightweight adaptation vector that biases the model's output distribution toward the environment's response format. The second, deployment-time dynamics grounding, uses a persona-driven exploration phase to learn the environment's causal dynamics before task execution, providing a nonparametric world model. Experiments on function-calling and web-navigation benchmarks show that both methods are effective at minimal computational cost, with dynamics grounding proving especially beneficial in complex environments: on the WebArena multi-site split, it raises the agent's success rate from 2% to 23%.

本文提出了两种适应策略来解决大型语言模型（LLM）代理无法泛化到新环境的问题。第一个策略是在线分布适应，通过轻量级的适应向量对齐模型的输出分布与环境的响应格式。第二个策略是部署时动力学接地，通过个性化的探索阶段学习环境的动力学，提供一个非参数的世界模型。实验表明，这两种方法在各种基准测试中都表现出色，且计算成本较低，特别是在复杂环境中，动力学接地特别有效，将代理的成功率从2%提高到23%。

SortWaste: A Densely Annotated Dataset for Object Detection in Industrial Waste Sorting

Authors: Sara Inácio, Hugo Proença, João C. Neves

First: 2026-01-05T17:34:50+00:00 · Latest: 2026-01-05T17:34:50+00:00

Comments: 9 pages

Abstract

The increasing production of waste, driven by population growth, has created challenges in managing and recycling materials effectively. Manual waste sorting is a common practice; however, it remains inefficient for handling large-scale waste streams and presents health risks for workers. On the other hand, existing automated sorting approaches still struggle with the high variability, clutter, and visual complexity of real-world waste streams. The lack of real-world datasets for waste sorting is a major reason automated systems for this problem are underdeveloped. Accordingly, we introduce SortWaste, a densely annotated object detection dataset collected from a Material Recovery Facility. Additionally, we contribute to standardizing waste detection in sorting lines by proposing ClutterScore, an objective metric that gauges the scene's hardness level using a set of proxies that affect visual complexity (e.g., object count, class and size entropy, and spatial overlap). In addition to these contributions, we provide an extensive benchmark of state-of-the-art object detection models, detailing their results with respect to the hardness level assessed by the proposed metric. Despite achieving promising results (mAP of 59.7% in the plastic-only detection task), performance significantly decreases in highly cluttered scenes. This highlights the need for novel and more challenging datasets on the topic.

中文标题/摘要

标题：SortWaste：工业废弃物分类中的密集标注数据集

随着人口增长导致的废弃物产量增加，有效管理和回收材料的挑战也随之而来。手工废弃物分类是一种常见做法，但处理大规模废弃物流仍效率低下，并且对工人健康构成风险。另一方面，现有的自动化分类方法仍然难以应对实际废弃物流中的高变异性、杂乱和视觉复杂性。缺乏实际的废弃物分类数据集是导致此类问题的自动化系统发展不足的主要原因之一。因此，我们介绍了SortWaste，这是一个从材料回收设施收集的密集标注的目标检测数据集。此外，我们通过提出ClutterScore，一种使用影响视觉复杂度的代理（例如物体数量、类别和大小熵以及空间重叠）来衡量场景难度水平的客观指标，为分类线上的废弃物检测标准化做出了贡献。除了这些贡献，我们还提供了最先进的目标检测模型的广泛基准测试，详细说明了它们在所提出的指标评估的难度水平下的结果。尽管在仅塑料检测任务中取得了令人鼓舞的结果（mAP为59.7%），但在高度杂乱的场景中性能显著下降。这突显了在该主题上需要新颖且更具挑战性的数据集的需求。

Summary / 总结

The paper introduces SortWaste, a densely annotated object detection dataset collected from a Material Recovery Facility, motivated by the inefficiency and health risks of manual sorting and the limitations of existing automated systems. It also proposes ClutterScore, an objective metric that gauges scene hardness from proxies of visual complexity such as object count, class and size entropy, and spatial overlap. A benchmark of state-of-the-art object detectors reaches an mAP of 59.7% on the plastic-only detection task, but performance decreases significantly in highly cluttered scenes, indicating the need for more challenging datasets in this domain.

该论文介绍了SortWaste，一个用于工业废弃物分类中目标检测的密集标注数据集，旨在解决手工分类的低效性和健康风险问题。它提出了ClutterScore，一个客观指标来评估场景的难度，并对最先进的目标检测模型进行了基准测试，结果显示在高度杂乱的场景中性能显著下降，强调了该领域需要更具挑战性的数据集。

MIND Your Reasoning: A Meta-Cognitive Intuitive-Reflective Network for Dual-Reasoning in Multimodal Stance Detection

Authors: Bingbing Wang, Zhengda Jin, Bin Liang, Wenjie Li, Jing Li, Ruifeng Xu, Min Zhang

First: 2025-11-08T15:56:24+00:00 · Latest: 2026-01-05T17:33:44+00:00

Abstract

Multimodal Stance Detection (MSD) is a crucial task for understanding public opinion on social media. Existing methods predominantly operate by learning to fuse modalities. They lack an explicit reasoning process to discern how inter-modal dynamics, such as irony or conflict, collectively shape the user's final stance, leading to frequent misjudgments. To address this, we advocate for a paradigm shift from "learning to fuse" to "learning to reason". We introduce MIND, a Meta-cognitive Intuitive-reflective Network for Dual-reasoning. Inspired by the dual-process theory of human cognition, MIND operationalizes a self-improving loop. It first generates a rapid, intuitive hypothesis by querying evolving Modality and Semantic Experience Pools. Subsequently, a meta-cognitive reflective stage uses Modality-CoT and Semantic-CoT to scrutinize this initial judgment, distill superior adaptive strategies, and evolve the experience pools themselves. These dual experience structures are continuously refined during training and recalled at inference to guide robust and context-aware stance decisions. Extensive experiments on the MMSD benchmark demonstrate that our MIND significantly outperforms most baseline models and exhibits strong generalization.

中文标题/摘要

标题：MIND你的推理：一种元认知直觉反思网络在多模态立场检测中的双重推理

多模态立场检测（MSD）是理解社交媒体上公众意见的关键任务。现有方法主要通过学习融合模态来运作，缺乏明确的推理过程来区分模态间动态（如讽刺或冲突）如何共同塑造用户的最终立场，导致频繁的误判。为解决这一问题，我们提倡从“学习融合”转向“学习推理”的范式转变。我们引入了MIND，一种用于双重推理的元认知直觉反思网络。受人类认知的双重过程理论启发，MIND 实现了一个自我改进的循环。首先，通过查询不断变化的模态和语义经验池生成快速的直觉假设。随后，元认知反思阶段使用模态-CoT 和语义-CoT 来审查初始判断，提炼出更优的适应性策略，并进化经验池本身。这些双重经验结构在训练中不断精炼，并在推理时被召回以指导稳健且上下文相关的立场决策。在 MMSD 基准上的广泛实验表明，我们的 MIND 显著优于大多数基线模型，并表现出强大的泛化能力。

Summary / 总结

The paper addresses the limitations of existing multimodal stance detection methods by proposing MIND, a meta-cognitive intuitive-reflective network. MIND introduces a dual-reasoning process where an initial intuitive hypothesis is generated and then scrutinized by a meta-cognitive reflective stage. This process continuously refines experience pools during training and guides stance decisions at inference. Experiments show that MIND outperforms most baseline models and demonstrates strong generalization.

研究旨在通过解决现有方法中缺乏明确推理的问题，提高多模态立场检测的性能。引入了MIND，一种元认知直觉反思网络，增强推理过程。它通过直觉阶段生成初始假设，然后通过元认知反思阶段对其进行细化，不断进化其经验池。实验表明，MIND显著优于大多数基线模型，并且具有很强的泛化能力。

Anytime-Valid Answer Sufficiency Certificates for LLM Generation via Sequential Information Lift

Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

First: 2025-10-07T21:28:53+00:00 · Latest: 2026-01-05T17:33:34+00:00

Abstract

We introduce Sequential-EDFL (Empirical Dynamic Formal Lift), which applies anytime-valid sequential testing to language model generation stopping. Our approach tracks information lift, defined as the log-likelihood ratio between the full model and deliberately weakened "skeleton" baselines, using self-normalized empirical-Bernstein e-processes that provide formal delta-level error control regardless of stopping time. This delta guarantee controls premature stopping when information lift is insufficient relative to the skeleton, and it does not imply delta control of factual incorrectness or hallucinations. We handle unknown centering through online mean estimation, combine multiple parameters via mixture e-processes, and support adaptive resets under distributional drift. On six benchmarks, Sequential-EDFL reduces generation length by 22 to 28 percent relative to sequential baselines while maintaining delta-level control with 12 percent computational overhead. We introduce automated skeletons (distilled submodels and randomized logits) and show robustness across skeleton families. Composing EDFL with a lightweight correctness gate (sentence boundaries plus a verifier) improves end-task correctness while preserving anytime-valid guarantees by only delaying stopping. Our certificates control information sufficiency, not factual correctness. Specifically, 10.9 percent of stopped sequences remain incorrect even with the gate (13.2 to 22.7 percent without it). EDFL serves as a first-stage filter that can reduce verification burden: when applied to stopped sequences, the gate validates 83 percent of stops, requiring full verification only for the remaining 17 percent, plus all non-stopped sequences. EDFL is not a standalone solution for safety-critical domains.

中文标题/摘要

标题：LLM生成中的任意时序有效答案充分性证书通过顺序信息提升

我们引入了Sequential-EDFL（经验动态形式提升），它将任意时序测试应用于语言模型生成停止。我们的方法跟踪信息提升，定义为完整模型与故意削弱的“骨架”基线之间的对数似然比，使用自规范化经验伯恩斯坦e过程，无论停止时间如何，都能提供形式化的delta级误差控制。这种delta保证在信息提升不足以相对于骨架时控制提前停止，但它不意味着delta控制事实错误或幻觉。我们通过在线均值估计处理未知中心化，通过混合e过程结合多个参数，并在分布漂移下支持自适应重置。在六个基准测试中，Sequential-EDFL相对于顺序基线将生成长度减少了22%到28%，同时以12%的计算开销维持delta级控制。我们引入了自动化骨架（提炼子模型和随机化logits），并在不同骨架家族中展示了鲁棒性。将EDFL与轻量级正确性门（句子边界加上验证器）结合使用，可以提高最终任务的正确性，同时保留任意时序有效的保证，仅延迟停止。我们的证书控制信息充分性，而不是事实正确性。具体来说，即使有门控，仍有10.9%的停止序列保持错误（没有门控时为13.2%到22.7%）。当应用于停止序列时，门控验证83%的停止，仅对剩余17%的停止和所有未停止序列进行完整验证。EDFL不是安全关键领域中的独立解决方案。

Summary / 总结

Sequential-EDFL applies anytime-valid sequential testing to decide when language model generation can stop, tracking information lift against weakened skeleton baselines with formal delta-level error control. On six benchmarks, it reduces generation length by 22 to 28 percent relative to sequential baselines while maintaining delta-level control with 12 percent computational overhead. Automated skeletons and a lightweight correctness gate improve end-task correctness while preserving anytime-valid guarantees. The certificates control information sufficiency rather than factual correctness: 10.9 percent of stopped sequences remain incorrect even with the gate.

Sequential-EDFL 通过应用任意时间有效的序列测试来控制语言模型生成的停止时间，使用自归一化经验-Bernstein e-过程跟踪信息提升。在六个基准上，它相对于序列基线将生成长度减少了22%到28%，计算开销为12%，且保持了delta级控制。自动骨架和正确性门控可以提高最终任务的正确性。即使有门控，仍有10.9%的停止序列是错误的（没有门控时为13.2%到22.7%）；门控可以验证83%的停止，从而减少全面验证的需求。

Power-of-Two Quantization-Aware-Training (PoT-QAT) in Large Language Models (LLMs)

Authors: Mahmoud Elgenedy

First: 2026-01-05T17:33:16+00:00 · Latest: 2026-01-05T17:33:16+00:00

Abstract

In Large Language Models (LLMs), the number of parameters has grown exponentially in the past few years, e.g., from 1.5 billion parameters in GPT-2 to 175 billion in GPT-3 to possibly more than a trillion in later versions. This raises a significant challenge for implementation, especially for Edge devices. Unlike cloud computing, memory and processing power for Edge devices are very limited, which necessitates developing novel ideas to make such applications feasible. In this work, we investigate compressing weights with a special quantization that limits numbers to only power-of-two (PoT). This helps save a huge amount of memory as only exponents need to be stored; more importantly, it significantly reduces processing cost by replacing costly multiplication with low-cost bit shifting. To overcome performance loss due to this strict quantization, we investigate Quantization Aware Training (QAT) to enhance performance through additional training. Results on GPT-2 124M show a major enhancement for the quantized PoT model after additional training, with a perplexity enhancement of 66% and a BERT-Score loss relative to baseline GPT-2 of 1%. The memory saving is estimated to be 87.5%, while inference is expected to be 3-10x faster with PoT quantization than with full precision.

中文标题/摘要

标题：大型语言模型（LLMs）中的幂次量化感知训练（PoT-QAT）

在大型语言模型（LLMs）中，过去几年参数的数量呈指数增长，例如，从GPT-2的15亿参数到GPT-3的1750亿参数，再到更高版本可能超过万亿参数。这给实施带来了重大挑战，尤其是对于边缘设备。与云计算不同，边缘设备的内存和处理能力非常有限，这需要开发新的想法来使此类应用可行。在本文中，我们研究了一种特殊的量化方法，将数字限制为仅2的幂次（PoT），这有助于节省大量内存，因为只需存储指数；更重要的是，通过用低成本的位移操作替换昂贵的乘法，可以显著降低计算开销。为了克服由于这种严格的量化而导致的性能损失，我们研究了量化感知训练（QAT）以通过额外训练提高性能。在GPT-2 124M上的结果表明，在额外训练后，量化PoT模型有重大改进，困惑度改善了66%，BERT-Score相对于基线GPT-2的损失为1%。估计内存节省为87.5%，而使用PoT量化进行推理的速度预计比全精度快3-10倍。

Summary / 总结

This paper explores Power-of-Two Quantization-Aware-Training (PoT-QAT), which compresses the weights of Large Language Models (LLMs) by restricting them to power-of-two values, reducing memory use and replacing multiplications with bit shifts. Additional training mitigates the performance loss caused by this strict quantization. Experiments on GPT-2 124M show a 66% perplexity improvement after the additional training and a 1% BERT-Score loss relative to the baseline, with an estimated 87.5% memory saving and an expected 3-10x inference speedup over full precision.

本文探讨了使用Power-of-Two Quantization-Aware-Training (PoT-QAT) 来压缩大型语言模型（LLMs）的权重，通过将数值限制为幂次方值，显著减少了内存使用和处理能力。该方法包括额外的训练以缓解由于严格量化而导致的性能损失。实验表明，与基线模型相比，GPT-2 124M 的困惑度提高了66%，BERT-Score损失为1%，内存节省估计为87.5%，推理速度预计比全精度模型快3-10倍。
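
As a rough illustration of the power-of-two idea in this entry, the sketch below rounds a weight to a signed power of two, multiplies by a power of two using only bit shifts, and computes the memory saving for a given code width. The exponent range, the 4-bit code width (1 sign bit plus a 3-bit exponent), and all function names are assumptions made for this example and are not taken from the paper. Under that assumption, going from 32 bits to 4 bits per weight gives 1 - 4/32 = 87.5%, which agrees arithmetically with the quoted estimate.

```python
import math

def pot_quantize(w, min_exp=-7, max_exp=0):
    """Round w to the nearest signed power of two (nearest in the log2 domain).

    min_exp and max_exp are hypothetical clipping bounds: eight exponent
    values fit in 3 bits, plus 1 sign bit. Zero is passed through unchanged;
    a real storage format would have to reserve a code for it.
    """
    if w == 0.0:
        return 0.0
    sign = 1.0 if w > 0 else -1.0
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))
    return sign * 2.0 ** exp

def shift_multiply(x, exp):
    """Multiply the integer x by 2**exp using only bit shifts."""
    return x << exp if exp >= 0 else x >> -exp

def memory_saving(bits_quantized, bits_full=32):
    """Fraction of memory saved when each weight shrinks to bits_quantized bits."""
    return 1.0 - bits_quantized / bits_full

print(pot_quantize(0.3))     # nearest power of two in the log domain
print(shift_multiply(12, 2)) # 12 * 2**2 computed with a left shift
print(memory_saving(4))      # saving for the assumed 4-bit code
```

Rounding in the log2 domain is one possible design choice; rounding to the nearest power of two in the linear domain would give slightly different thresholds.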

Language as a Wave Phenomenon: Iso-Energetic Phase-Locking and Semantic Interference in Neural Networks

Authors: Alper Yıldırım, İbrahim Yücedağ

First: 2025-12-01T02:46:15+00:00 · Latest: 2026-01-05T17:26:51+00:00

Comments: Major Revision. Title changed to reflect the new theoretical framework. Complete narrative shift from "Optimization Efficiency" to "Iso-Energetic Phase Coding" and "Optical Hardware Compatibility". Replaced ISMR diagnostics with Holographic Optical Learning simulations and mechanistic "Dual-Regime" phase analysis. Comparison with spectral baselines (FNet) added

Abs · PDF · Code1 · Code2

Abstract

Conventional deep learning paradigms rely on metabolically expensive magnitude-based representations, rendering them fundamentally incompatible with passive photonic hardware. We introduce PRISM, a sequence modeling architecture that bridges high-level reasoning and physical constraints by enforcing an Iso-Energetic (Unity Gain) principle, compelling the network to encode semantic information exclusively in the phase angle. Validated on the WMT14 translation benchmark, PRISM achieves a 0.799 COMET score, demonstrating that phase-based reasoning competes with standard Transformers (0.821) and functionally matches unconstrained spectral baselines like FNet (0.805), despite enforcing strict energy constraints and requiring 11.5% fewer parameters. Furthermore, to verify hardware feasibility, we simulate a Holographic Backpropagation mechanism on a noisy, 4-bit optical correlator. Ablation studies reveal a substantial performance gain (48.4% vs. 62.4%) over a frozen baseline, proving that the proposed phase-steering mechanism actively optimizes physical parameters under strict energy constraints. These results establish an existence proof that ultra-low-power, passive optical hardware can support high-level linguistic intelligence without sacrificing representational capacity.

中文标题/摘要

标题：语言作为一种波现象：等能相位锁定与神经网络中的语义干扰

传统的深度学习范式依赖于代谢昂贵的幅度表示，使其与被动光子硬件从根本上不兼容。我们引入了PRISM，这是一种序列建模架构，通过强制执行等能（单位增益）原则，迫使网络仅在相位角中编码语义信息，从而将高层推理与物理约束联系起来。在WMT14翻译基准上验证，PRISM获得了0.799的COMET分数，表明基于相位的推理可与标准Transformer（0.821）竞争，并且在功能上与未受约束的频谱基线FNet（0.805）相当，尽管它施加了严格的能量约束且参数减少了11.5%。此外，为了验证硬件可行性，我们在含噪声的4位光学相关器上模拟了全息反向传播机制。消融研究显示，与冻结基线相比有显著的性能提升（48.4% 对 62.4%），证明了所提出的相位控制机制在严格能量约束下主动优化了物理参数。这些结果给出了一个存在性证明：超低功耗的被动光学硬件可以在不牺牲表示能力的情况下支持高级语言智能。

Causal Consistency Regularization: Training Verifiably Sensitive Reasoning in Large Language Models

Authors: Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

First: 2025-09-01T15:18:46+00:00 · Latest: 2026-01-05T17:24:02+00:00

Abs · PDF · Code1 · Code2

Abstract

Large language models can produce correct answers while relying on flawed reasoning traces, partly because common training objectives reward final-answer correctness rather than faithful intermediate reasoning. This undermines trustworthiness in high-stakes settings. We propose Counterfactual Sensitivity Regularization (CSR), a training paradigm that improves reasoning faithfulness by enforcing causal consistency between reasoning steps and outcomes. CSR automatically applies operator-level interventions to reasoning traces, such as swapping "+" with "-", to generate minimally perturbed counterfactual rationales, and penalizes the model when these logically invalid traces still lead to the original answer. Our implementation is efficient, adding about 9 percent training overhead via a warm-start curriculum and token-subset optimization. We evaluate faithfulness using Counterfactual Outcome Sensitivity (COS), which measures how appropriately answers change under logical perturbations. Across arithmetic (GSM8K), logical deduction (ProofWriter), multi-hop question answering (HotpotQA), and code generation (MBPP), CSR yields improved accuracy versus faithfulness trade-offs, establishing a new Pareto frontier. CSR improves faithfulness over standard fine-tuning and process supervision by up to 70 percentage points, and transfers across model families with 94.2 to 96.7 percent success in structured domains. CSR also complements inference-time methods such as self-consistency. Overall, CSR offers a practical route to more reliable reasoning in structured domains, including mathematics, formal logic, and code, where operators are well-defined and verifiable, covering an estimated 40 to 60 percent of high-stakes reasoning deployments.

中文标题/摘要

标题：因果一致性正则化：训练可验证敏感推理的大语言模型

大语言模型可以在依赖有缺陷的推理路径的同时产生正确的答案，部分原因是常见的训练目标奖励最终答案的正确性而不是忠实的中间推理。这在高风险环境中削弱了可信度。我们提出了反事实敏感性正则化（CSR），一种通过在推理步骤和结果之间强制因果一致性来提高推理忠实性的训练范式。CSR 自动对推理路径应用操作符级干预，例如将“+”替换为“-”，以生成最小扰动的反事实推理，并在这些逻辑无效的路径仍然导致原始答案时惩罚模型。我们的实现是高效的，通过热启动课程和词元子集优化，仅增加约9%的训练开销。我们使用反事实结果敏感性（COS）来评估忠实性，COS 衡量在逻辑扰动下答案变化的恰当程度。在算术（GSM8K）、逻辑推理（ProofWriter）、多跳问答（HotpotQA）和代码生成（MBPP）中，CSR 在准确性和忠实性之间提供了改进的权衡，建立了新的帕累托前沿。相对于标准微调和过程监督，CSR 将忠实性最多提高70个百分点，并且可以跨模型家族迁移，在结构化领域的成功率为94.2%到96.7%。CSR 还可与推理时的方法（如自我一致性）互补。总体而言，CSR 提供了一条通往结构化领域更可靠推理的实用途径，包括数学、形式逻辑和代码，其中操作符是定义良好且可验证的，估计覆盖了高风险推理部署的40%到60%。

Summary / 总结

The paper aims to enhance the reasoning faithfulness of large language models by addressing their reliance on flawed reasoning traces. It introduces Counterfactual Sensitivity Regularization (CSR), which enforces causal consistency between reasoning steps and outcomes through operator-level interventions. CSR improves the accuracy versus faithfulness trade-off, achieving up to 70 percentage point gains in faithfulness over standard fine-tuning and process supervision. The method is efficient, adding only about 9 percent training overhead. Evaluations across various domains show CSR's effectiveness in improving faithfulness and transferring across model families.

论文针对大型语言模型依赖错误推理产生正确答案的问题，这损害了它们的可信度。它提出了一种名为Counterfactual Sensitivity Regularization (CSR)的训练方法，该方法通过因果一致性确保推理步骤与结果的一致性。CSR通过操作级干预生成反事实推理，并在这些无效推理仍导致原始答案时惩罚模型。该方法在数学和代码生成等结构化领域提高了忠实性，而不会显著影响准确性，建立了新的帕累托前沿。
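
The operator-level intervention described in this entry can be pictured with a small sketch. The swap table, the rule that an operator must be surrounded by spaces, and the function names below are illustrative assumptions; the paper's actual intervention procedure and its COS metric may differ.

```python
import re

# Hypothetical swap table: each operator maps to a logically different one.
SWAPS = {"+": "-", "-": "+", "*": "/", "/": "*"}

def perturb_trace(trace, index=0):
    """Swap the index-th space-delimited binary operator in a reasoning trace.

    Returns the minimally perturbed trace, or None if no such operator exists.
    """
    matches = list(re.finditer(r"(?<= )[+\-*/](?= )", trace))
    if index >= len(matches):
        return None
    m = matches[index]
    return trace[:m.start()] + SWAPS[m.group()] + trace[m.end():]

def is_sensitive(original_answer, counterfactual_answer):
    """A faithful model should change its answer once the trace is invalid."""
    return original_answer != counterfactual_answer

trace = "3 + 4 = 7, then 7 * 2 = 14"
print(perturb_trace(trace))     # first operator swapped
print(perturb_trace(trace, 1))  # second operator swapped
```

In a training loop, the perturbed trace would be fed back to the model and the loss would penalize cases where `is_sensitive` is false; that loop is omitted here.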

pdfQA: Diverse, Challenging, and Realistic Question Answering over PDFs

Authors: Tobias Schimanski, Imene Kolli, Jingwei Ni, Yu Fan, Ario Saeid Vaghefi, Elliott Ash, Markus Leippold

First: 2026-01-05T17:15:26+00:00 · Latest: 2026-01-05T17:15:26+00:00

Abs · PDF · Code1 · Code2

Abstract

PDFs are the second-most used document type on the internet (after HTML). Yet, existing QA datasets commonly start from text sources or only address specific domains. In this paper, we present pdfQA, a multi-domain 2K human-annotated (real-pdfQA) and 2K synthetic dataset (syn-pdfQA) differentiating QA pairs in ten complexity dimensions (e.g., file type, source modality, source position, answer type). We apply and evaluate quality and difficulty filters on both datasets, obtaining valid and challenging QA pairs. We answer the questions with open-source LLMs, revealing existing challenges that correlate with our complexity dimensions. pdfQA presents a basis for end-to-end QA pipeline evaluation, testing diverse skill sets and local optimizations (e.g., in information retrieval or parsing).

中文标题/摘要

标题：pdfQA：在PDF上的多样化、具有挑战性和现实性的问答

PDF是互联网上使用第二多的文档类型（仅次于HTML）。然而，现有的问答数据集通常从文本来源开始，或者仅针对特定领域。在本文中，我们介绍了pdfQA，这是一个多领域的2000个人工标注的真实PDF问答数据集（real-pdfQA）和2000个合成数据集（syn-pdfQA），在十个复杂维度上区分问答对（例如，文件类型、来源模态、来源位置、答案类型）。我们在两个数据集上应用并评估了质量与难度过滤器，获得有效的具有挑战性的问答对。我们使用开源的大规模语言模型回答这些问题，揭示了与我们的复杂维度相关的现有挑战。pdfQA为端到端的问答管道评估提供了基础，测试了多种技能和局部优化（例如，在信息检索或解析方面的优化）。

Summary / 总结

The paper introduces pdfQA, a dataset designed to evaluate question answering systems on PDFs, which are widely used documents. It consists of 2,000 human-annotated and 2,000 synthetic QA pairs across ten complexity dimensions. The authors apply quality and difficulty filters to ensure the dataset is valid and challenging. They use open-source LLMs to answer the questions and identify challenges that align with the complexity dimensions, providing a basis for evaluating end-to-end QA pipelines and testing various skills and optimizations.

论文介绍了pdfQA数据集，旨在评估PDF上的问答系统，PDF是广泛使用的文档类型。该数据集包含2000个人工标注和2000个合成的问答对，涵盖了十个复杂维度。作者应用质量和难度过滤器以确保数据集有效且具有挑战性。他们使用开源LLM回答问题，并识别与复杂维度相一致的挑战，为评估端到端的问答管道和测试各种技能和优化提供基础。

TopoLoRA-SAM: Topology-Aware Parameter-Efficient Adaptation of Foundation Segmenters for Thin-Structure and Cross-Domain Binary Semantic Segmentation

Authors: Salim Khazem

First: 2026-01-05T17:03:45+00:00 · Latest: 2026-01-05T17:03:45+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Foundation segmentation models such as the Segment Anything Model (SAM) exhibit strong zero-shot generalization through large-scale pretraining, but adapting them to domain-specific semantic segmentation remains challenging, particularly for thin structures (e.g., retinal vessels) and noisy modalities (e.g., SAR imagery). Full fine-tuning is computationally expensive and risks catastrophic forgetting. We propose TopoLoRA-SAM, a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation. TopoLoRA-SAM injects Low-Rank Adaptation (LoRA) into the frozen ViT encoder, augmented with a lightweight spatial convolutional adapter and optional topology-aware supervision via differentiable clDice. We evaluate our approach on five benchmarks spanning retinal vessel segmentation (DRIVE, STARE, CHASE_DB1), polyp segmentation (Kvasir-SEG), and SAR sea/land segmentation (SL-SSDD), comparing against U-Net, DeepLabV3+, SegFormer, and Mask2Former. TopoLoRA-SAM achieves the best retina-average Dice and the best overall average Dice across datasets, while training only 5.2% of model parameters (~4.9M). On the challenging CHASE_DB1 dataset, our method substantially improves segmentation accuracy and robustness, demonstrating that topology-aware parameter-efficient adaptation can match or exceed fully fine-tuned specialist models. Code is available at: https://github.com/salimkhazem/Seglab.git

中文标题/摘要

标题：TopoLoRA-SAM：拓扑感知参数高效适应基础分割模型以实现薄结构和跨域二元语义分割

基础分割模型如分割一切模型（SAM）通过大规模预训练表现出强大的零样本泛化能力，但将其适应到特定领域的语义分割仍然具有挑战性，尤其是对于薄结构（例如视网膜血管）和嘈杂的模态（例如SAR影像）。全量微调计算成本高昂且存在灾难性遗忘的风险。我们提出了一种拓扑感知和参数高效的适应框架——TopoLoRA-SAM。TopoLoRA-SAM 将低秩适应（LoRA）注入冻结的ViT编码器，并增加了轻量级的空间卷积适配器和可选的拓扑感知监督（通过可微分的clDice实现）。我们在五个基准上评估了该方法，涵盖了视网膜血管分割（DRIVE、STARE、CHASE_DB1）、息肉分割（Kvasir-SEG）和SAR海/陆分割（SL-SSDD），并与U-Net、DeepLabV3+、SegFormer和Mask2Former进行比较。TopoLoRA-SAM 在视网膜平均Dice和整体平均Dice方面均取得最佳成绩，同时仅训练了模型参数的5.2%（约4.9M）。在具有挑战性的CHASE_DB1数据集上，我们的方法显著提高了分割准确性和鲁棒性，证明了拓扑感知参数高效的适应可以与或超越完全微调的专业模型。代码可在：https://github.com/salimkhazem/Seglab.git

Summary / 总结

TopoLoRA-SAM is a topology-aware and parameter-efficient adaptation framework for binary semantic segmentation, designed to address the challenges of adapting foundation models like SAM to thin structures and noisy modalities. It injects Low-Rank Adaptation into the frozen ViT encoder and adds a lightweight spatial convolutional adapter, with optional topology-aware supervision. Experiments on five benchmarks show that TopoLoRA-SAM achieves the best retina-average Dice and overall average Dice across datasets, training only 5.2% of the model parameters (approximately 4.9M). On the challenging CHASE_DB1 dataset, it significantly improves segmentation accuracy and robustness, matching or surpassing fully fine-tuned specialist models.

研究旨在解决将基础分割模型如SAM适应特定领域任务的挑战，特别是对于细小结构和噪声数据。TopoLoRA-SAM是一种拓扑感知且参数高效的适应框架，将LoRA注入冻结的ViT编码器，并添加了一个轻量级的空间卷积适配器。该方法在多种基准测试中实现了最佳的视网膜平均Dice值和总体平均Dice值，仅训练了5.2%的模型参数。在具有挑战性的CHASE_DB1数据集上，它显著提高了分割准确性和鲁棒性，可以媲美或超过完全微调的专用模型。
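
Two quantities in this entry lend themselves to a short worked example: the Dice score used for evaluation and the reason LoRA trains so few parameters. The sketch below computes plain Dice on flat binary masks (not the differentiable clDice used for topology-aware supervision) and counts the parameters LoRA adds to a single weight matrix. The 768 x 768 layer size and rank 8 are assumptions chosen for illustration, not values reported by the paper.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly, so define their score as 1.0.
    return 1.0 if total == 0 else 2.0 * intersection / total

def lora_trainable_params(d_in, d_out, rank):
    """LoRA adds two factors: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical layer: one 768 x 768 projection adapted with rank 8.
full = 768 * 768
lora = lora_trainable_params(768, 768, rank=8)
fraction = lora / full

print(dice_score([1, 1, 0, 0], [1, 0, 0, 0]))
print(lora, full, round(fraction, 4))
```

For this assumed layer the low-rank factors hold roughly 2% of the parameters of the frozen matrix, which shows the mechanism behind a small trainable fraction; the 5.2% figure in the abstract also covers the adapter and other trained components.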

DiffProxy: Multi-View Human Mesh Recovery via Diffusion-Generated Dense Proxies

Authors: Renke Wang, Zhenyu Zhang, Ying Tai, Jian Yang

First: 2026-01-05T16:51:45+00:00 · Latest: 2026-01-05T16:51:45+00:00

Comments: Page: https://wrk226.github.io/DiffProxy.html, Code: https://github.com/wrk226/DiffProxy

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Human mesh recovery from multi-view images faces a fundamental challenge: real-world datasets contain imperfect ground-truth annotations that bias the models' training, while synthetic data with precise supervision suffers from domain gap. In this paper, we propose DiffProxy, a novel framework that generates multi-view consistent human proxies for mesh recovery. Central to DiffProxy is leveraging the diffusion-based generative priors to bridge the synthetic training and real-world generalization. Its key innovations include: (1) a multi-conditional mechanism for generating multi-view consistent, pixel-aligned human proxies; (2) a hand refinement module that incorporates flexible visual prompts to enhance local details; and (3) an uncertainty-aware test-time scaling method that increases robustness to challenging cases during optimization. These designs ensure that the mesh recovery process effectively benefits from the precise synthetic ground truth and generative advantages of the diffusion-based pipeline. Trained entirely on synthetic data, DiffProxy achieves state-of-the-art performance across five real-world benchmarks, demonstrating strong zero-shot generalization particularly on challenging scenarios with occlusions and partial views. Project page: https://wrk226.github.io/DiffProxy.html

中文标题/摘要

标题：DiffProxy：通过扩散生成密集代理的多视角人体网格恢复

从多视角图像恢复人体网格面临着一个根本性的挑战：现实世界的数据集包含有偏的不完美标注，影响模型的训练，而带有精确监督的合成数据则存在领域差距。本文提出了一种名为DiffProxy的新框架，用于生成多视角一致的人体代理以进行网格恢复。DiffProxy的核心在于利用基于扩散的生成先验来弥合合成训练与现实世界泛化的差距。其关键创新包括：(1) 一种多条件机制，用于生成多视角一致、像素对齐的人体代理；(2) 一种手部细化模块，结合灵活的视觉提示以增强局部细节；(3) 一种基于不确定性测试时的缩放方法，以在优化过程中提高对具有遮挡和部分视角的挑战性情况的鲁棒性。这些设计确保了网格恢复过程能够有效利用精确的合成标注和基于扩散的生成优势。DiffProxy完全在合成数据上训练，实现了五个现实世界基准上的最佳性能，特别是在具有遮挡和部分视角的挑战性场景中表现出强大的零样本泛化能力。

Summary / 总结

DiffProxy is a novel framework for human mesh recovery from multi-view images, addressing the challenge of imperfect ground-truth annotations in real-world datasets and the domain gap in synthetic data. It generates multi-view consistent human proxies using diffusion-based generative priors, incorporating a multi-conditional mechanism, a hand refinement module, and an uncertainty-aware scaling method. DiffProxy, trained solely on synthetic data, achieves state-of-the-art performance across five real-world benchmarks, especially in challenging scenarios with occlusions and partial views.

DiffProxy 是一种用于从多视角图像恢复人体网格的新框架，解决了现实世界注解偏差和合成数据域差异的挑战。它使用基于扩散的生成先验生成多视角一致的人体代理，并包含手部细化模块和不确定性感知的测试时缩放方法。DiffProxy 仅在合成数据上训练，已在五个现实世界基准上超越现有方法，特别是在具有遮挡和部分视角的挑战性场景中表现出色。

Towards Fair In-Context Learning with Tabular Foundation Models

Authors: Patrik Kenfack, Samira Ebrahimi Kahou, Ulrich Aïvodji

First: 2025-05-14T15:53:14+00:00 · Latest: 2026-01-05T16:39:29+00:00

Comments: Published in Transactions on Machine Learning Research (TMLR)

Abs · PDF · Code1 · Code2 · Code3

Abstract

Transformer-based tabular foundation models have recently demonstrated promising in-context learning (ICL) performance on structured data, emerging as competitive alternatives to gradient-boosted trees. However, the fairness implications of this new paradigm remain largely unexplored. We present the first investigation of fairness in tabular ICL, evaluating three recently proposed foundation models--TabPFNv2, TabICL, and TabDPT--on multiple benchmark datasets. To mitigate biases, we explore three pre-processing fairness-enhancing methods: correlation removal (decorrelating input features from the sensitive attribute), group-balanced sample selection (ensuring equal representation of protected groups in context examples), and uncertainty-based sample selection (prioritizing context examples with high sensitive-attribute prediction uncertainty). Our experiments show that the uncertainty-based strategy consistently improves group fairness metrics (e.g., demographic parity, equalized odds, and equal opportunity) with minimal impact on predictive accuracy. We release our code to facilitate reproducibility https://github.com/patrikken/Fair-TabICL.

中文标题/摘要

标题：迈向公平的表格上下文学习

基于变换器的表格基础模型在结构化数据上最近展示了有希望的上下文学习（ICL）性能，成为梯度增强树的有竞争力的替代方案。然而，这种新范式的公平性影响尚未得到充分探索。我们首次对表格ICL中的公平性进行了研究，评估了三种最近提出的基础模型——TabPFNv2、TabICL和TabDPT——在多个基准数据集上的表现。为了减轻偏差，我们探索了三种预处理公平性增强方法：相关性去除（使输入特征与敏感属性解相关）、群体平衡样本选择（确保受保护群体在上下文示例中的平等代表性）和基于不确定性样本选择（优先选择敏感属性预测不确定性高的上下文示例）。我们的实验表明，基于不确定性的策略在最小影响预测准确性的情况下，始终能提高群体公平性指标（如人口统计公平性、同等机会和同等概率）。我们发布了代码以促进可重复性：https://github.com/patrikken/Fair-TabICL。

Summary / 总结

This study investigates fairness in in-context learning (ICL) for tabular data using transformer-based foundation models, such as TabPFNv2, TabICL, and TabDPT. To address potential biases, the authors evaluate three pre-processing methods: correlation removal, group-balanced sample selection, and uncertainty-based sample selection. The results indicate that the uncertainty-based strategy enhances group fairness metrics with little effect on predictive accuracy, making it a promising approach for fair ICL in tabular data. The code is available at https://github.com/patrikken/Fair-TabICL.

该研究探讨了基于变压器的表结构数据在上下文学习（ICL）中的公平性问题，评估了TabPFNv2、TabICL和TabDPT三种基础模型。为缓解偏见，作者应用了三种预处理方法：相关性去除、分组平衡样本选择和不确定性样本选择。研究结果表明，不确定性方法在提高群体公平性指标的同时，对预测准确性的影响很小。
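
To make the fairness terms in this entry concrete, the sketch below computes a demographic parity gap and draws a group-balanced set of context examples. It assumes a binary sensitive attribute coded as 0 and 1 and binary predictions; the function names and the fixed seed are hypothetical and are not taken from the released Fair-TabICL code.

```python
import random

def demographic_parity_diff(preds, groups):
    """Absolute gap in positive-prediction rate between groups 0 and 1."""
    rate = {}
    for g in (0, 1):
        members = [p for p, s in zip(preds, groups) if s == g]
        rate[g] = sum(members) / len(members)
    return abs(rate[0] - rate[1])

def group_balanced_sample(indices, groups, k_per_group, seed=0):
    """Pick the same number of context examples from each protected group."""
    rng = random.Random(seed)
    chosen = []
    for g in sorted(set(groups)):
        pool = [i for i in indices if groups[i] == g]
        chosen.extend(rng.sample(pool, k_per_group))
    return chosen

preds = [1, 1, 0, 0, 1, 0]
groups = [0, 0, 0, 1, 1, 1]
print(demographic_parity_diff(preds, groups))
print(group_balanced_sample(list(range(6)), groups, k_per_group=2))
```

A gap of 0 means both groups receive positive predictions at the same rate. The uncertainty-based selection favored by the paper would replace the random draw with a ranking by sensitive-attribute prediction uncertainty, which is not shown here.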

Training More Robust Classification Model via Discriminative Loss and Gaussian Noise Injection

Authors: Hai-Vy Nguyen, Fabrice Gamboa, Sixin Zhang, Reda Chhaibi, Serge Gratton, Thierry Giaccone

First: 2024-05-28T18:10:45+00:00 · Latest: 2026-01-05T16:38:03+00:00

Comments: Published in Transactions on Machine Learning Research (TMLR)

Abs · PDF · Code1 · Code2

Abstract

Robustness of deep neural networks to input noise remains a critical challenge, as naive noise injection often degrades accuracy on clean (uncorrupted) data. We propose a novel training framework that addresses this trade-off through two complementary objectives. First, we introduce a loss function applied at the penultimate layer that explicitly enforces intra-class compactness and increases the margin to analytically defined decision boundaries. This enhances feature discriminativeness and class separability for clean data. Second, we propose a class-wise feature alignment mechanism that brings noisy data clusters closer to their clean counterparts. Furthermore, we provide a theoretical analysis demonstrating that improving feature stability under additive Gaussian noise implicitly reduces the curvature of the softmax loss landscape in input space, as measured by Hessian eigenvalues. This naturally enhances robustness without explicit curvature penalties. Conversely, we also theoretically show that lower curvatures lead to more robust models. We validate the effectiveness of our method on standard benchmarks and our custom dataset. Our approach significantly reinforces model robustness to various perturbations while maintaining high accuracy on clean data, advancing the understanding and practice of noise-robust deep learning.

中文标题/摘要

标题：通过判别性损失和高斯噪声注入训练更具鲁棒性的分类模型

深度神经网络对输入噪声的鲁棒性仍然是一个关键挑战，因为简单的噪声注入往往会降低干净（未受污染）数据上的准确率。我们提出了一种新的训练框架，通过两个互补的目标来解决这种权衡。首先，我们引入了一个应用于倒数第二层的损失函数，该函数明确地促进了类内紧凑性并增加了到分析定义的决策边界的余量，从而增强了干净数据的特征可判别性和类别可分性。其次，我们提出了一种类内特征对齐机制，将噪声数据簇拉近其干净的对应物。此外，我们还提供了一种理论分析，证明在加性高斯噪声下提高特征稳定性隐式地减少了softmax损失景观在输入空间中的曲率，这可以通过海森矩阵特征值来衡量。因此，这自然增强了鲁棒性，而无需显式的曲率惩罚。相反，我们还从理论上证明了较低的曲率会导致更鲁棒的模型。我们在标准基准和我们自定义的数据集上验证了我们方法的有效性。我们的方法在各种扰动下显著增强了模型的鲁棒性，同时在干净数据上保持了高准确率，推动了噪声鲁棒深度学习的理解和实践。

Summary / 总结

The paper proposes a training framework to improve the robustness of deep neural networks against input noise. It introduces a discriminative loss function and a class-wise feature alignment mechanism. The loss function enhances feature discriminativeness and class separability for clean data, while the feature alignment mechanism brings noisy data closer to their clean counterparts. Theoretical analysis shows that improving feature stability under noise reduces the curvature of the softmax loss landscape, enhancing robustness. Experiments on standard benchmarks and a custom dataset demonstrate that the proposed method significantly improves model robustness to various perturbations while maintaining high accuracy on clean data.

论文旨在解决训练深度神经网络在对抗输入噪声的同时保持在干净数据上的准确性。提出了一种新的训练框架，包含两个目标：在倒数第二层使用区分性损失函数以增强特征的区分性和类别可分性，以及一种类别特征对齐机制以使噪声数据更接近干净数据。理论分析表明，这种方法降低了softmax损失景观在输入空间的曲率，从而增强了鲁棒性。在标准基准数据集和自定义数据集上的实验表明，所提出的方法显著提高了模型对各种扰动的鲁棒性，同时在干净数据上保持了高准确性。