Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision
Authors: Aadarsh Sahoo, Georgia Gkioxari
First: 2026-02-13T18:58:30+00:00 · Latest: 2026-02-13T18:58:30+00:00
Comments: Project webpage: https://glab-caltech.github.io/converseg/
Abstract
Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained on our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/
中文标题/摘要
标题:对话图像分割:通过可扩展监督锚定抽象概念
对话图像分割将抽象、意图驱动的概念锚定到像素精确的掩码中。先前的参考图像接地工作主要集中在类别和空间查询(例如,“最左边的苹果”)上,并忽略了功能性和物理推理(例如,“我可以在哪里安全地存放刀具?”)。我们填补了这一空白并引入了对话图像分割(CIS)和ConverSeg基准,该基准涵盖了实体、空间关系、意图、功能、安全性和物理推理。我们还提出了ConverSeg-Net,它将强大的分割先验与语言理解融合在一起,并且使用AI驱动的数据引擎生成无需人工监督的提示-掩码对。我们展示了当前的语言引导分割模型对于CIS来说是不足的,而使用我们数据引擎训练的ConverSeg-Net在ConverSeg上取得了显著的提升,并在现有的语言引导分割基准上保持了强大的性能。项目网页:https://glab-caltech.github.io/converseg/
Summary / 总结
The research aims to ground abstract concepts into pixel-accurate masks through conversational image segmentation, addressing the limitations of prior work which focused on categorical and spatial queries. The study introduces Conversational Image Segmentation (CIS) and ConverSeg, a benchmark that includes entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. ConverSeg-Net, a model that combines strong segmentation priors with language understanding, is proposed and trained using an AI-powered data engine. The results show that current language-guided segmentation models are insufficient for CIS, while ConverSeg-Net significantly improves performance on ConverSeg and maintains strong results on existing benchmarks.
研究旨在通过对话式图像分割将抽象概念转化为像素级准确的掩码,解决先前工作仅关注类别和空间查询的局限性。研究引入了对话式图像分割(CIS)和ConverSeg基准,该基准涵盖了实体、空间关系、意图、功能、安全性和物理推理。提出并训练了结合强分割先验与语言理解的ConverSeg-Net模型,使用AI驱动的数据引擎生成提示-掩码对。结果显示,当前的语言引导分割模型在CIS上不足,而ConverSeg-Net在ConverSeg上取得了显著提升,并在现有基准上保持了良好的性能。
Semantic Chunking and the Entropy of Natural Language
Authors: Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks
First: 2026-02-13T18:58:10+00:00 · Latest: 2026-02-13T18:58:10+00:00
Comments: 29 pages, 9 figures
Abstract
The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which are captured by the only free parameter in our model.
中文标题/摘要
标题:语义切块与自然语言的熵
印刷英语的熵率著名地估计为每个字符大约一个比特,这是一个基准,现代大型语言模型(LLMs)仅在最近才接近。这一熵率意味着英语相对于预期的每个字符五比特的随机文本,几乎含有80%的冗余。我们引入了一个统计模型,试图捕捉自然语言复杂的多层次结构,提供了一个从第一原理出发的冗余水平解释。该模型描述了一种自相似地将文本切块为语义上连贯的片段的过程,直到单个单词级别。文本的语义结构可以逐级分解,从而允许进行分析处理。现代LLMs和开源数据集的数值实验表明,我们的模型在语义层次的不同水平上定量地捕捉了真实文本的结构。我们的模型预测的熵率与印刷英语估计的熵率一致。此外,我们的理论还揭示了自然语言的熵率不是固定的,而是应该随着语料库的语义复杂性系统地增加,这由我们模型中的唯一自由参数来捕捉。
Summary / 总结
This study aims to understand the redundancy in natural language by modeling its semantic structure. The authors introduce a statistical model that segments text into semantically coherent chunks, from sentences to individual words, and hierarchically decomposes the semantic structure. Experiments with modern language models and open datasets show that the model accurately captures the entropy rate of natural language, which is about one bit per character, similar to the entropy rate of printed English. The model also suggests that the entropy rate increases with the semantic complexity of the text corpus.
研究旨在通过估算自然语言的熵率来理解其中的冗余性,熵率约为每个字符一个比特。作者提出了一种统计模型,将文本分割成语义上连贯的片段,允许对文本结构进行分层分解。实验表明,该模型能够准确捕捉真实文本的熵率,并表明熵率随着语义复杂性的增加而系统地增加。
CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
Authors: Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
First: 2026-02-13T18:57:31+00:00 · Latest: 2026-02-13T18:57:31+00:00
Comments: Project Page: https://sayands.github.io/cope/
Abstract
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit to the maximum context window constraint, current methods use keyframe sampling which can miss both macro-level events and micro-level details due to the sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals) which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to $86\%$ and token usage by up to $93\%$ compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities we are able to maintain or exceed performance on $14$ diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
中文标题/摘要
标题:CoPE-VideoLM:视频语言模型的编解码器基础
视频语言模型(VideoLMs)使AI系统能够理解视频中的时间动态。为了适应最大上下文窗口限制,当前方法使用关键帧采样,这可能会因时间覆盖稀疏而错过宏观事件和微观细节。此外,处理每一帧的完整图像及其标记还会产生巨大的计算开销。为了解决这些限制,我们提出利用视频编解码器基础(特别是运动向量和残差),这些基础能够原生编码视频冗余和稀疏性,而无需对大多数帧进行昂贵的完整图像编码。为此,我们引入了轻量级的基于变压器的编码器,通过预训练策略将编解码器基础的表示与图像编码器嵌入对齐,从而加速端到端微调期间的收敛。我们的方法将时间到首个标记的时间减少了最多86%,标记使用量减少了最多93%,与标准VideoLMs相比。此外,通过调整关键帧和编解码器基础的密度,我们能够在涵盖一般问题回答、时间推理、长视频理解以及空间场景理解的14个不同视频理解基准测试中保持或超越性能。
Summary / 总结
The paper aims to improve the efficiency of Video Language Models (VideoLMs) by leveraging video codec primitives such as motion vectors and residuals, which encode video redundancy and sparsity without the need for full-image encoding. The authors introduce lightweight transformer-based encoders that pre-train on codec primitives and align their representations with image encoder embeddings, leading to a reduction in time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. The approach maintains or exceeds performance on 14 diverse video understanding benchmarks.
论文旨在通过利用视频编解码器的基本元素(如运动向量和残差),来提高视频语言模型(VideoLM)的效率,这些元素能够编码视频中的冗余和稀疏性,而无需进行完整的图像编码。作者引入了轻量级的基于变压器的编码器,这些编码器在预训练阶段对编解码器的基本元素进行训练,并将它们的表示与图像编码器的嵌入对齐,从而将首个标记的时间减少多达86%,并减少标记使用量多达93%。该方法在14个不同的视频理解基准测试中保持或超过了性能。
Learning-based Radio Link Failure Prediction Based on Measurement Dataset in Railway Environments
Authors: Po-Heng Chou, Da-Chih Lin, Hung-Yu Wei, Walid Saad, Yu Tsao
First: 2025-11-12T00:13:37+00:00 · Latest: 2026-02-13T18:53:36+00:00
Comments: 6 pages, 3 figures, 2 tables, and submitted to 2026 IEEE ICC Workshops
Abstract
This paper presents a measurement-driven case study on early radio link failure (RLF) warning as device-side network sensing and analytics for proactive mobility management in 5G non-standalone (NSA) railway environments. Using 10~Hz metro-train measurement traces with serving- and neighbor-cell indicators, we benchmark six representative learning models, including CNN, LSTM, XGBoost, Anomaly Transformer, PatchTST, and TimesNet, under multiple observation windows and prediction horizons. Rather than proposing a new prediction architecture, this study focuses on quantifying the feasibility of early warning and the trade-offs among observation context, prediction horizon, and alarm reliability under real railway mobility. Experimental results show that learning models can anticipate RLF-related reliability degradation seconds in advance using lightweight features available on commercial devices. The presented benchmark provides practical insights for sensing-assisted communication control, such as proactive redundancy activation and adaptive handover strategies, aligning with the 6G vision of integrating sensing and analytics into mobility control.
中文标题/摘要
标题:基于测量数据的铁路环境中基于学习的早期无线链路失败预测
本文基于测量数据,对5G非独立(NSA)铁路环境中设备侧网络感知与分析中的早期无线链路失败(RLF)预警进行了案例研究,以实现主动移动管理。使用10~Hz的地铁列车测量轨迹和服务小区及邻小区指示器,我们对六种代表性学习模型进行了基准测试,包括CNN、LSTM、XGBoost、异常变换器、PatchTST和TimesNet,考察了多种观测窗口和预测时间范围下的性能。本研究不提出新的预测架构,而是关注早期预警的可行性以及观测上下文、预测时间范围和警报可靠性之间的权衡。实验结果表明,学习模型可以利用商用设备上可用的轻量级特征,提前数秒预测RLF相关的可靠性退化。所提出的基准为基于感知的通信控制提供了实用见解,如主动冗余激活和自适应切换策略,与6G愿景中将感知与分析集成到移动控制中相一致。
Summary / 总结
This paper investigates early radio link failure (RLF) prediction in 5G NSA railway environments using 10 Hz metro-train measurement traces. Six learning models (CNN, LSTM, XGBoost, Anomaly Transformer, PatchTST, and TimesNet) are benchmarked under various observation windows and prediction horizons. The study finds that these models can predict RLF-related reliability degradation seconds in advance using lightweight device features, offering practical insights for proactive network management strategies.
该研究评估了六种机器学习模型(CNN、LSTM、XGBoost、Anomaly Transformer、PatchTST 和 TimesNet),使用10 Hz 测量轨迹来预测5G NSA 铁路环境中的早期无线链路失败。研究发现,这些模型可以使用轻量级的设备侧特征提前几秒预测RLF相关的可靠性下降,为冗余激活和自适应切换等主动网络管理策略提供了实用见解。基准测试结果突出了在实际铁路移动场景中观察上下文、预测时间范围和警报可靠性的权衡。
R-Zero: Self-Evolving Reasoning LLM from Zero Data
Authors: Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin Huang, Haitao Mi, Dong Yu
First: 2025-08-07T03:38:16+00:00 · Latest: 2026-02-13T18:53:32+00:00
Abstract
Self-evolving Large Language Models (LLMs) offer a scalable path toward super-intelligence by autonomously generating, refining, and learning from their own experiences. However, existing methods for training such models still rely heavily on vast human-curated tasks and labels, typically via fine-tuning or reinforcement learning, which poses a fundamental bottleneck to advancing AI systems toward capabilities beyond human intelligence. To overcome this limitation, we introduce R-Zero, a fully autonomous framework that generates its own training data from scratch. Starting from a single base LLM, R-Zero initializes two independent models with distinct roles, a Challenger and a Solver. These models are optimized separately and co-evolve through interaction: the Challenger is rewarded for proposing tasks near the edge of the Solver capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger. This process yields a targeted, self-improving curriculum without any pre-existing tasks and labels. Empirically, R-Zero substantially improves reasoning capability across different backbone LLMs, e.g., boosting the Qwen3-4B-Base by +6.49 on math-reasoning benchmarks and +7.54 on general-domain reasoning benchmarks.
中文标题/摘要
标题:R-Zero: 从零数据自我进化的推理大语言模型
自我进化的大型语言模型(LLMs)通过自主生成、改进和从自身经验中学习,提供了一条通向超智能的可扩展路径。然而,现有方法在训练此类模型时仍然高度依赖大量的人工标注任务和标签,通常通过微调或强化学习实现,这构成了推动AI系统达到超越人类智能能力的瓶颈。为克服这一限制,我们引入了R-Zero,这是一种完全自主的框架,从零开始生成自己的训练数据。从一个基础LLM开始,R-Zero初始化两个具有不同角色的独立模型,一个挑战者和一个解决者。这些模型分别优化并相互进化:挑战者因提出接近解决者能力边缘的任务而获得奖励,解决者因解决挑战者提出的越来越具挑战性的任务而获得奖励。这一过程产生了一个有针对性的、自我改进的课程,无需任何预先存在的任务和标签。实验证明,R-Zero显著提高了不同基础LLM的推理能力,例如,在数学推理基准测试中将Qwen3-4B-Base的性能提升了6.49,在通用领域推理基准测试中提升了7.54。
Summary / 总结
R-Zero is a self-evolving framework that generates its own training data autonomously, starting from a single base LLM. It introduces a Challenger and a Solver, which co-evolve through interaction, with the Challenger proposing increasingly challenging tasks and the Solver solving them. This process enhances reasoning capabilities across different LLMs, notably improving Qwen3-4B-Base by 6.49 on math-reasoning benchmarks and 7.54 on general-domain reasoning benchmarks.
R-Zero 是一个从零开始生成训练数据的自主框架,从一个基础 LLM 开始,引入挑战者和解题者,通过互动共同进化,挑战者提出解题者能力边缘的任务,解题者解决越来越具挑战性的任务。这一过程形成了一种针对性的、自我提升的课程。实验结果显示,R-Zero 显著提升了不同基础 LLM 的推理能力,Qwen3-4B-Base 在数学推理基准上的表现提升了 +6.49,一般领域推理基准上的表现提升了 +7.54。
tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models
Authors: Kevin Li, Dibyadeep Saha, Avni Kanodia, Fan Lai
First: 2026-02-06T23:26:02+00:00 · Latest: 2026-02-13T18:35:06+00:00
Abstract
As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that are worse than executing independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2--1.8x, job training completion time by 2.3--5.4x, and GPU utilization by 37%.
中文标题/摘要
标题:tLoRA:高效的弹性共享超模型多LoRA训练
随着低秩适应(LoRA)成为高效微调大型语言模型(LLMs)的标准方法,共享集群越来越多地在同一冻结主模型上执行多个并发的LoRA训练任务。虽然最近的进步使多个适配器在服务时能够批量处理(共定位),但在训练时间上高效地共定位异构LoRA适配器却提出了独特的挑战。任务往往在适配器秩、批量大小和资源分配方面有所不同,而简单的批量处理可能会引入同步停滞、通信开销以及比独立执行更糟糕的每任务减速。我们提出了tLoRA框架,该框架能够高效地批量训练多个LoRA任务。tLoRA将具有相同基础模型的适配器融合成一个弹性共享超模型,并利用现有的分布式训练框架来制定有效的资源共享并行计划。在内核级别,tLoRA采用了一个融合的LoRA内核,能够自适应地重构低秩计算块并调度秩感知的纳米批处理,以最大化适配器之间计算和通信的重叠。在调度层面上,tLoRA引入了一个在线的、残差容量感知调度器,能够自适应地分组任务以最大化集体吞吐量。使用实际集群跟踪的评估表明,tLoRA将训练吞吐量提高了1.2-1.8倍,任务训练完成时间提高了2.3-5.4倍,并且GPU利用率提高了37%。
Summary / 总结
tLoRA is a framework that enables efficient batch training of multiple LoRA jobs by fusing shared base models into an elastic shared super-model, which uses distributed training frameworks to share resources and employs a fused LoRA kernel and an online scheduler to maximize throughput and utilization. Evaluations show that tLoRA improves training throughput by 1.2--1.8x, job training completion time by 2.3--5.4x, and GPU utilization by 37%.
tLoRA 是一个框架,通过将共享的基础模型融合成一个弹性共享超级模型来实现多个 LoRA 任务的高效批处理训练,从而提高 1.2--1.8 倍的训练吞吐量,减少 2.3--5.4 倍的训练完成时间,并提高 37% 的 GPU 利用率。它使用融合的 LoRA 内核和在线调度器来最大化资源共享和吞吐量。
MissionHD: Hyperdimensional Refinement of Distribution-Deficient Reasoning Graphs for Video Anomaly Detection
Authors: Sanggeon Yun, Raheeb Hassan, Ryozo Masukawa, Nathaniel D. Bastian, Mohsen Imani
First: 2025-08-20T14:43:04+00:00 · Latest: 2026-02-13T18:30:09+00:00
Abstract
LLM-generated reasoning graphs, referred to as mission-specific graphs (MSGs), are increasingly used for video anomaly detection (VAD) and recognition (VAR). However, they are typically treated as fixed despite being generic and distribution-deficient. Conventional graph structure refinement (GSR) methods are ill-suited to this setting, as they rely on learning structural distributions that are absent in LLM-generated graphs. We propose HDC-constrained Graph Structure Refinement (HDC-GSR), a new paradigm that directly optimizes a decodable, task-aligned graph representation in a single hyperdimensional space without distribution modeling. Leveraging Hyperdimensional Computing (HDC), our framework encodes graphs via binding and bundling operations, aligns the resulting graph code with downstream loss, and decodes edge contributions to refine the structure. We instantiate this approach as MissionHD for weakly supervised VAD/VAR and demonstrate consistent performance gains on benchmark datasets.
中文标题/摘要
标题:MissionHD:超维度分布精炼的使命特定推理图视频异常检测
由LLM生成的推理图被称为使命特定图(MSGs),它们越来越多地用于视频异常检测(VAD)和识别(VAR)。然而,它们通常被视为固定的,尽管它们是通用且分布不足的。传统的图结构精炼(GSR)方法在这种情况下并不适用,因为它们依赖于学习LLM生成图中不存在的结构分布。我们提出了HDC约束下的图结构精炼(HDC-GSR),这是一种新的范式,直接在超维度空间中优化一个可解码的任务对齐的图表示,而无需进行分布建模。利用超维度计算(HDC),我们的框架通过绑定和捆绑操作编码图,使结果图代码与下游损失对齐,并解码边贡献以精炼结构。我们将此方法实例化为MissionHD,用于弱监督VAD/VAR,并在基准数据集上展示了持续的性能提升。
Summary / 总结
The research aims to improve video anomaly detection and recognition by refining generic reasoning graphs generated by large language models (LLMs). The proposed HDC-GSR method directly optimizes a task-aligned graph representation in a hyperdimensional space without modeling distributions, using Hyperdimensional Computing (HDC). Experiments show consistent performance gains on benchmark datasets for weakly supervised video anomaly detection and recognition.
研究旨在通过利用LLM生成的、通常被视为固定且分布不足的使命特定图来提升视频异常检测和识别的性能。提出的HDC-GSR方法直接在一个超维度空间中优化任务对齐的图表示,而不建模分布。该方法通过绑定和捆绑操作编码图,使图代码与下游损失对齐,并通过解码边贡献来优化结构,从而在基准数据集上实现一致的性能提升。
Asynchronous Verified Semantic Caching for Tiered LLM Architectures
Authors: Asmit Kumar Singh, Haozhe Wang, Laxmi Naga Santosh Attaluri, Tak Chiam, Weihua Zhu
First: 2026-02-13T18:25:00+00:00 · Latest: 2026-02-13T18:25:00+00:00
Abstract
Large language models (LLMs) now sit in the critical path of search, assistance, and agentic workflows, making semantic caching essential for reducing inference cost and latency. Production deployments typically use a tiered static-dynamic design: a static cache of curated, offline vetted responses mined from logs, backed by a dynamic cache populated online. In practice, both tiers are commonly governed by a single embedding similarity threshold, which induces a hard tradeoff: conservative thresholds miss safe reuse opportunities, while aggressive thresholds risk serving semantically incorrect responses. We introduce \textbf{Krites}, an asynchronous, LLM-judged caching policy that expands static coverage without changing serving decisions. On the critical path, Krites behaves exactly like a standard static threshold policy. When the nearest static neighbor of the prompt falls just below the static threshold, Krites asynchronously invokes an LLM judge to verify whether the static response is acceptable for the new prompt. Approved matches are promoted into the dynamic cache, allowing future repeats and paraphrases to reuse curated static answers and expanding static reach over time. In trace-driven simulations on conversational and search workloads, Krites increases the fraction of requests served with curated static answers (direct static hits plus verified promotions) by up to $\textbf{3.9}$ times for conversational traffic and search-style queries relative to tuned baselines, with unchanged critical path latency.
中文标题/摘要
标题:异步验证语义缓存用于分层LLM架构
大型语言模型(LLMs)现在处于搜索、辅助和代理工作流的关键路径上,使得语义缓存对于减少推理成本和延迟变得至关重要。生产部署通常采用分层的静态-动态设计:一个静态缓存,包含从日志中挖掘出的经过离线审核的精选响应,由一个在线填充的动态缓存支持。实际上,两个层级通常由单一的嵌入相似度阈值共同管理,这导致了硬性的权衡:保守的阈值会错过安全的重用机会,而激进的阈值则有风险提供语义上不正确的响应。我们引入了**Krites**,一种异步的、由LLM判断的缓存策略,它在不改变服务决策的情况下扩展静态覆盖范围。在关键路径上,Krites的行为与标准静态阈值策略完全相同。当提示的最近静态邻居刚好低于静态阈值时,Krites会异步调用LLM判断,验证静态响应是否适用于新提示。批准的匹配项会被提升到动态缓存中,允许未来重复和改写使用精选的静态答案,并随着时间的推移扩展静态覆盖范围。在基于跟踪的模拟中,对于对话流量和搜索风格的查询,Krites将使用精选静态答案(直接静态命中加上验证提升)的比例提高了最多**3.9**倍,同时保持了关键路径延迟不变。
Summary / 总结
The paper introduces Krites, an asynchronous verified semantic caching policy designed to enhance the efficiency and accuracy of large language models (LLMs) in tiered architectures. Krites expands the static cache coverage without altering serving decisions, using an LLM judge to verify static responses when they fall just below the threshold. This approach significantly increases the fraction of requests served with curated static answers, up to 3.9 times for conversational traffic and search queries, while maintaining the same critical path latency as baseline models.
论文通过引入Krites异步缓存策略解决了大型语言模型(LLMs)中的语义缓存问题。Krites利用LLM验证静态缓存响应,扩大静态覆盖范围而不改变服务决策。在模拟中,Krites显著提高了以受curated静态答案服务的请求数量,相比调优基线最多提高了3.9倍,同时保持了相同的关键路径延迟。
Learnable Chernoff Baselines for Inference-Time Alignment
Authors: Sunil Madhow, Yuchen Liang, Ness Shroff, Yingbin Liang, Yu-Xiang Wang
First: 2026-02-08T00:09:40+00:00 · Latest: 2026-02-13T18:15:21+00:00
Abstract
We study inference-time reward-guided alignment for generative models. Existing methods often rely on either architecture-specific adaptations or computationally costly inference procedures. We introduce Learnable Chernoff Baselines (LCBs) as a method for efficiently and approximately sampling from the exponentially tilted kernels that arise from KL-regularized reward alignment. Using only black-box sampling access to the pretrained model, LCBs implement a form of rejection sampling with adaptively selected acceptance probabilities, which allows fine-grained control over inference-compute scaling. We establish total-variation guarantees to the ideal aligned model, and demonstrate in both continuous and discrete diffusion settings that LCB sampling closely matches ideal rejection sampling while using substantially fewer queries to the pretrained model.
中文标题/摘要
标题:可学习的切诺夫基线用于推理时对齐
我们研究了生成模型的推理时奖励导向对齐。现有方法通常依赖于特定架构的适应或计算成本高昂的推理程序。我们引入了可学习的切诺夫基线(LCBs),作为一种高效且近似地从KL正则化奖励对齐中产生的指数倾斜核进行采样的方法。仅使用对预训练模型的黑盒采样访问,LCBs 实现了一种带有自适应选择接受概率的拒绝采样形式,这允许对推理计算量进行精细控制。我们建立了与理想对齐模型的总变差保证,并在连续和离散扩散设置中证明,LCBs 采样与理想的拒绝采样非常接近,同时使用了显著较少的对预训练模型的查询次数。
Summary / 总结
The research aims to improve inference-time reward-guided alignment for generative models by addressing the limitations of existing methods, which either require architecture-specific adaptations or are computationally expensive. The study introduces Learnable Chernoff Baselines (LCBs) to efficiently sample from exponentially tilted kernels using black-box access to the pretrained model. The method employs rejection sampling with adaptively selected acceptance probabilities, enabling fine-grained control over inference-compute trade-offs. Experiments in continuous and discrete diffusion settings show that LCBs closely match ideal rejection sampling while using significantly fewer queries to the pretrained model, establishing total-variation guarantees to the ideal aligned model.
研究旨在通过解决现有方法的局限性来改进生成模型的推理时奖励导向对齐,这些方法要么需要特定架构的适应,要么计算成本高昂。研究引入了可学习的切尔诺夫基线(LCBs),通过黑盒访问预训练模型来高效地从指数倾斜内核中采样。该方法采用自适应选择接受概率的拒绝采样,能够精细地控制推理计算权衡。实验在连续和离散扩散设置中表明,LCBs在使用显著较少的预训练模型查询次数的同时,能够接近理想的拒绝采样,建立了与理想对齐模型的总变差保证。
In-Context Autonomous Network Incident Response: An End-to-End Large Language Model Agent Approach
Authors: Yiran Gao, Kim Hammar, Tao Li
Venue: AAAI
First: 2026-02-13T18:09:30+00:00 · Latest: 2026-02-13T18:09:30+00:00
Comments: 2026 AAAI Summer Symposium on Human-Aware AI Agents for the Cyber Battlefield
Abstract
Rapidly evolving cyberattacks demand incident response systems that can autonomously learn and adapt to changing threats. Prior work has extensively explored the reinforcement learning approach, which involves learning response strategies through extensive simulation of the incident. While this approach can be effective, it requires handcrafted modeling of the simulator and suppresses useful semantics from raw system logs and alerts. To address these limitations, we propose to leverage large language models' (LLM) pre-trained security knowledge and in-context learning to create an end-to-end agentic solution for incident response planning. Specifically, our agent integrates four functionalities, perception, reasoning, planning, and action, into one lightweight LLM (14b model). Through fine-tuning and chain-of-thought reasoning, our LLM agent is capable of processing system logs and inferring the underlying network state (perception), updating its conjecture of attack models (reasoning), simulating consequences under different response strategies (planning), and generating an effective response (action). By comparing LLM-simulated outcomes with actual observations, the LLM agent repeatedly refines its attack conjecture and corresponding response, thereby demonstrating in-context adaptation. Our agentic approach is free of modeling and can run on commodity hardware. When evaluated on incident logs reported in the literature, our agent achieves recovery up to 23% faster than those of frontier LLMs.
中文标题/摘要
标题:上下文自适应网络事件响应:端到端大型语言模型代理方法
快速演化的网络攻击要求事件响应系统能够自主学习并适应不断变化的威胁。先前的工作已经广泛探索了强化学习的方法,该方法通过模拟事件的大量仿真来学习响应策略。虽然这种方法可能有效,但它需要手工构建模拟器模型,并抑制了从原始系统日志和警报中提取的有用语义。为了解决这些限制,我们提出利用大型语言模型(LLM)的预训练安全知识和上下文学习能力,创建一个端到端的代理解决方案来规划事件响应。具体而言,我们的代理将感知、推理、规划和行动四个功能整合到一个轻量级的LLM(14b模型)中。通过微调和链式思考推理,我们的LLM代理能够处理系统日志并推断出底层网络状态(感知),更新其对攻击模型的猜测(推理),在不同的响应策略下模拟后果(规划),并生成有效的响应(行动)。通过将LLM模拟结果与实际观察结果进行比较,LLM代理不断优化其对攻击的猜测及其相应的响应,从而展示了上下文适应能力。我们的代理方法无需建模,可以在普通硬件上运行。当在文献中报告的事件日志上进行评估时,我们的代理比前沿的LLM快23%以上实现恢复。
Summary / 总结
This paper addresses the need for autonomous incident response systems capable of learning and adapting to evolving cyberattacks. It proposes an end-to-end agent using large language models (LLMs) for perception, reasoning, planning, and action. The agent, fine-tuned and utilizing chain-of-thought reasoning, processes system logs, infers network states, updates attack models, simulates response strategies, and generates effective responses. Evaluation shows the agent recovers from incidents up to 23% faster than leading LLMs.
本文旨在应对不断演变的网络攻击,提出了一种端到端的基于大语言模型(LLM)的自主响应系统。该系统整合了感知、推理、规划和行动功能,通过微调和链式思考推理处理系统日志,推断网络状态,模拟响应策略并生成有效响应。评估结果显示,该系统比领先的LLM快23%以上完成故障恢复。
Choose Your Agent: Tradeoffs in Adopting AI Advisors, Coaches, and Delegates in Multi-Party Negotiation
Authors: Kehang Zhu, Nithum Thain, Vivian Tsai, James Wexler, Crystal Qian
First: 2026-02-12T15:41:57+00:00 · Latest: 2026-02-13T18:08:38+00:00
Abstract
As AI usage becomes more prevalent in social contexts, understanding agent-user interaction is critical to designing systems that improve both individual and group outcomes. We present an online behavioral experiment (N = 243) in which participants play three multi-turn bargaining games in groups of three. Each game, presented in randomized order, grants access to a single LLM assistance modality: proactive recommendations from an Advisor, reactive feedback from a Coach, or autonomous execution by a Delegate; all modalities are powered by an underlying LLM that achieves superhuman performance in an all-agent environment. On each turn, participants privately decide whether to act manually or use the AI modality available in that game. Despite preferring the Advisor modality, participants achieve the highest mean individual gains with the Delegate, demonstrating a preference-performance misalignment. Moreover, delegation generates positive externalities; even non-adopting users in access-to-delegate treatment groups benefit by receiving higher-quality offers. Mechanism analysis reveals that the Delegate agent acts as a market maker, injecting rational, Pareto-improving proposals that restructure the trading environment. Our research reveals a gap between agent capabilities and realized group welfare. While autonomous agents can exhibit super-human strategic performance, their impact on realized welfare gains can be constrained by interfaces, user perceptions, and adoption barriers. Assistance modalities should be designed as mechanisms with endogenous participation; adoption-compatible interaction rules are a prerequisite to improving human welfare with automated assistance.
中文标题/摘要
标题:选择您的代理:在多边谈判中采用AI顾问、教练和代理的权衡
随着AI在社会环境中的使用越来越普遍,理解代理与用户之间的互动对于设计能够改善个人和群体结果的系统至关重要。我们进行了一项在线行为实验(N = 243),参与者以三人一组的形式进行三次多回合的讨价还价游戏。每轮游戏以随机顺序呈现,提供一种LLM辅助模式:顾问的主动建议、教练的反应性反馈或代理的自主执行;所有模式均由一个在全代理环境中实现超人表现的LLM提供支持。在每轮中,参与者私下决定是否手动行动或使用该轮游戏中可用的AI模式。尽管参与者更偏好顾问模式,但他们使用代理模式时实现了最高的平均个人收益,显示出偏好与性能之间的不匹配。此外,代理还产生了正外部性;即使在可以使用代理的组中未采用代理的用户也能从中受益,因为他们收到了质量更高的报价。机制分析表明,代理代理充当了市场中介,注入了理性的、帕累托改进的提议,重新构架了交易环境。我们的研究揭示了代理能力与实现的群体福利之间的差距。虽然自主代理可以表现出超人的战略表现,但它们对实现的福利增益的影响可能会受到界面、用户感知和采用障碍的限制。辅助模式应被设计为具有内生参与机制的机制;与采用兼容的交互规则是利用自动化辅助改善人类福利的前提条件。
Summary / 总结
The study investigates how different AI assistance modalities (Advisor, Coach, Delegate) affect individual and group outcomes in multi-party negotiation. Through an online experiment with 243 participants, it was found that while participants preferred the Advisor, they achieved the highest individual gains with the Delegate. This indicates a preference-performance mismatch. Additionally, delegation had positive externalities, benefiting non-users as well. The Delegate acted as a market maker, improving the trading environment. The research highlights the need for AI assistance modalities to be designed with endogenous participation to maximize human welfare.
研究探讨了不同AI辅助模式(顾问、教练、代理)在多方谈判中的影响。通过一项包含243名参与者的在线实验,发现尽管参与者更偏好顾问模式,但代理模式却带来了最高的个人收益。这表明偏好与性能之间存在差距。此外,代理模式还产生了外部正效应,即使未使用该代理的参与者也从中受益。代理代理充当了市场中介,改善了交易环境。研究强调,AI辅助模式需要设计为具有内生参与的机制,以便最大化人类福利。
Quantization-Robust LLM Unlearning via Low-Rank Adaptation
Authors: João Vitor Boer Abitante, Joana Meneguzzo Pasquali, Luan Fonseca Garcia, Ewerton de Oliveira, Thomas da Silva Paula, Rodrigo C. Barros, Lucas S. Kupssinskü
First: 2026-02-13T18:01:40+00:00 · Latest: 2026-02-13T18:01:40+00:00
Abstract
Large Language Model (LLM) unlearning aims to remove targeted knowledge from a trained model, but practical deployments often require post-training quantization (PTQ) for efficient inference. However, aggressive low-bit PTQ can mask or erase unlearning updates, causing quantized models to revert to pre-unlearning behavior. We show that standard full-parameter fine-tuning often induce parameter changes that are too small to survive 4-bit quantization. We propose quantization-robust unlearning via low-rank adaptation (LoRA): we freeze the base model and concentrate unlearning into trainable adapters so that the effective update is preserved after quantization. On Llama-2-7B evaluated with MUSE dataset (BOOKS and NEWS), LoRA improves 4-bit utility by up to 7.93 points (NPO+GDR on BOOKS: 50.17 to 58.10) and yields higher 4-bit utility on NEWS for GA+GDR (40.06 to 44.82, increase of 4.76). LoRA also substantially reduces privacy leakage under 4-bit PTQ, e.g., for GA+KLR on BOOKS, PrivLeak moves from -25.68 to -5.86 (closer to ideal 0), while maintaining strong forgetting (VerMem and KnowMem near 0). Thus, using LoRA for Machine Unlearning is beneficial for scenarios where quantization is necessary for model deployment.
中文标题/摘要
标题:量化稳健的大语言模型去学习通过低秩适应
大语言模型(LLM)去学习旨在从训练模型中移除目标知识,但实际部署通常需要后训练量化(PTQ)以实现高效推理。然而,激进的低位宽PTQ可能会掩盖或擦除去学习更新,导致量化模型恢复到去学习前的行为。我们展示了标准的全参数微调通常会导致参数变化太小,无法在4位量化后存活。我们提出了一种量化稳健的去学习方法,通过低秩适应(LoRA):我们冻结基础模型,并将去学习集中在可训练的适配器中,以确保量化后的有效更新得以保留。在使用MUSE数据集(BOOKS和NEWS)评估的Llama-2-7B上,LoRA将4位的实用性提高了最多7.93个点(NPO+GDR在BOOKS上:50.17到58.10),并在NEWS上也获得了更高的4位实用性(GA+GDR:40.06到44.82,增加4.76)。LoRA还显著减少了4位PTQ下的隐私泄露,例如,在GA+KLR在BOOKS上的情况下,PrivLeak从-25.68变为-5.86(更接近理想的0),同时保持了强大的遗忘(VerMem和KnowMem接近0)。因此,在需要量化部署的场景中,使用LoRA进行机器去学习是有益的。
Summary / 总结
The paper addresses the challenge of unlearning targeted knowledge from large language models (LLMs) while maintaining their performance after post-training quantization (PTQ). It proposes quantization-robust unlearning via low-rank adaptation (LoRA), which involves freezing the base model and concentrating unlearning updates in trainable adapters. This method preserves effective updates after 4-bit quantization, improving utility by up to 7.93 points on the MUSE dataset and reducing privacy leakage while maintaining strong forgetting. On Llama-2-7B, LoRA enhances 4-bit utility and privacy protection compared to standard full-parameter fine-tuning.
论文解决了在进行后训练量化的同时从大型语言模型(LLM)中删除特定知识的挑战。它提出了一种基于低秩适应(LoRA)的量化鲁棒去学习方法,该方法将去学习更新集中在可训练的适配器上,同时冻结基础模型。这种方法在4比特量化下提高了高达7.93个点的性能,并减少了隐私泄露,使其适用于需要量化以实现高效推理的场景。
Highlight & Summarize: RAG without the jailbreaks
Authors: Giovanni Cherubin, Andrew Paverd
First: 2025-08-04T20:01:00+00:00 · Latest: 2026-02-13T17:48:19+00:00
Abstract
Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. When interacting with a chatbot, malicious users can input specially crafted prompts that cause the LLM to generate undesirable content or perform a different task from its intended purpose. Existing systems attempt to mitigate this by hardening the LLM's system prompt or using additional classifiers to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. We present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user's question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user's question and extracts ("highlights") relevant passages from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe and implement several possible instantiations of H&S and evaluate their responses in terms of correctness, relevance, and quality. For certain question-answering (QA) tasks, the responses produced by H&S are judged to be as good, if not better, than those of a standard RAG pipeline.
中文标题/摘要
标题:突出显示与总结:无需越狱的RAG
防止大型语言模型(LLMs)的越狱和模型劫持是一项重要但具有挑战性的任务。在与聊天机器人交互时,恶意用户可以输入特制的提示,使LLM生成不希望的内容或执行与其预期目的不同的任务。现有系统试图通过强化LLM的系统提示或使用额外的分类器来检测不希望的内容或离题的对话来减轻这种风险。然而,这些概率方法由于可能输入和不希望输出的非常大的空间而相对容易被绕过。我们提出了并评估了Highlight & Summarize(H&S),这是一种新的检索增强生成(RAG)系统的设计模式,通过设计防止这些攻击。核心思想是像标准RAG流水线一样执行相同任务(即,基于相关来源提供自然语言答案),但从未向生成性LLM透露用户的提问。这通过将流水线分为两个组件来实现:一个突出显示器,它接受用户的提问并从检索到的文档中提取(“突出显示”)相关段落;一个总结器,它接受突出显示的段落并将其总结为连贯的答案。我们描述并实现了H&S的几种可能的实现方式,并从正确性、相关性和质量方面评估其响应。对于某些问答(QA)任务,H&S生成的响应被认为与标准RAG流水线生成的响应一样好,甚至更好。
Summary / 总结
The paper addresses the challenge of preventing jailbreaking and model hijacking in Large Language Models (LLMs) by introducing Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems. H&S ensures that the user's question is never revealed to the generative LLM, thereby preventing undesirable content generation. The system splits the pipeline into a highlighter and a summarizer, which process retrieved documents and generate a cohesive answer, respectively. Experimental results show that H&S produces responses that are as good as, if not better than, those from a standard RAG pipeline for certain QA tasks.
本文提出了一种新的检索增强生成系统设计模式Highlight & Summarize (H&S),以防止大型语言模型(LLMs)的越狱和模型劫持问题。H&S 将流程分为高亮和总结两个部分,确保生成模型不会看到用户的提问,从而降低生成不良内容的风险。评估结果显示,对于某些问答任务,H&S 的响应与标准 RAG 流程的响应一样好,甚至更好。
Weight Decay may matter more than muP for Learning Rate Transfer in Practice
Authors: Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen
Venue: ICLR 2026
First: 2025-10-21T21:36:14+00:00 · Latest: 2026-02-13T17:48:03+00:00
Comments: ICLR 2026
Abstract
Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of muP rely on strong assumptions, particularly about the geometric alignment of a layer's inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests muP's scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain empirical observations such as why muP requires the independent weight decay variant for good transfer.
中文标题/摘要
标题:权重衰减可能比muP在实践中对学习率转移更重要
将小型神经网络中的最优学习率转移到大型神经网络中,可以在超参数调优成本高昂的情况下实现高效的训练。为此,Maximal Update Parameterization (muP) 提出了一种学习率缩放方法,旨在保持不同模型宽度内部表示的更新动态稳定。然而,muP 的缩放规则依赖于一些强有力的假设,特别是关于层的输入与权重及梯度更新之间的几何对齐。在本大规模实证研究中,我们表明,在如大语言模型(LLM)训练等最需要学习率转移的实用设置中,这些假设仅在训练初期短暂成立。在训练的其余时间里,是权重衰减而不是muP 正确地稳定了不同宽度内部表示的更新动态,促进了学习率转移。这表明muP 的缩放主要作为一种隐式的学习率预热,使我们能够用修改后的预热计划表来很大程度上替代它。这些发现从根本上挑战了关于学习率转移的现有观点,并可以解释诸如为什么muP 需要独立的权重衰减变体才能实现良好转移等经验观察。
Summary / 总结
This study investigates the effectiveness of Maximal Update Parameterization (muP) for learning rate transfer between small and large neural networks. While muP aims to stabilize update dynamics by scaling learning rates based on geometric assumptions, the study finds that these assumptions hold only briefly at the start of training. Weight decay is shown to be more effective for stabilizing update dynamics throughout most of training, facilitating efficient learning rate transfer. This challenges the prevailing belief in muP's importance and suggests that modified warmup schedules can largely replace muP for learning rate transfer.
研究探讨了Maximal Update Parameterization (muP)和权重衰减在小规模和大规模神经网络之间实现学习率转移的有效性。研究挑战了muP的几何对齐假设在整个训练过程中都成立的假设,表明在整个训练过程中权重衰减比muP更关键,用于稳定不同宽度模型中的更新动态。研究结果表明,muP的主要作用是作为隐式的学习率预热,其缩放规则可以被修改的预热计划所替代,从根本上改变了对学习率转移机制的理解。
From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
Authors: Marcos Ortiz, Justin Hill, Collin Overbay, Ingrida Semenec, Frederic Sauve-Hoover, Jim Schwoebel, Joel Shor
First: 2025-12-19T21:37:15+00:00 · Latest: 2026-02-13T17:26:15+00:00
Abstract
Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt- to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons, assessing task-based ease of use, visual appeal, perceived completeness, and user trust. Our results show that these systems are not interchangeable: Firebase Studio consistently outperforms competing platforms across all human-evaluated dimensions, achieving the highest win rates for ease of use, trust, visual appeal, and visual appropriateness. Bolt performs competitively on visual appeal but trails Firebase on usability and trust, while Replit underperforms relative to both across most metrics. These findings highlight a persistent gap between visual polish and functional reliability in prompt-to-app systems and demonstrate the necessity of interactive, task-based evaluation. We release our benchmark framework, prompt set, and generated artifacts to support reproducible evaluation and future research in agentic application generation.
中文标题/摘要
标题:从提示到产品:代理应用程序生成系统的以人为本基准
能够从自然语言提示生成全栈网络应用程序的代理AI系统代表了软件开发中的重大转变。然而,评估这些系统仍然具有挑战性,因为视觉美观、功能正确性和用户信任往往不一致。因此,在现实的人本评价标准下,现有提示到应用程序工具之间的比较尚不清楚。在本文中,我们引入了一种以人为本的基准来评估提示到应用程序系统,并对三个广泛使用的平台进行了大规模比较研究:Replit、Bolt和Firebase Studio。使用涵盖常见网络应用程序任务的96个提示集,我们生成了288个独特的应用程序制品。我们通过涉及205名参与者和1,071对质量筛选后的成对比较的大规模人工评分研究,评估了这些系统的任务易用性、视觉吸引力、感知完整性以及用户信任。我们的结果显示,这些系统并非互换的:Firebase Studio在所有人工评估维度上始终优于竞争对手平台,获得最高的易用性、信任、视觉吸引力和视觉适宜性胜率。Bolt在视觉吸引力方面表现竞争,但在易用性和信任方面落后于Firebase,而Replit在大多数指标上相对于两者都表现不佳。这些发现突显了提示到应用程序系统中视觉美观与功能可靠性之间持续存在的差距,并证明了交互式、任务导向评估的必要性。我们发布了我们的基准框架、提示集和生成的制品,以支持可重复的评估和未来代理应用程序生成的研究。
Summary / 总结
This paper introduces a human-centered benchmark for evaluating agentic AI systems that generate full-stack web applications from natural language prompts. Using a diverse set of 96 prompts, the study compares Replit, Bolt, and Firebase Studio across 288 unique application artifacts. Through a large-scale human-rater study, it was found that Firebase Studio outperforms the other platforms in terms of ease of use, visual appeal, perceived completeness, and user trust. Bolt performs well in visual appeal but lags behind in usability and trust, while Replit underperforms in most metrics. The study highlights the gap between visual polish and functional reliability in prompt-to-app systems and emphasizes the need for interactive, task-based evaluation.
本文提出了一种以人为本的基准来评估能够从自然语言提示生成全栈web应用程序的代理AI系统。通过Replit、Bolt和Firebase Studio,研究评估了288个从96个提示生成的应用程序。结果表明,Firebase Studio在易用性、信任度、视觉吸引力和视觉适宜性方面均优于其他平台,而Bolt和Replit在不同指标上表现各异。这突显了在提示到应用系统中需要进行交互式、任务导向的评估的必要性。
Eventizing Traditionally Opaque Binary Neural Networks as 1-safe Petri net Models
Authors: Mohamed Tarraf, Alex Chan, Alex Yakovlev, Rishad Shafik
First: 2026-02-13T17:25:47+00:00 · Latest: 2026-02-13T17:25:47+00:00
Comments: Pre-print of latest work
Abstract
Binary Neural Networks (BNNs) offer a low-complexity and energy-efficient alternative to traditional full-precision neural networks by constraining their weights and activations to binary values. However, their discrete, highly non-linear behavior makes them difficult to explain, validate and formally verify. As a result, BNNs remain largely opaque, limiting their suitability in safety-critical domains, where causal transparency and behavioral guarantees are essential. In this work, we introduce a Petri net (PN)-based framework that captures the BNN's internal operations as event-driven processes. By "eventizing" their operations, we expose their causal relationships and dependencies for a fine-grained analysis of concurrency, ordering, and state evolution. Here, we construct modular PN blueprints for core BNN components including activation, gradient computation and weight updates, and compose them into a complete system-level model. We then validate the composed PN against a reference software-based BNN, verify it against reachability and structural checks to establish 1-safeness, deadlock-freeness, mutual exclusion and correct-by-construction causal sequencing, before we assess its scalability and complexity at segment, component, and system levels using the automated measurement tools in Workcraft. Overall, this framework enables causal introspection of transparent and event-driven BNNs that are amenable to formal reasoning and verification.
中文标题/摘要
标题:将传统不透明的二值神经网络事件化为1-安全Petri网模型
二值神经网络(BNNs)通过将权重和激活值限制为二值,提供了一种低复杂度和节能的替代传统全精度神经网络的选择。然而,它们的离散、高度非线性行为使其难以解释、验证和形式化验证。因此,BNNs 仍然很大程度上是不透明的,限制了它们在需要因果透明性和行为保证的安全关键领域中的适用性。在本文中,我们引入了一种基于Petri网(PN)的框架,该框架将BNN的内部操作表示为事件驱动的过程。通过“事件化”其操作,我们揭示了它们的因果关系和依赖关系,以便对并发性、顺序性和状态演化进行精细分析。在这里,我们为BNN的核心组件(包括激活、梯度计算和权重更新)构建了模块化的PN蓝图,并将它们组合成一个完整的系统级模型。然后,我们使用参考软件BNN验证组合的PN,通过可达性和结构检查验证其1-安全性、无死锁性、互斥性和构造正确的因果顺序,最后使用Workcraft中的自动化测量工具评估其在段、组件和系统级别上的可扩展性和复杂性。总体而言,该框架使透明且事件驱动的BNN能够进行因果内省,并且易于进行形式化推理和验证。
Summary / 总结
This work addresses the opacity of Binary Neural Networks (BNNs) by introducing a Petri net (PN)-based framework that captures BNN operations as event-driven processes, enhancing their causal transparency and behavioral guarantees. The framework constructs modular PN blueprints for BNN components and composes them into a complete system-level model, which is then validated for safety and correctness. Scalability and complexity are assessed using automated measurement tools, enabling formal reasoning and verification of BNNs in safety-critical domains.
本文通过引入基于Petri网(PN)的框架,将BNN的操作表示为事件驱动的过程,以增强其因果透明性和行为保证。该框架为BNN组件构建了模块化的PN蓝图,并将它们组合成一个完整的系统级模型,然后通过可达性和结构检查验证其安全性与正确性。使用Workcraft中的自动化测量工具评估其可扩展性和复杂性,从而使得BNNs在安全关键领域中易于进行形式化推理和验证。
How to Train Your LLM Web Agent: A Statistical Diagnosis
Authors: Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
First: 2025-07-05T17:12:33+00:00 · Latest: 2026-02-13T17:24:17+00:00
Abstract
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
中文标题/摘要
标题:如何训练你的LLM网络代理:一种统计诊断
基于LLM的网络代理最近取得了显著进展,但大部分进展发生在封闭源系统中,与开源替代方案之间的差距越来越大。进展受到两个关键挑战的阻碍:首先,专注于单一步骤任务,忽视了多步骤网络交互的复杂性;其次,需要大量计算资源来后训练基于LLM的网络代理。为了解决这个问题,我们提出了第一个基于统计的后训练计算分配研究。我们的方法使用两阶段管道,通过监督微调(SFT)训练Llama 3.1 8B学生模仿Llama 3.3 70B教师,然后通过策略梯度强化学习。我们发现这个过程对超参数选择高度敏感,使得全面搜索不切实际。为了节省他人昂贵的试错成本,我们采样了1,370种配置,并使用自助法估计有效的超参数。我们的结果显示,将SFT与策略梯度RL结合使用在WorkArena和MiniWob++上始终优于单独使用任一方法。此外,这种策略只需要纯SFT在MiniWob++上达到峰值性能所需计算量的55%,有效地推动了计算-性能帕累托前沿,并且是唯一能够缩小与封闭源模型差距的策略。
Summary / 总结
This study addresses the challenges of training open-source LLM-based web agents by proposing a two-stage pipeline that combines supervised fine-tuning and on-policy reinforcement learning. The research finds that this approach, when effectively tuned, outperforms either method alone on WorkArena and MiniWob++. Additionally, it requires significantly less compute resources, achieving peak performance with only 55% of the resources needed for pure supervised fine-tuning on MiniWob++. This work provides a statistically grounded method to optimize hyperparameters and reduce the compute costs for training LLM web agents.
该研究通过提出结合监督微调和策略梯度强化学习的两阶段管道,解决了训练开源LLM网络代理的挑战。研究发现,当优化时,该方法在WorkArena和MiniWob++上的表现优于单独使用任何一种方法。此外,这种方法所需的计算资源显著减少,仅需纯监督微调在MiniWob++上达到峰值性能所需资源的55%。这项工作提供了一种统计上可靠的方法来优化超参数并降低训练LLM网络代理的计算成本。
Batch-CAM: Introduction to better reasoning in convolutional deep learning models
Authors: Giacomo Ignesti, Davide Moroni, Massimo Martinelli
First: 2025-10-01T08:47:00+00:00 · Latest: 2026-02-13T17:11:23+00:00
Comments: 10 pages, 6 figures, submitted to Signal, Image and Video Processing, Springer Nature
Abstract
Deep learning opacity often impedes deployment in high-stakes domains. We propose a training framework that aligns model focus with class-representative features without requiring pixel-level annotations. To this end, we introduce Batch-CAM, a vectorised implementation of Gradient-weighted Class Activation Mapping that integrates directly into the training loop with minimal computational overhead. We propose two regularisation terms: a Prototype Loss, which aligns individual-sample attention with the global class average, and a Batch-CAM Loss, which enforces consistency within a training batch. These are evaluated using L1, L2, and SSIM metrics. Validated on MNIST and Fashion-MNIST using ResNet18 and ConvNeXt-V2, our method generates significantly more coherent and human-interpretable saliency maps compared to baselines. While maintaining competitive classification accuracy, the framework successfully suppresses spurious feature activation, as evidenced by qualitative reconstruction analysis. Batch-CAM appears to offer a scalable pathway for training intrinsically interpretable models by leveraging batch-level statistics to guide feature extraction, effectively bridging the gap between predictive performance and explainability.
中文标题/摘要
标题:Batch-CAM:提高卷积深度学习模型推理能力的介绍
深度学习的不透明性常阻碍其在高风险领域的部署。我们提出了一种训练框架,使模型聚焦于类代表特征,而无需像素级注释。为此,我们引入了Batch-CAM,这是一种向量化实现的梯度加权类激活映射,可以直接集成到训练循环中,且计算开销较小。我们提出了两种正则化项:原型损失,使单个样本的注意力与全局类平均值对齐;Batch-CAM损失,确保训练批次内部一致性。这些损失通过L1、L2和SSIM指标进行评估。在MNIST和Fashion-MNIST上使用ResNet18和ConvNeXt-V2验证,我们的方法生成的显著性图比基线方法更为连贯且易于人类解读。同时保持竞争力的分类准确性,该框架成功抑制了虚假特征激活,如定性重构分析所示。Batch-CAM似乎提供了一种通过利用批次级统计信息指导特征提取,从而实现预测性能与可解释性之间平衡的可扩展途径。
Summary / 总结
The research addresses the issue of opacity in deep learning models by proposing Batch-CAM, a training framework that aligns model focus with class-representative features. It introduces two regularization terms, Prototype Loss and Batch-CAM Loss, to enhance interpretability. Evaluated on MNIST and Fashion-MNIST using ResNet18 and ConvNeXt-V2, Batch-CAM generates more coherent and human-interpretable saliency maps while maintaining competitive classification accuracy and suppressing spurious feature activation.
研究旨在通过引入Batch-CAM框架提高卷积深度学习模型在高风险领域的可解释性,该框架通过两个正则化项:Prototype Loss和Batch-CAM Loss,将模型焦点与类代表特征对齐,并且这些项以最小的计算开销直接集成到训练循环中。该方法在MNIST和Fashion-MNIST上使用ResNet18和ConvNeXt-V2进行评估,生成了比基线更为一致和易于人类理解的显著性图,同时保持了竞争力的分类精度。定性分析表明,Batch-CAM抑制了不必要的特征激活,提高了模型的可解释性。
SCOPE: Selective Conformal Optimized Pairwise LLM Judging
Authors: Sher Badshah, Ali Emami, Hassan Sajjad
First: 2026-02-13T17:10:43+00:00 · Latest: 2026-02-13T17:10:43+00:00
Abstract
Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $α$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $α= 0.10$, \textsc{Scope} consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, \textsc{Scope} accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
中文标题/摘要
标题:SCOPE:选择性校准优化成对LLM评判
大型语言模型(LLMs)越来越多地被用作评判者,以替代昂贵的人类偏好标签在成对评估中的使用。尽管它们具有实用性,但LLM评判者仍然容易出现校准不当和系统性偏差。本文提出SCOPE(选择性校准优化成对评估)框架,该框架在有限样本统计保证下进行选择性成对评判。在可交换性假设下,SCOPE校准一个接受阈值,使得非弃权评判中的错误率最多为用户指定的水平α。为了使SCOPE获得无偏的不确定性信号,我们引入双向偏好熵(BPE),该方法在两个响应位置下查询评判者,聚合隐含的偏好概率以确保对响应顺序的不变性,并将聚合的概率转换为基于熵的不确定性评分。在MT-Bench、RewardBench和Chatbot竞技场中,BPE在不确定性质量上优于标准置信度代理,提供了一个更强的选择信号,使SCOPE能够一致地达到目标风险水平,同时在评判者规模上保持良好的覆盖率。特别是,在α=0.10时,SCOPE在所有基准和评判者规模上一致满足风险上限(经验风险≈0.097到0.099),同时保持较高的覆盖率,达到RewardBench上Qwen-14B的0.89和Qwen-32B的0.98。与朴素基线相比,在相同的预期风险约束下,SCOPE在Qwen-7B上的接受评判数量最多增加2.4倍,表明BPE使基于LLM的评估变得可靠且具有高覆盖率。
Summary / 总结
SCOPE is a framework for selective pairwise evaluation using large language models (LLMs) with finite-sample statistical guarantees. It introduces Bidirectional Preference Entropy (BPE) to provide a bias-neutral uncertainty signal, enabling LLMs to make calibrated judgments. Across various benchmarks, SCOPE consistently meets the target risk level while maintaining good coverage, demonstrating reliable and high-coverage LLM-based evaluation.
SCOPE 是一个具有有限样本统计保证的框架,用于选择性地进行成对判断,以解决大型语言模型(LLMs)在偏好判断中的失校准和偏差问题。它引入了双向偏好熵(BPE),以提供无偏的不确定性信号,这在标准置信度代理之上提高了不确定性质量。在MT-Bench、RewardBench和Chatbot Arena上的实验表明,SCOPE 一致地达到了目标风险水平,同时保持了良好的覆盖率,实测风险接近指定水平,并且与朴素基线相比,接受的判断数量最多提高了2.4倍。
Which Algorithms Can Graph Neural Networks Learn?
Authors: Solveig Wittig, Antonis Vasileiou, Robert R. Nerem, Timo Stoll, Floris Geerts, Yusu Wang, Christopher Morris
First: 2026-02-13T17:09:50+00:00 · Latest: 2026-02-13T17:09:50+00:00
Abstract
In recent years, there has been growing interest in understanding neural architectures' ability to learn to execute discrete algorithms, a line of work often referred to as neural algorithmic reasoning. The goal is to integrate algorithmic reasoning capabilities into larger neural pipelines. Many such architectures are based on (message-passing) graph neural networks (MPNNs), owing to their permutation equivariance and ability to deal with sparsity and variable-sized inputs. However, existing work is either largely empirical and lacks formal guarantees or it focuses solely on expressivity, leaving open the question of when and how such architectures generalize beyond a finite training set. In this work, we propose a general theoretical framework that characterizes the sufficient conditions under which MPNNs can learn an algorithm from a training set of small instances and provably approximate its behavior on inputs of arbitrary size. Our framework applies to a broad class of algorithms, including single-source shortest paths, minimum spanning trees, and general dynamic programming problems, such as the $0$-$1$ knapsack problem. In addition, we establish impossibility results for a wide range of algorithmic tasks, showing that standard MPNNs cannot learn them, and we derive more expressive MPNN-like architectures that overcome these limitations. Finally, we refine our analysis for the Bellman-Ford algorithm, yielding a substantially smaller required training set and significantly extending the recent work of Nerem et al. [2025] by allowing for a differentiable regularization loss. Empirical results largely support our theoretical findings.
中文标题/摘要
标题:哪些算法图神经网络能够学习?
近年来,人们越来越关注神经架构学习执行离散算法的能力,这一领域的研究通常被称为神经算法推理。目标是将算法推理能力整合到更大的神经管道中。许多这样的架构基于(消息传递)图神经网络(MPNNs),因为它们具有置换不变性和处理稀疏性和可变大小输入的能力。然而,现有工作要么主要是经验性的,缺乏形式上的保证,要么仅关注表达能力,留下了这些架构在有限训练集之外泛化的何时和如何的问题。在本工作中,我们提出了一种通用的理论框架,以表征在训练集中从小型实例学习算法并在任意大小的输入上证明逼近其行为的充分条件。我们的框架适用于一系列算法,包括单源最短路径、最小生成树以及一般的动态规划问题,如0-1背包问题。此外,我们为一系列算法任务建立了不可能性结果,表明标准MPNNs无法学习它们,并推导出更具表达能力的MPNN-like架构以克服这些限制。最后,我们对贝尔曼-福德算法进行了细化分析,导致所需的训练集显著减小,并显著扩展了Nerem等人[2025]的近期工作,允许使用可微正则化损失。实验结果大体上支持我们的理论发现。
Summary / 总结
This work aims to understand under what conditions message-passing graph neural networks (MPNNs) can learn and generalize algorithms beyond their training set. The authors propose a theoretical framework that characterizes the sufficient conditions for MPNNs to learn algorithms like shortest paths and dynamic programming problems. They also establish limitations for MPNNs in learning certain tasks and propose more expressive architectures. Empirical results support the theoretical findings, showing improved performance for the Bellman-Ford algorithm with differentiable regularization.
研究探讨了消息传递图神经网络(MPNN)在学习和泛化离散算法方面的条件。研究提出了一种理论框架,描述了MPNN从少量训练实例学习算法并在更大输入上近似其行为的充分条件。该研究适用于最短路径、动态规划等多种算法。研究还确定了MPNN在学习某些任务方面的局限性,并提出了更具表达性的架构来克服这些限制。实验证据支持理论发现,特别是对于贝尔曼-福特算法,所需的训练集更小,并且扩展了Nerem等人[2025]的工作,允许使用可微正则化损失。
Random Forests as Statistical Procedures: Design, Variance, and Dependence
Authors: Nathaniel S. O'Connell
First: 2026-02-13T17:08:43+00:00 · Latest: 2026-02-13T17:08:43+00:00
Comments: 26 pages, 2 figures. Supplementary material included
Abstract
Random forests are widely used prediction procedures, yet are typically described algorithmically rather than as statistical designs acting on a fixed dataset. We develop a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This perspective yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation. We further decompose both single-tree dispersion and inter-tree covariance using the laws of total variance and covariance, isolating two fundamental design mechanisms-reuse of training observations and alignment of data-adaptive partitions. These mechanisms induce a strict covariance floor, demonstrating that predictive variability cannot be eliminated by increasing the number of trees alone. The resulting framework clarifies how resampling, feature-level randomization, and split selection govern resolution, tree variability, and dependence, and establishes random forests as explicit finite-sample statistical designs whose behavior is determined by their underlying randomized construction.
中文标题/摘要
标题:随机森林作为统计程序:设计、方差与依赖性
随机森林广泛用于预测,但通常以算法形式而非作为作用于固定数据集的统计设计来描述。我们发展了一种有限样本的基于设计的随机森林表述,其中每棵树是显式的随机条件回归函数。这种视角导出了森林预测器的确切方差恒等式,将有限聚合的变异性与即使在无限聚合下也存在的结构依赖项分离开来。我们进一步使用总方差和总协方差定律分解单树的离散度和树间协方差,分离出两种基本的设计机制——训练观察的重用和数据自适应分区的对齐。这些机制诱导了一个严格的协方差下限,表明仅通过增加树的数量无法消除预测变异性。由此形成的框架阐明了抽样、特征级随机化和分裂选择如何控制分辨率、树的变异性与依赖性,并确立了随机森林作为由其随机化构建决定的明确有限样本统计设计。
Summary / 总结
This paper provides a statistical design perspective on random forests, treating each tree as a randomized conditional regression function. It derives an exact variance identity that separates finite-aggregation variability from structural dependence. The study decomposes single-tree dispersion and inter-tree covariance, highlighting the impact of resampling, feature-level randomization, and split selection. Key findings include the existence of a strict covariance floor that cannot be eliminated by increasing the number of trees, indicating that predictive variability is fundamentally tied to the design mechanisms of random forests.
论文旨在从统计学角度重新审视随机森林,将其视为明确的随机条件回归函数。它发展了一种有限样本的表述方式,将聚合变异性和结构性依赖性区分开来。关键发现包括识别了两种基本的设计机制:训练样本的重用和数据自适应分区的对齐,这两种机制导致了一个严格的协方差下限,表明仅增加树的数量并不能消除预测变异。
R-Diverse: Mitigating Diversity Illusion in Self-Play LLM Training
Authors: Gengsheng Li, Jinghan He, Shijie Wang, Dan Zhang, Ruiqi Liu, Renrui Zhang, Zijun Yao, Junfeng Fang, Haiyun Guo, Jinqiao Wang
First: 2026-02-13T17:07:42+00:00 · Latest: 2026-02-13T17:07:42+00:00
Abstract
Self-play bootstraps LLM reasoning through an iterative Challenger-Solver loop: the Challenger is trained to generate questions that target the Solver's capabilities, and the Solver is optimized on the generated data to expand its reasoning skills. However, existing frameworks like R-Zero often exhibit non-sustained improvement, where early gains degrade as self-play continues. We identify a key failure mode, Diversity Illusion, where the Solver's training signals appear diverse yet collapse into recurring underlying patterns. It manifests as (1) Local Diversity Illusion, where diversity is enforced only within-batch, inducing cross-iteration mode cycling; and (2) Surface Diversity Illusion, where questions vary superficially but require near-identical reasoning skills. To mitigate them, we propose R-Diverse with two aligned innovations: Memory-Augmented Penalty (MAP), which uses a persistent memory bank to discourage recycling across iterations, and Skill-Aware Measurement (SAM), which evaluates diversity by the reasoning skills exercised rather than surface variation of questions. Across 10 math and general reasoning benchmarks, R-Diverse sustains gains over more iterations and consistently outperforms prior self-play methods. Code is available at https://github.com/Gengsheng-Li/R-Diverse.
中文标题/摘要
标题:R-多元:缓解自游戏训练LLM的多样性错觉
自游戏通过迭代的挑战者-解决者循环来提升LLM的推理能力:挑战者被训练生成针对解决者能力的问题,而解决者则在生成的数据上进行优化以扩展其推理技能。然而,现有的框架如R-零经常表现出非持续改进,早期的改进随着自游戏的继续而退化。我们识别出一个关键的失败模式,即多样性错觉,其中解决者的训练信号看似多样化,但实际上却陷入重复的基础模式。这表现为(1)局部多样性错觉,其中多样性仅在批次内被强制执行,导致跨迭代模式循环;(2)表面多样性错觉,其中问题虽然表面上有所变化,但所需的推理技能却几乎相同。为了缓解这些问题,我们提出了R-多元,其中包含两项对齐的创新:持久记忆增强惩罚(MAP),使用持久的记忆库来阻止跨迭代的重复使用,以及技能感知测量(SAM),通过评估实际使用的推理技能而非问题表面变化来衡量多样性。在10个数学和一般推理基准测试中,R-多元在更多迭代中保持了改进,并且始终优于之前的自游戏方法。代码可在https://github.com/Gengsheng-Li/R-Diverse/ 获取。
Summary / 总结
The paper addresses the issue of Diversity Illusion in self-play training of large language models (LLMs), where the apparent diversity of training signals collapses into recurring patterns. It introduces R-Diverse, which uses Memory-Augmented Penalty (MAP) to discourage recycling of training data across iterations and Skill-Aware Measurement (SAM) to assess diversity based on reasoning skills rather than surface-level question variations. Experiments on 10 math and general reasoning benchmarks show that R-Diverse sustains performance gains over more iterations and outperforms previous self-play methods.
论文针对大型语言模型(LLM)自游戏训练中出现的多样性幻觉问题,即早期的推理技能提升会在后续迭代中退化。提出了R-Diverse,通过持久记忆库(Memory-Augmented Penalty, MAP)防止跨迭代重复使用训练数据,并通过技能感知测量(Skill-Aware Measurement, SAM)来基于实际使用的推理技能来评估多样性。实验结果显示,R-Diverse在10个数学和一般推理基准上能够持续提升性能并优于之前的自游戏方法。
Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation
Authors: Xutong Liu, Baran Atalar, Xiangxiang Dai, Jinhang Zuo, Siwei Wang, John C. S. Lui, Wei Chen, Carlee Joe-Wong
First: 2025-08-11T06:53:27+00:00 · Latest: 2026-02-13T17:03:20+00:00
Comments: Accepted to INFOCOM 2026
Abstract
Large Language Models (LLMs) are revolutionizing how users interact with information systems, yet their high inference cost poses serious scalability and sustainability challenges. Caching inference responses, allowing them to be retrieved without another forward pass through the LLM, has emerged as one possible solution. Traditional exact-match caching, however, overlooks the semantic similarity between queries, leading to unnecessary recomputation. Semantic caching addresses this by retrieving responses based on semantic similarity, but introduces a fundamentally different cache eviction problem: one must account for mismatch costs between incoming queries and cached responses. Moreover, key system parameters, such as query arrival probabilities and serving costs, are often unknown and must be learned over time. Existing semantic caching methods are largely ad-hoc, lacking theoretical foundations and unable to adapt to real-world uncertainty. In this paper, we present a principled, learning-based framework for semantic cache eviction under unknown query and cost distributions. We formulate both offline optimization and online learning variants of the problem, and develop provably efficient algorithms with state-of-the-art guarantees. We also evaluate our framework on a synthetic dataset, showing that our proposed algorithms perform matching or superior performance compared with baselines.
中文标题/摘要
标题:低成本LLM服务中的语义缓存:从离线学习到在线适应
大规模语言模型(LLMs)正在改变用户与信息系统交互的方式,但其高昂的推理成本提出了严重的可扩展性和可持续性挑战。通过缓存推理响应,使其在不需要再次通过LLM进行前向传递的情况下被检索,已成为一种可能的解决方案。然而,传统的精确匹配缓存忽略了查询之间的语义相似性,导致不必要的重新计算。语义缓存通过基于语义相似性检索响应来解决这一问题,但引入了一个根本不同的缓存淘汰问题:必须考虑传入查询与缓存响应之间的不匹配成本。此外,诸如查询到达概率和提供成本等关键系统参数往往是未知的,并且必须随着时间学习。现有的语义缓存方法大多是临时性的,缺乏理论基础,无法适应现实世界的不确定性。在本文中,我们提出了一种在未知查询和成本分布下的语义缓存淘汰的原理性、基于学习的框架。我们形式化了该问题的离线优化和在线学习变体,并开发了具有最先进的保证的证明有效算法。我们还在合成数据集上评估了我们的框架,结果显示我们提出的算法与基线相比具有匹配或更优的性能。
Summary / 总结
This paper addresses the challenge of efficiently serving Large Language Models (LLMs) by proposing a semantic caching framework that leverages semantic similarity for cache eviction. The authors formulate both offline optimization and online learning variants of the problem and develop provably efficient algorithms. Experimental results on a synthetic dataset demonstrate that their proposed algorithms outperform or match the performance of baseline methods.
论文提出了一种语义缓存框架来应对大型语言模型(LLM)的高推理成本问题。它形式化了离线和在线学习的缓存淘汰问题,并开发了具有理论保证的有效算法。实验结果表明,所提出的方法在合成数据集上优于或匹配基准方法。
Mathematics and Machine Creativity: A Survey on Bridging Mathematics with AI
Authors: Shizhe Liang, Wei Zhang, Tianyang Zhong, Tianming Liu
First: 2024-12-21T08:58:36+00:00 · Latest: 2026-02-13T17:01:27+00:00
Comments: This article is withdrawn due to internal authorship and supervisory considerations that require clarification before the work can proceed in its current form. After further review, I believe it is appropriate to pause and formally resolve these matters to ensure full compliance with institutional and collaborative research policies
Abstract
This paper presents a comprehensive overview on the applications of artificial intelligence (AI) in mathematical research, highlighting the transformative role AI has begun to play in this domain. Traditionally, AI advancements have heavily relied on theoretical foundations provided by mathematics and statistics. However, recent developments in AI, particularly in reinforcement learning (RL) and large language models (LLMs), have demonstrated the potential for AI to contribute back to mathematics by offering flexible algorithmic frameworks and powerful inductive reasoning capabilities that support various aspects of mathematical research. This survey aims to establish a bridge between AI and mathematics, providing insights into the mutual benefits and fostering deeper interdisciplinary understanding.
In particular, we argue that while current AI and LLMs may struggle with complex deductive reasoning, their "inherent creativity", the ability to generate outputs at high throughput based on recognition of shallow patterns, holds significant potential to support and inspire mathematical research. This creative capability, often overlooked, could be the key to unlocking new perspectives and methodologies in mathematics. Furthermore, we address the lack of cross-disciplinary communication: mathematicians may not fully comprehend the latest advances in AI, while AI researchers frequently prioritize benchmark performance over real-world applications in frontier mathematical research. This paper seeks to close that gap, offering a detailed exploration of AI fundamentals, its strengths, and its emerging applications in the mathematical sciences.
中文标题/摘要
标题:数学与机器创造力:人工智能与数学融合综述
本文综述了人工智能(AI)在数学研究中的应用,强调了AI在这一领域开始发挥的变革性作用。传统上,AI的进步很大程度上依赖于数学和统计学提供的理论基础。然而,最近AI的发展,特别是在强化学习(RL)和大型语言模型(LLMs)方面,已经展示了AI能够通过提供灵活的算法框架和强大的归纳推理能力,为数学研究做出贡献的潜力。本文旨在建立AI与数学之间的桥梁,提供相互利益的见解,并促进更深层次的跨学科理解。
特别地,我们认为虽然当前的AI和LLMs在复杂演绎推理方面可能存在问题,但它们的“内在创造力”,即基于浅层模式识别生成大量输出的能力,具有支持和激发数学研究的潜力。这种创造能力往往被忽视,可能是解锁数学新视角和方法的关键。此外,我们还讨论了跨学科沟通的缺乏:数学家可能无法完全理解AI的最新进展,而AI研究人员则经常优先考虑基准性能而非前沿数学研究的实际应用。本文旨在弥合这一差距,详细探讨AI的基本原理、优势及其在数学科学中的新兴应用。
Summary / 总结
This paper surveys the application of artificial intelligence (AI) in mathematical research, emphasizing the transformative role of AI, particularly in reinforcement learning and large language models, in supporting various aspects of mathematical research. The study highlights the potential of AI's 'inherent creativity' in generating outputs based on shallow patterns, which could inspire new perspectives and methodologies in mathematics. It also addresses the need for better cross-disciplinary communication between mathematicians and AI researchers to leverage the strengths of AI in mathematical sciences.
本文综述了人工智能在数学研究中的应用,强调了AI,尤其是强化学习和大型语言模型,在通过灵活的算法框架和归纳推理支持数学研究方面所发挥的变革性作用。研究指出,AI的“内在创造力”,即基于浅层模式生成输出的能力,可以激发新的数学视角和方法论。然而,文章指出数学家和AI研究人员之间缺乏跨学科沟通,并建议通过详细探索AI基础知识来弥合这一差距,促进更深层次的跨学科理解。
Consistency of Large Reasoning Models Under Multi-Turn Attacks
Authors: Yubo Li, Ramayya Krishnan, Rema Padman
First: 2026-02-13T16:58:47+00:00 · Latest: 2026-02-13T16:58:47+00:00
Abstract
Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue) with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.
中文标题/摘要
标题:大型推理模型在多轮攻击下的一致性
具有推理能力的大型推理模型在复杂任务上取得了最先进的性能,但它们在多轮对抗压力下的鲁棒性尚未得到充分探索。我们评估了九种前沿推理模型在对抗攻击下的表现。我们的研究发现,推理赋予了有意义但不完整的鲁棒性:大多数研究的推理模型显著优于指令调优基线,但所有模型都表现出不同的脆弱性特征,误导性建议普遍有效,社会压力显示出模型特定的有效性。通过轨迹分析,我们确定了五种失败模式(自我怀疑、社会从众、建议劫持、情绪易感性和推理疲劳),前两种占50%的失败。我们进一步证明,针对标准语言模型有效的信心感知响应生成(CARG)对推理模型无效,因为扩展的推理痕迹导致了过度自信;出人意料的是,随机信心嵌入优于目标提取。我们的结果表明,推理能力并不自动赋予对抗鲁棒性,基于信心的防御需要对推理模型进行根本性重新设计。
Summary / 总结
This study evaluates the robustness of nine advanced reasoning models under multi-turn adversarial attacks, finding that while reasoning models generally outperform instruction-tuned baselines, they still exhibit distinct vulnerabilities. The research identifies five failure modes, with Self-Doubt and Social Conformity being the most prevalent. Confidence-Aware Response Generation, which works well for standard language models, fails for reasoning models due to overconfidence, while random confidence embedding shows better performance. This suggests that reasoning capabilities do not inherently provide robustness against adversarial attacks and that new defense strategies are needed for reasoning models.
研究评估了九种先进推理模型在多轮对抗攻击下的鲁棒性,发现虽然推理模型优于指令调优基线,但它们存在不同的脆弱性。关键失败模式包括自我怀疑和社会顺从,占失败的一半。针对标准LLM有效的信心感知响应生成,在推理模型中因过度自信而失效,而随机信心嵌入表现更好。这表明,推理能力并不能自动提供对抗攻击的鲁棒性,信心基的防御需要为推理模型进行根本性 redesign。
Reasoning about Intent for Ambiguous Requests
Authors: Irina Saparina, Mirella Lapata
First: 2025-11-13T16:18:45+00:00 · Latest: 2026-02-13T16:55:57+00:00
Abstract
Large language models often respond to ambiguous requests by implicitly committing to one interpretation. Intent misunderstandings can frustrate users and create safety risks. To address this, we propose generating multiple interpretation-answer pairs in a single structured response to ambiguous requests. Our models are trained with reinforcement learning and customized reward functions using multiple valid answers as supervision. Experiments on conversational question answering and semantic parsing demonstrate that our method achieves higher coverage of valid answers than baseline approaches. Human evaluation confirms that predicted interpretations are highly aligned with their answers. Our approach promotes transparency with explicit interpretations, achieves efficiency by requiring only one generation step, and supports downstream applications through its structured output format.
中文标题/摘要
标题:对含糊请求进行意图推理
大型语言模型常常通过隐式地选择一种解释来回应含糊的请求。意图误解会令用户感到沮丧并带来安全风险。为了解决这一问题,我们提出了一种在单一结构化响应中生成多个解释-答案对的方法来处理含糊请求。我们的模型通过强化学习和定制的奖励函数进行训练,使用多个有效答案作为监督。在对话式问答和语义解析实验中,我们的方法在有效答案的覆盖率上优于基线方法。人类评估证实,预测的解释与答案高度一致。我们的方法通过明确的解释促进透明度,通过仅需一次生成步骤实现高效性,并通过结构化输出格式支持下游应用。
Summary / 总结
The paper addresses the issue of large language models misunderstanding ambiguous requests by proposing a method to generate multiple interpretation-answer pairs in a single structured response. The models are trained using reinforcement learning and customized reward functions. Experiments show that this approach covers more valid answers than baseline methods and aligns well with human interpretations, promoting transparency and efficiency in handling ambiguous requests.
论文针对大型语言模型在处理模糊请求时可能出现的误解问题,提出了生成多个解释-答案对的方法,以解决用户不满和安全风险。该方法通过强化学习和定制的奖励函数进行训练。实验结果显示,这种方法比基线方法覆盖了更多的有效答案,且人工评估表明预测的解释与答案高度一致。这种方法提高了透明度、效率,并通过结构化输出支持下游应用。
Panning for Gold: Expanding Domain-Specific Knowledge Graphs with General Knowledge
Authors: Runhao Zhao, Weixin Zeng, Wentao Zhang, Chong Chen, Zhengpin Li, Xiang Zhao, Lei Chen
First: 2026-01-15T15:06:56+00:00 · Latest: 2026-02-13T16:55:09+00:00
Comments: 13 pages, 3 figures
Abstract
Domain-specific knowledge graphs (DKGs) are critical yet often suffer from limited coverage compared to General Knowledge Graphs (GKGs). Existing tasks to enrich DKGs rely primarily on extracting knowledge from external unstructured data or completing KGs through internal reasoning, but the scope and quality of such integration remain limited. This highlights a critical gap: little systematic exploration has been conducted on how comprehensive, high-quality GKGs can be effectively leveraged to supplement DKGs.
To address this gap, we propose a new and practical task: domain-specific knowledge graph fusion (DKGF), which aims to mine and integrate relevant facts from general knowledge graphs into domain-specific knowledge graphs to enhance their completeness and utility. Unlike previous research, this new task faces two key challenges: (1) high ambiguity of domain relevance, i.e., difficulty in determining whether knowledge from a GKG is truly relevant to the target domain , and (2) cross-domain knowledge granularity misalignment, i.e., GKG facts are typically abstract and coarse-grained, whereas DKGs frequently require more contextualized, fine-grained representations aligned with particular domain scenarios.
To address these, we present ExeFuse, a neuro-symbolic framework based on a novel Fact-as-Program paradigm. ExeFuse treats fusion as an executable process, utilizing neuro-symbolic execution to infer logical relevance beyond surface similarity and employing target space grounding to calibrate granularity. We construct two new datasets to establish the first standardized evaluation suite for this task. Extensive experiments demonstrate that ExeFuse effectively overcomes domain barriers to achieve superior fusion performance.
中文标题/摘要
标题:淘金:利用通用知识图谱扩展领域特定知识图谱
领域特定知识图谱(DKGs)至关重要,但通常覆盖范围有限,不及通用知识图谱(GKGs)。现有丰富DKGs的方法主要依赖从外部非结构化数据中提取知识或通过内部推理完成KG,但这种整合的范围和质量仍然有限。这突显了一个关键缺口:很少有系统性的探索如何有效利用全面、高质量的GKG来补充DKGs。
为解决这一缺口,我们提出了一项新的实用任务:领域特定知识图谱融合(DKGF),旨在从通用知识图谱中挖掘和整合相关事实,以增强其完整性和实用性。与以往研究不同,这一新任务面临两个关键挑战:(1)领域相关性的高模糊性,即难以确定通用知识图谱中的知识是否真正与目标领域相关;(2)跨域知识粒度不匹配,即通用知识图谱中的事实通常较为抽象和粗粒度,而领域特定知识图谱则经常需要更具体、细粒度的表示,与特定领域场景相匹配。
为解决这些问题,我们提出了基于新型事实即程序范式的神经符号框架ExeFuse。ExeFuse将融合视为可执行的过程,利用神经符号执行推断超越表面相似性的逻辑相关性,并通过目标空间定位校准粒度。我们构建了两个新数据集,以建立该任务的第一个标准化评估套件。广泛的实验表明,ExeFuse有效地克服了领域障碍,实现了卓越的融合性能。
Summary / 总结
The paper addresses the limitation of domain-specific knowledge graphs (DKGs) by proposing a new task called domain-specific knowledge graph fusion (DKGF), which aims to integrate relevant facts from general knowledge graphs (GKGs) to enhance DKGs. The authors introduce ExeFuse, a neuro-symbolic framework that overcomes challenges such as domain relevance ambiguity and cross-domain granularity misalignment. Experiments show that ExeFuse outperforms existing methods in achieving superior fusion performance.
论文通过提出一种新的任务——领域特定知识图谱融合(DKGF),旨在将通用知识图谱(GKG)中的相关事实整合到领域特定知识图谱(DKG)中,以增强DKG的完整性和实用性。作者引入了ExeFuse框架,采用事实作为程序的范式来推断逻辑相关性并校准粒度,克服了领域相关性和知识粒度不匹配的挑战。实验表明,ExeFuse在融合性能上优于现有方法。
WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models
Authors: Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Junbo Li, Jun Fang, Zhou Zhao
First: 2026-02-12T16:22:11+00:00 · Latest: 2026-02-13T16:49:23+00:00
Comments: Open-source at https://naruto-2024.github.io/wavbench.github.io/
Abstract
With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.
中文标题/摘要
标题:WavBench:评估端到端语音对话模型的推理、口语和副语言能力
随着高级推理能力迅速融入语音对话模型,该领域迫切需要超越简单交互的基准测试,以应对现实世界的复杂性。然而,当前的评估主要遵循文本生成标准,忽视了口语特有的副语言和口语化特征,以及现代代理所需的认知深度。为弥合这一差距,我们引入了WavBench,这是一个全面的基准测试,旨在评估现有工作未能涵盖的现实对话能力。WavBench 建立了一个三部分框架:1) Pro 子集,旨在通过显著增加难度来严格挑战增强推理能力的模型;2) 基础子集,定义了一种新的口语标准,强调“可听性”,通过自然词汇、语言流畅性和互动关系,而非严格的书面准确性;3) 声学子集,涵盖明确理解、生成和隐含对话,以严格评估真实世界场景中的全面副语言能力。通过评估五种最先进的模型,WavBench 提供了关于复杂问题解决、口语表达和副语言保真的关键见解,指导稳健的语音对话模型的演变。基准数据集和评估工具可在 https://naruto-2024.github.io/wavbench.github.io/ 获取。
Summary / 总结
WavBench is a comprehensive benchmark designed to evaluate spoken dialogue models in handling complex reasoning, colloquialisms, and paralinguistics. It introduces a tripartite framework: Pro subset for challenging reasoning, Basic subset for natural colloquialism, and Acoustic subset for paralinguistic capabilities. Evaluating five state-of-the-art models, WavBench provides insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the development of robust spoken dialogue systems.
WavBench 是一个综合基准,旨在评估对话模型在处理复杂推理、口语化表达和副语言能力方面的表现。它引入了三部分框架:Pro 子集用于挑战推理能力,Basic 子集用于自然口语化表达,Acoustic 子集用于副语言能力。通过评估五种最先进的模型,WavBench 提供了复杂问题解决、口语化表达和副语言忠实度交叉领域的见解,指导稳健的语音对话系统的开发。
Post-hoc Probabilistic Vision-Language Models
Authors: Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp
Venue: ICLR 2026
First: 2024-12-08T18:16:13+00:00 · Latest: 2026-02-13T16:49:09+00:00
Comments: Published at ICLR 2026. Project page: https://aaltoml.github.io/BayesVLM/
Abstract
Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
中文标题/摘要
标题:事后概率视觉-语言模型
视觉-语言模型(VLMs),如CLIP和SigLIP,在分类、检索和生成任务中取得了显著的成功。为此,VLMs将图像和文本描述确定性地映射到一个联合潜在空间,在该空间中使用余弦相似度评估它们的相似性。然而,在下游任务中使用确定性映射输入时,无法捕捉由于领域转移而产生的概念不确定性。在本文中,我们提出了一种事后不确定性估计方法,不需要额外的训练。我们的方法利用VLMs最后一层的贝叶斯后验近似,并分析地量化了余弦相似度的不确定性。我们展示了其在不确定性量化和积极学习支持集选择中的有效性。与基线相比,我们获得了改进且校准良好的预测不确定性、可解释的不确定性估计以及样本高效的积极学习。我们的结果表明,对于大规模模型的安全关键应用具有前景。
Summary / 总结
This work addresses the limitations of deterministic mappings in vision-language models (VLMs) by proposing a post-hoc method for uncertainty estimation. The method approximates the Bayesian posterior over the last layers of VLMs and quantifies uncertainties in cosine similarities. The approach improves predictive uncertainties, provides interpretable uncertainty estimates, and enhances sample-efficient active learning compared to baselines, making it suitable for safety-critical applications.
该研究针对视觉-语言模型(VLMs)中确定性映射的局限性,提出了一种后验不确定性估计方法。该方法通过贝叶斯后验近似来量化图像和文本描述之间余弦相似性的不确定性,无需额外训练。研究显示,这种方法能够提高预测不确定性,提供可解释的不确定性估计,并增强样本高效的主动学习,使其适用于安全关键应用。
EXCODER: EXplainable Classification Of DiscretE time series Representations
Authors: Yannik Hahn, Antonin Königsfeld, Hasan Tercan, Tobias Meisen
First: 2026-02-13T16:47:45+00:00 · Latest: 2026-02-13T16:47:45+00:00
Comments: Accepted at PAKDD 2026
Abstract
Deep learning has significantly improved time series classification, yet the lack of explainability in these models remains a major challenge. While Explainable AI (XAI) techniques aim to make model decisions more transparent, their effectiveness is often hindered by the high dimensionality and noise present in raw time series data. In this work, we investigate whether transforming time series into discrete latent representations-using methods such as Vector Quantized Variational Autoencoders (VQ-VAE) and Discrete Variational Autoencoders (DVAE)-not only preserves but enhances explainability by reducing redundancy and focusing on the most informative patterns. We show that applying XAI methods to these compressed representations leads to concise and structured explanations that maintain faithfulness without sacrificing classification performance. Additionally, we propose Similar Subsequence Accuracy (SSA), a novel metric that quantitatively assesses the alignment between XAI-identified salient subsequences and the label distribution in the training data. SSA provides a systematic way to validate whether the features highlighted by XAI methods are truly representative of the learned classification patterns. Our findings demonstrate that discrete latent representations not only retain the essential characteristics needed for classification but also offer a pathway to more compact, interpretable, and computationally efficient explanations in time series analysis.
中文标题/摘要
标题:EXCODER: 可解释离散时间序列表示分类
深度学习在时间序列分类方面取得了显著进步,但这些模型缺乏解释性仍然是一个主要挑战。尽管可解释人工智能(XAI)技术旨在使模型决策更加透明,但它们的有效性往往受到原始时间序列数据中高维度和噪声的影响。在本文中,我们研究了将时间序列转换为离散的潜在表示(如使用向量量化变分自编码器(VQ-VAE)和离散变分自编码器(DVAE)等方法)是否不仅保留了而且增强了解释性,通过减少冗余并专注于最具信息性的模式。我们展示了将XAI方法应用于这些压缩表示可以产生简洁且结构化的解释,这些解释保持了忠实性而不牺牲分类性能。此外,我们提出了相似子序列准确性(SSA),这是一种新的度量标准,用于定量评估XAI识别的重要子序列与训练数据标签分布之间的对齐程度。SSA提供了一种系统的方法来验证XAI方法突出的特征是否真正代表了学习到的分类模式。我们的研究结果表明,离散的潜在表示不仅保留了分类所需的本质特征,还为时间序列分析提供了更紧凑、可解释和计算效率更高的解释途径。
Summary / 总结
This paper addresses the challenge of explainability in deep learning models for time series classification. It proposes using Vector Quantized Variational Autoencoders (VQ-VAE) and Discrete Variational Autoencoders (DVAE) to transform time series into discrete latent representations, which are then used with Explainable AI (XAI) techniques to provide concise and structured explanations. The study shows that these methods enhance explainability by reducing redundancy and focusing on informative patterns, without compromising classification performance. A new metric, Similar Subsequence Accuracy (SSA), is introduced to validate the relevance of XAI-identified features. The results indicate that discrete latent representations maintain essential characteristics for classification while offering more interpretable explanations.
该研究通过使用VQ-VAE和DVAE将原始时间序列转换为离散的潜在表示,来解决深度学习模型在时间序列分类中的可解释性问题。研究表明,将XAI方法应用于这些压缩表示可以提供简洁且结构化的解释,同时保持分类性能。此外,还提出了一种新的度量标准——相似子序列准确性(SSA),用于验证XAI方法突出的特征是否真正代表了学习到的分类模式。研究结果表明,离散的潜在表示不仅保留了分类所需的本质特征,还为时间序列分析提供了更紧凑、可解释和计算效率更高的解释途径。