Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Authors: Li-Zhong Szu-Tu, Ting-Lin Wu, Chia-Jui Chang, He Syu, Yu-Lun Liu
First: 2025-12-24T18:59:54+00:00 · Latest: 2025-12-24T18:59:54+00:00
Comments: Project page: https://sytwu.github.io/BeyondMemo/
Abstract
We expose a significant popularity bias in state-of-the-art vision-language models (VLMs), which achieve up to 34% higher accuracy on famous buildings compared to ordinary ones, indicating a reliance on memorization over generalizable understanding. To systematically investigate this, we introduce the largest open benchmark for this task: the YearGuessr dataset, a collection of 55,546 building images with multi-modal attributes from 157 countries, annotated with continuous ordinal labels of their construction year (1001-2024), GPS data, and page-view counts as a proxy for popularity. Using this dataset, we frame the construction year prediction task as ordinal regression and introduce popularity-aware interval accuracy metrics to quantify this bias. Our resulting benchmark of 30+ models, including our YearCLIP model, confirms that VLMs excel on popular, memorized items but struggle significantly with unrecognized subjects, exposing a critical flaw in their reasoning capabilities. Project page: https://sytwu.github.io/BeyondMemo/
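The abstract does not spell out the popularity-aware interval accuracy metric, so the following is only a minimal sketch of one plausible reading: score a prediction as correct when it lands within a tolerance of the true construction year, then report that accuracy separately for high and low page-view buildings. The function names, the `view_threshold` cut-off, and the toy records are all assumptions for illustration, not values from the paper.

```python
from collections import defaultdict


def interval_accuracy(preds, targets, tolerance):
    """Fraction of predictions within +/- `tolerance` years of the true year."""
    hits = sum(abs(p - t) <= tolerance for p, t in zip(preds, targets))
    return hits / len(targets)


def accuracy_by_popularity(records, tolerance, view_threshold):
    """Score 'popular' and 'ordinary' buildings separately.

    records: iterable of (predicted_year, true_year, page_views).
    view_threshold: hypothetical page-view cut-off (not from the paper).
    """
    bins = defaultdict(lambda: ([], []))
    for pred, true, views in records:
        key = "popular" if views >= view_threshold else "ordinary"
        bins[key][0].append(pred)
        bins[key][1].append(true)
    return {k: interval_accuracy(p, t, tolerance) for k, (p, t) in bins.items()}


# Toy records, invented for illustration only.
records = [
    (1889, 1889, 500_000),
    (1630, 1632, 250_000),
    (1950, 1905, 120),
    (1875, 1880, 40),
]
scores = accuracy_by_popularity(records, tolerance=5, view_threshold=10_000)
gap = scores["popular"] - scores["ordinary"]
```

A positive `gap` is the kind of signal the benchmark is designed to surface: better dating of buildings the model has likely seen many times.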
中文标题/摘要
标题:超越记忆:多模态序数回归基准以揭示视觉语言模型中的流行度偏差
我们揭示了最先进的视觉语言模型(VLMs)中存在显著的流行度偏差,这些模型在著名建筑上的准确率比普通建筑高出34%,表明它们依赖于记忆而非可泛化的理解。为了系统地研究这一问题,我们引入了该任务上最大的开放基准数据集:YearGuessr数据集,包含来自157个国家的55,546张建筑图像,具有多模态属性,并附有其建设年份的连续序数标签(1001-2024)、GPS数据和页面浏览量作为流行度的代理。使用该数据集,我们将建筑年份预测任务框定为序数回归,并引入了流行度感知的区间准确度指标来量化这种偏差。我们构建的包含30多个模型的基准,包括我们的YearCLIP模型,证实了VLMs在流行、记忆化的项目上表现出色,但在未识别的主题上却面临重大挑战,揭示了它们推理能力中的关键缺陷。项目页面:https://sytwu.github.io/BeyondMemo/
Summary / 总结
The paper addresses the significant popularity bias in state-of-the-art vision-language models (VLMs) by introducing the YearGuessr dataset, which includes 55,546 building images with multi-modal attributes and continuous ordinal labels of construction years. The study uses ordinal regression and popularity-aware metrics to expose that VLMs perform better on famous buildings than ordinary ones, indicating a reliance on memorization rather than general understanding. The benchmark involving 30+ models, including YearCLIP, confirms this bias, highlighting the need for models to improve their reasoning capabilities for less recognized subjects.
该研究通过引入包含55,546栋建筑图像和多模态属性的YearGuessr数据集,揭示了最先进的视觉-语言模型(VLMs)中的显著流行度偏差,这些图像带有连续的按年份排序标签。研究使用序数回归和流行度感知指标来展示VLMs在著名建筑上的表现优于普通建筑,表明它们更多依赖于记忆而非泛化理解。涉及30多种模型的基准测试,包括YearCLIP,证实了这一偏差,突显了模型需要提高对不知名主题的推理能力。
Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty
Authors: Ziyu Chen, Xinbei Jiang, Peng Sun, Tao Lin
First: 2025-12-24T18:59:51+00:00 · Latest: 2025-12-24T18:59:51+00:00
Abstract
Masked Diffusion Models (MDMs) offer flexible, non-autoregressive generation, but this freedom introduces a challenge: final output quality is highly sensitive to the decoding order. We are the first to formalize this issue, attributing the variability in output quality to the cumulative predictive uncertainty along a generative path. To quantify this uncertainty, we introduce Denoising Entropy, a computable metric that serves as an internal signal for evaluating the generative process. Leveraging this metric, we propose two algorithms designed to optimize the decoding path: a post-hoc selection method and a real-time guidance strategy. Experiments demonstrate that our entropy-guided methods significantly improve generation quality, consistently boosting accuracy on challenging reasoning, planning, and code benchmarks. Our work establishes Denoising Entropy as a principled tool for understanding and controlling generation, effectively turning the uncertainty in MDMs from a liability into a key advantage for discovering high-quality solutions.
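As a rough illustration of post-hoc selection, the sketch below sums the Shannon entropy of the predictive distribution at each unmasking step and keeps the candidate path with the lowest total. This is an assumed simplification: the abstract does not give the exact definition of Denoising Entropy, and the path names and probabilities are invented.

```python
import math


def shannon_entropy(probs):
    """Entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def path_entropy(step_distributions):
    """Cumulative predictive uncertainty along one decoding path."""
    return sum(shannon_entropy(d) for d in step_distributions)


def select_path(candidates):
    """Return the name of the candidate path with the lowest cumulative entropy."""
    return min(candidates, key=lambda name: path_entropy(candidates[name]))


# Each path lists the model's distribution over the token chosen at each step.
candidates = {
    "left_to_right": [[0.5, 0.5], [0.6, 0.4], [0.9, 0.1]],
    "confident_first": [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]],
}
best = select_path(candidates)
```

A real-time guidance variant would presumably apply the same score step by step while decoding rather than after the fact.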
中文标题/摘要
标题:通过量化不确定性优化掩码扩散模型的解码路径
掩码扩散模型(MDMs)提供了灵活的非自回归生成,但这种自由度引入了一个挑战:最终输出质量高度依赖于解码顺序。我们首次正式化了这一问题,将输出质量的变异性归因于生成路径上的累积预测不确定性。为了量化这种不确定性,我们引入了去噪熵,这是一种可计算的度量标准,作为评估生成过程的内部信号。利用这一度量标准,我们提出了两种旨在优化解码路径的算法:一种事后选择方法和一种实时指导策略。实验表明,我们的熵导向方法显著提高了生成质量,在具有挑战性的推理、规划和代码基准测试中持续提升了准确性。我们的工作确立了去噪熵作为理解并控制生成过程的原理性工具,有效地将MDMs中的不确定性从一种负担转变为发现高质量解决方案的关键优势。
Summary / 总结
The research aims to address the variability in output quality of Masked Diffusion Models (MDMs) due to the sensitivity of the final output to the decoding order. The authors introduce Denoising Entropy as a metric to quantify the cumulative predictive uncertainty along a generative path and propose two algorithms: a post-hoc selection method and a real-time guidance strategy. Experiments show that these entropy-guided methods enhance generation quality, particularly on complex reasoning, planning, and code benchmarks.
研究旨在解决Masked Diffusion Models (MDMs)由于其灵活的解码顺序而导致的输出质量波动问题。为了量化这种不确定性,作者引入了去噪熵(Denoising Entropy),这是一种可计算的度量标准。他们提出了两种算法:一种是后处理选择方法,另一种是实时指导策略,都利用去噪熵来优化解码路径。实验结果表明,这些熵导向的方法显著提高了生成质量,特别是在涉及推理、规划和代码生成的复杂基准测试中表现尤为突出。
Streaming Video Instruction Tuning
Authors: Jiaer Xia, Peixian Chen, Mengdan Zhang, Xing Sun, Kaiyang Zhou
First: 2025-12-24T18:59:36+00:00 · Latest: 2025-12-24T18:59:36+00:00
Abstract
We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.
中文标题/摘要
标题:流式视频指令调优
我们提出了Streamo,一种实时流式视频LLM,作为通用交互式助手。与现有的专注于问答或字幕的在线视频模型不同,Streamo执行广泛的流式视频任务,包括实时解说、动作理解、事件字幕、时间事件定位和时间敏感的问答。为了开发这种多功能性,我们构建了Streamo-Instruct-465K,一个针对流式视频理解的大规模指令遵循数据集。该数据集涵盖了多种时间上下文和多任务监督,使Streamo能够在异构流式任务中统一训练。通过简化的工作流在指令遵循数据集上端到端训练后,Streamo展示了强大的时间推理、响应式交互和在各种流式基准测试中的广泛泛化能力。广泛的实验表明,Streamo填补了离线视频感知模型与实时多模态助手之间的差距,朝着统一、智能的视频理解在连续视频流中的目标迈出了一步。
Summary / 总结
Streamo is a real-time streaming video LLM designed as a general-purpose interactive assistant. It excels in a wide range of streaming video tasks such as real-time narration, action understanding, and event captioning. To achieve this versatility, the researchers created Streamo-Instruct-465K, a large instruction-following dataset for streaming video understanding. After training, Streamo demonstrates strong temporal reasoning and broad generalization across various streaming benchmarks, bridging the gap between offline video models and real-time multimodal assistants.
研究旨在开发一种实时流媒体视频助手Streamo,能够执行实时叙述、事件字幕等广泛任务。为此,作者创建了Streamo-Instruct-465K,一个针对流媒体视频理解的大规模指令遵循数据集。经过训练后,Streamo展示了强大的时间推理能力和在各种流媒体基准测试中的广泛泛化能力,填补了离线视频模型与实时多模态助手之间的差距。
Fast SAM2 with Text-Driven Token Pruning
Authors: Avilasha Mandal, Chaoning Zhang, Fachrina Dewi Puspitasari, Xudong Wang, Jiaquan Zhang, Caiyan Qin, Guoqing Wang, Yang Yang, Heng Tao Shen
First: 2025-12-24T18:59:05+00:00 · Latest: 2025-12-24T18:59:05+00:00
Comments: 28 pages, 9 figures
Abstract
Segment Anything Model 2 (SAM2), a vision foundation model, has significantly advanced prompt-driven video object segmentation, yet its practical deployment remains limited by the high computational and memory cost of processing dense visual tokens across time. SAM2 pipelines typically propagate all visual tokens produced by the image encoder through downstream temporal reasoning modules, regardless of their relevance to the target object, resulting in reduced scalability due to quadratic memory attention overhead. In this work, we introduce a text-guided token pruning framework that improves inference efficiency by selectively reducing token density prior to temporal propagation, without modifying the underlying segmentation architecture. Operating after visual encoding and before memory-based propagation, our method ranks tokens using a lightweight routing mechanism that integrates local visual context, semantic relevance derived from object-centric textual descriptions (either user-provided or automatically generated), and uncertainty cues that help preserve ambiguous or boundary-critical regions. By retaining only the most informative tokens for downstream processing, the proposed approach reduces redundant computation while maintaining segmentation fidelity. Extensive experiments across multiple challenging video segmentation benchmarks demonstrate that post-encoder token pruning provides a practical and effective pathway to efficient, prompt-aware video segmentation, achieving up to 42.50 percent faster inference and 37.41 percent lower GPU memory usage compared to the unpruned baseline SAM2, while preserving competitive J and F performance. These results highlight the potential of early token selection to improve the scalability of transformer-based video segmentation systems for real-time and resource-constrained applications.
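The ranking-and-retention step can be sketched as a weighted sum of the three cues followed by a top-k cut. The additive combination, the weights, the `keep_ratio` parameter, and the toy scores are assumptions made for illustration; the abstract does not describe how the routing mechanism actually combines its inputs.

```python
def prune_tokens(visual, semantic, uncertainty, keep_ratio, weights=(1.0, 1.0, 1.0)):
    """Return the indices of the tokens to keep, in their original order.

    visual, semantic, uncertainty: per-token scores (higher = more important).
    keep_ratio: fraction of tokens to retain (hypothetical parameter).
    """
    w_v, w_s, w_u = weights
    scores = [
        w_v * v + w_s * s + w_u * u
        for v, s, u in zip(visual, semantic, uncertainty)
    ]
    k = max(1, round(keep_ratio * len(scores)))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)  # preserve spatial order for downstream modules


kept = prune_tokens(
    visual=[0.1, 0.8, 0.2, 0.9, 0.1, 0.3],
    semantic=[0.0, 0.9, 0.1, 0.7, 0.0, 0.2],
    uncertainty=[0.1, 0.1, 0.9, 0.2, 0.0, 0.1],
    keep_ratio=0.5,
)
```

In this toy case token 2 survives only because of its high uncertainty score, which mirrors the stated goal of preserving ambiguous or boundary-critical regions.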
中文标题/摘要
标题:快速SAM2:基于文本驱动的标记剪枝
Segment Anything Model 2 (SAM2) 是一个视觉基础模型,在基于提示的视频对象分割方面取得了显著进展,但其实际部署受限于处理时间密集视觉标记的高计算和内存成本。SAM2 管道通常会将图像编码器生成的所有视觉标记通过下游的时间推理模块进行传递,而不管这些标记是否与目标对象相关,这导致由于基于内存的注意力开销呈二次增长而降低了可扩展性。在本文中,我们提出了一种基于文本的标记剪枝框架,通过在时间传播之前选择性地减少标记密度来提高推理效率,而不修改底层的分割架构。该方法在视觉编码之后、基于内存的传播之前运行,使用一种轻量级的路由机制对标记进行排名,该机制结合了局部视觉上下文、从以对象为中心的文本描述(用户提供的或自动生成的)中推导出的语义相关性以及有助于保留模糊或边界关键区域的不确定性提示。通过仅保留对下游处理最有用的标记,所提出的方法减少了冗余计算,同时保持了分割精度。在多个具有挑战性的视频分割基准测试中的广泛实验表明,编码器后的标记剪枝提供了一条实用且有效的途径,以实现基于提示的视频分割的高效性,与未剪枝的基线SAM2相比,其推理速度提高了42.50%,GPU内存使用量降低了37.41%,同时保持了竞争力的J和F性能。这些结果突显了早期标记选择对提高基于变压器的视频分割系统实时性和资源受限应用可扩展性的潜力。
Summary / 总结
This work introduces a text-guided token pruning framework for Segment Anything Model 2 (SAM2) to improve inference efficiency in video object segmentation. A lightweight routing mechanism ranks tokens by local visual context, semantic relevance, and uncertainty cues, and only the most informative tokens are kept before temporal propagation. Experiments show that this approach achieves up to 42.50% faster inference and 37.41% lower GPU memory usage while maintaining competitive segmentation performance on multiple benchmarks.
该研究提出了一种文本引导的标记剪枝框架,用于提升Segment Anything Model 2 (SAM2) 在视频对象分割中的推理效率。通过在时间传播前选择性地减少标记密度,该方法在不改变分割架构的情况下提高了可扩展性。实验结果显示,该方法将推理时间最多缩短了42.50%,GPU内存使用量降低了37.41%,同时在多个基准测试中保持了竞争力的分割性能。
C2LLM Technical Report: A New Frontier in Code Retrieval via Adaptive Cross-Attention Pooling
Authors: Jin Qin, Zihan Liao, Ziyin Zhang, Hang Yu, Peng Di, Rui Wang
First: 2025-12-24T18:59:01+00:00 · Latest: 2025-12-24T18:59:01+00:00
Abstract
We present C2LLM - Contrastive Code Large Language Models, a family of code embedding models in both 0.5B and 7B sizes. Building upon Qwen-2.5-Coder backbones, C2LLM adopts a Pooling by Multihead Attention (PMA) module for generating sequence embeddings from token embeddings, effectively 1) utilizing the LLM's causal representations acquired during pretraining, while also 2) being able to aggregate information from all tokens in the sequence, breaking the information bottleneck in EOS-based sequence embeddings, and 3) supporting flexible adaptation of embedding dimension, serving as an alternative to MRL. Trained on three million publicly available data points, C2LLM models set new records on MTEB-Code among models of similar sizes, with C2LLM-7B ranking 1st on the overall leaderboard.
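To make the pooling idea concrete, here is a minimal pure-Python sketch of attention pooling with a single learnable query: each head scores every token against its slice of the query, and the pooled embedding is the attention-weighted average of the tokens. It deliberately omits the learned projection matrices a real PMA module would have, and the token values and zero query are placeholders, so treat it as an illustration of the mechanism rather than C2LLM's implementation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pma_pool(tokens, query, num_heads):
    """Pool token embeddings into one vector using one query and several heads.

    Simplification (assumption): key/value/output projections are the identity.
    """
    dim = len(query)
    assert dim % num_heads == 0, "embedding dim must divide evenly across heads"
    head_dim = dim // num_heads
    pooled = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        # Scaled dot-product score of the query against every token.
        logits = [
            sum(q * t for q, t in zip(query[lo:hi], tok[lo:hi])) / math.sqrt(head_dim)
            for tok in tokens
        ]
        attn = softmax(logits)
        for d in range(lo, hi):
            pooled.append(sum(a * tok[d] for a, tok in zip(attn, tokens)))
    return pooled


tokens = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
query = [0.0, 0.0, 0.0, 0.0]  # an untrained query attends uniformly
embedding = pma_pool(tokens, query, num_heads=2)
```

With a zero query the result reduces to mean pooling; training the query is what lets the module weight tokens unevenly, in contrast to reading only the final EOS position.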
中文标题/摘要
标题:C2LLM 技术报告:通过自适应交叉注意力池化在代码检索中的新前沿
我们提出了 C2LLM - 对比代码大型语言模型,这是一个包含 0.5B 和 7B 两种规模的代码嵌入模型。基于 Qwen-2.5-Coder 主干,C2LLM 采用多头注意力模块(PMA)生成序列嵌入,有效利用了 LLM 在预训练过程中获得的因果表示,同时能够从序列中所有标记中聚合信息,打破基于 EOS 的序列嵌入的信息瓶颈,并支持嵌入维度的灵活调整,作为 MRL 的替代方案。C2LLM 模型在 MTEB-Code 上的性能超越了类似规模的其他模型,其中 C2LLM-7B 在整体排行榜上排名第一。
Summary / 总结
C2LLM is a family of code embedding models built on Qwen-2.5-Coder backbones that uses a Pooling by Multihead Attention (PMA) module to generate sequence embeddings. PMA retains the LLM's causal representations, aggregates information from all tokens instead of relying on the EOS token alone, and supports flexible embedding dimensions. Trained on three million publicly available data points, the C2LLM models set new records on MTEB-Code among models of similar size, and C2LLM-7B ranks 1st on the overall leaderboard.
C2LLM 是基于 Qwen-2.5-Coder 后端的一系列代码嵌入模型,使用了多头注意力(PMA)模块来生成序列嵌入。该方法有效地利用了 LLM 在预训练过程中获得的因果表示,并从所有标记中聚合信息,打破了基于序列结束符的嵌入的信息瓶颈。C2LLM 模型在三百万公开数据上进行训练,其在 MTEB-Code 上的表现优于类似规模的其他模型,C2LLM-7B 在整体排行榜上排名第一。
Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
Authors: Xinhe Wang, Jin Huang, Xingjian Zhang, Tianhao Wang, Jiaqi W. Ma
First: 2025-12-24T18:58:04+00:00 · Latest: 2025-12-24T18:58:04+00:00
Abstract
Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called "fluid" reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning.
To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that the perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.
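The control flow of the two-stage pipeline can be sketched as follows. The `describe` and `reason` callables stand in for model calls and are purely hypothetical, as are the toy grids; the point is only that every image is described on its own, and the reasoner never sees pixels.

```python
def two_stage_predict(train_images, train_labels, test_image, describe, reason):
    """Separate perception from reasoning.

    describe: image -> text, called once per image with no other context.
    reason: (list of (description, label), query description) -> prediction.
    """
    # Stage 1 (perception): each image is converted independently,
    # so no cross-image inductive signal can leak into the descriptions.
    train_descriptions = [describe(img) for img in train_images]
    test_description = describe(test_image)
    # Stage 2 (reasoning): rules are induced and applied from text only.
    examples = list(zip(train_descriptions, train_labels))
    return reason(examples, test_description)


# Toy stand-ins for the two model calls (assumptions, not the paper's models).
def toy_describe(grid):
    return f"{sum(map(sum, grid))} filled cells"


def toy_reason(examples, query):
    for description, label in examples:
        if description == query:
            return label
    return "unknown"


prediction = two_stage_predict(
    train_images=[[[1, 0], [0, 0]], [[1, 1], [1, 0]]],
    train_labels=["sparse", "dense"],
    test_image=[[0, 1], [1, 1]],
    describe=toy_describe,
    reason=toy_reason,
)
```

Comparing this pipeline's accuracy with a one-stage run on the same items is what lets failures be attributed to perception or to reasoning.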
中文标题/摘要
标题:您的推理基准可能无法测试推理能力:揭示抽象推理基准中的感知瓶颈
诸如抽象和推理语料库(ARC)和ARC-AGI之类的推理基准被广泛用于评估人工智能的进步,并且通常被解释为对所谓的“流体”推理能力的核心探针。尽管这些任务对人类来说看似简单,但对于前沿的视觉-语言模型(VLMs)来说仍然具有挑战性,这一差距通常归因于机器推理的缺陷。我们挑战这种解释,并假设这一差距主要源自视觉感知的限制,而不是归纳推理能力的不足。
为了验证这一假设,我们引入了一个两阶段的实验管道,明确地将感知和推理分离。在感知阶段,每张图片独立地转换为自然语言描述,而在推理阶段,模型使用这些描述来推导和应用规则。这种设计防止了跨图片归纳信号的泄露,并将推理与感知瓶颈隔离。在三个ARC风格的数据集中,Mini-ARC、ACRE和Bongard-LOGO,我们通过将两阶段管道与标准端到端一阶段评估进行比较,展示了感知能力是导致观察到的性能差距的主要因素。对VLM输出中的推理痕迹的手动检查进一步表明,大约80%的模型失败源于感知错误。综上所述,这些结果表明ARC风格的基准将感知和推理挑战混为一谈,而观察到的性能差距可能夸大了机器推理的缺陷。我们的研究结果强调了在评估机器智能进展时需要分离感知和推理的评估协议的必要性。
Summary / 总结
The study challenges the notion that performance gaps in reasoning benchmarks like ARC and ARC-AGI are due to machine reasoning deficiencies, instead suggesting that these gaps are primarily due to limitations in visual perception. A two-stage experimental pipeline was used to separate perception and reasoning, showing that perception capability is the dominant factor in the performance gap. Manual inspection revealed that about 80% of model failures are due to perception errors, indicating that these benchmarks may overstate machine reasoning deficiencies.
研究挑战了ARC和ARC-AGI等推理基准中性能差距通常归因于推理缺陷的观点,而是认为视觉感知限制是主要原因。它引入了一种两阶段实验管道来分离感知和推理,显示感知能力是导致性能差距的主要因素。手动分析表明,约80%的模型失败是由于感知错误造成的,这表明这些基准可能夸大了推理缺陷的程度。
Measuring all the noises of LLM Evals
Authors: Sida Wang
First: 2025-12-24T18:54:37+00:00 · Latest: 2025-12-24T18:54:37+00:00
Abstract
Separating signal from noise is central to experimental science. Applying well-established statistical methods effectively to LLM evals requires consideration of their unique noise characteristics. We clearly define and measure three types of noise: prediction noise from generating different answers on a given question, data noise from sampling questions, and their combined total noise following the law of total variance. To emphasize relative comparisons and gain statistical power, we propose the all-pairs paired method, which applies the paired analysis to all pairs of LLMs and measures all the noise components based on millions of question-level predictions across many evals and settings. These measurements revealed clear patterns. First, each eval exhibits a characteristic and highly predictable total noise level across all model pairs. Second, paired prediction noise typically exceeds paired data noise, which means reducing prediction noise by averaging can significantly increase statistical power. These findings enable practitioners to assess significance without custom testing and to detect much smaller effects in controlled experiments.
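The law of total variance mentioned above splits the variance of a score into a within-question part and a between-question part. The sketch below checks that identity numerically for a single model on invented 0/1 scores; it does not reproduce the paper's paired, all-pairs estimators, whose details the abstract does not give.

```python
from statistics import mean, pvariance


def noise_decomposition(scores_by_question):
    """Split score variance into prediction and data components.

    scores_by_question: one list of repeated-sample scores per question.
    Returns (prediction_var, data_var, total_var).
    """
    question_means = [mean(s) for s in scores_by_question]
    # Prediction noise: average variance across resampled answers to one question.
    prediction_var = mean([pvariance(s) for s in scores_by_question])
    # Data noise: variance of per-question mean scores across questions.
    data_var = pvariance(question_means)
    return prediction_var, data_var, prediction_var + data_var


# Four questions, four sampled answers each (toy data).
scores = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]
prediction_var, data_var, total_var = noise_decomposition(scores)
pooled_var = pvariance([x for s in scores for x in s])
```

Because every question has the same number of samples, `total_var` matches the variance of the pooled scores, which is the identity being illustrated. Averaging several samples per question shrinks only the prediction component.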
中文标题/摘要
标题:测量所有大模型评估中的噪音
从信号中分离噪音是实验科学的核心。将成熟的统计方法有效地应用于大模型评估需要考虑其独特的噪音特征。我们明确定义并测量了三种类型的噪音:在给定问题上生成不同答案的预测噪音、从采样问题中产生的数据噪音以及它们的总噪音,后者遵循总体方差定律。为了强调相对比较并获得统计功效,我们提出了所有成对配对方法,该方法将配对分析应用于所有大模型对,并基于数百万个问题级别的预测结果,在许多评估和设置中测量所有噪音成分。这些测量揭示了清晰的模式。首先,每个评估在所有模型对中表现出一种高度可预测的总噪音水平。其次,成对的预测噪音通常超过成对的数据噪音,这意味着通过平均来减少预测噪音可以显著提高统计功效。这些发现使实践者能够在无需自定义测试的情况下评估显著性,并在受控实验中检测到更小的效果。
Summary / 总结
The study aims to measure and separate different types of noise in evaluations of large language models (LLMs) to improve the reliability of experimental results. The authors define and quantify three types of noise: prediction noise, data noise, and their combined total noise. They propose the all-pairs paired method, which analyzes all pairs of LLMs using millions of question-level predictions to measure these noise components. Key findings include the predictable total noise level across all model pairs and the higher paired prediction noise compared to data noise, suggesting that reducing prediction noise can enhance statistical power in experiments.
研究旨在通过分离不同类型的噪声来提高大型语言模型(LLM)评估结果的可靠性。作者定义并量化了三种噪声类型:预测噪声、数据噪声及其总噪声。他们提出了所有配对方法,该方法使用数百万个问题级别的预测结果对各种评估中的所有模型对进行分析。主要发现包括所有模型对的一致性总噪声水平以及预测噪声在配对噪声中占主导地位,这表明通过平均减少预测噪声可以增强实验中的统计功效。
View-aware Cross-modal Distillation for Multi-view Action Recognition
Authors: Trung Thanh Nguyen, Yasutomo Kawanishi, Vijay John, Takahiro Komamizu, Ichiro Ide
Venue: WACV
First: 2025-11-17T02:00:22+00:00 · Latest: 2025-12-24T18:29:21+00:00
Comments: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026
Abstract
The widespread use of multi-sensor systems has increased research in multi-view action recognition. While existing approaches in multi-view setups with fully overlapping sensors benefit from consistent view coverage, partially overlapping settings where actions are visible in only a subset of views remain underexplored. This challenge becomes more severe in real-world scenarios, as many systems provide only limited input modalities and rely on sequence-level annotations instead of dense frame-level labels. In this study, we propose View-aware Cross-modal Knowledge Distillation (ViCoKD), a framework that distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. ViCoKD employs a cross-modal adapter with cross-modal attention, allowing the student to exploit multi-modal correlations while operating with incomplete modalities. Moreover, we propose a View-aware Consistency module to address view misalignment, where the same action may appear differently or only partially across viewpoints. It enforces prediction alignment when the action is co-visible across views, guided by human-detection masks and confidence-weighted Jensen-Shannon divergence between their predicted class distributions. Experiments on the real-world MultiSensor-Home dataset show that ViCoKD consistently outperforms competitive distillation methods across multiple backbones and environments, delivering significant gains and surpassing the teacher model under limited conditions.
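A consistency term of the kind described can be sketched as a Jensen-Shannon divergence between two views' class distributions, applied only when the action is visible in both views and scaled by a confidence weight. Using the product of the two maximum class probabilities as that weight is an assumption for illustration; the abstract does not specify the weighting or how the human-detection masks are turned into visibility flags.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def view_consistency_loss(p, q, visible_p, visible_q):
    """Penalize disagreement only when the action is co-visible in both views."""
    if not (visible_p and visible_q):
        return 0.0
    confidence = max(p) * max(q)  # assumed weighting, not from the paper
    return confidence * js_divergence(p, q)


agree = view_consistency_loss([0.7, 0.3], [0.7, 0.3], True, True)
disagree = view_consistency_loss([1.0, 0.0], [0.0, 1.0], True, True)
occluded = view_consistency_loss([1.0, 0.0], [0.0, 1.0], True, False)
```

The visibility gate is what keeps a view that cannot see the action from being pushed toward the prediction of a view that can.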
中文标题/摘要
标题:基于视图感知的跨模态蒸馏方法在多视图动作识别中的应用
多传感器系统的广泛应用增加了多视图动作识别的研究。虽然现有方法在完全重叠传感器的多视图设置中受益于一致的视图覆盖,但在部分重叠设置中,动作仅在部分视图中可见,这些设置仍然未被充分探索。在现实场景中,这一挑战更为严重,因为许多系统只能提供有限的输入模态,并依赖于序列级注释而不是密集帧级标签。在本研究中,我们提出了一种基于视图感知的跨模态知识蒸馏(ViCoKD)框架,该框架从全监督多模态教师中蒸馏知识到模态和注释受限的学生。ViCoKD 使用跨模态适配器和跨模态注意力,使学生能够利用多模态相关性,同时在不完整模态下操作。此外,我们提出了一种基于视图感知的一致性模块来解决视图对齐问题,其中同一动作在不同视角下可能表现不同或仅部分可见。该模块通过人类检测掩码和预测类别分布加权的Jensen-Shannon距离强制执行动作在不同视图中同时可见时的预测对齐。在实际的多传感器家庭数据集上的实验表明,ViCoKD 在多个骨干网络和环境中始终优于竞争性蒸馏方法,即使在条件受限的情况下也取得了显著的性能提升,超越了教师模型。
Summary / 总结
This study addresses the challenge of multi-view action recognition in partially overlapping sensor setups, where actions are visible in only a subset of views. The proposed ViCoKD framework distills knowledge from a fully supervised multi-modal teacher to a modality- and annotation-limited student. It uses a cross-modal adapter with attention to exploit multi-modal correlations and a View-aware Consistency module to handle view misalignment. Experiments on the MultiSensor-Home dataset demonstrate that ViCoKD outperforms other distillation methods and surpasses the teacher model under limited conditions.
该研究解决了部分重叠传感器设置下的多视角动作识别挑战,其中动作仅在部分视角中可见。提出的ViCoKD框架通过交叉模态适配器和视图一致模块,从全监督多模态教师中提取知识到模态和标注受限的学生模型。实验结果表明,ViCoKD在多种骨干网络和环境中均优于其他蒸馏方法,并在受限条件下超越了教师模型。
Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks
Authors: Ali Merali
First: 2025-12-24T18:24:29+00:00 · Latest: 2025-12-24T18:24:29+00:00
Abstract
This paper derives "Scaling Laws for Economic Impacts", empirical relationships between the training compute of Large Language Models (LLMs) and professional productivity. In a preregistered experiment, over 500 consultants, data analysts, and managers completed professional tasks using one of 13 LLMs. We find that each year of AI model progress reduced task time by 8%, with 56% of gains driven by increased compute and 44% by algorithmic progress. However, productivity gains were significantly larger for non-agentic analytical tasks compared to agentic workflows requiring tool use. These findings suggest continued model scaling could boost U.S. productivity by approximately 20% over the next decade.
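The headline numbers can be unpacked with simple arithmetic: an 8% annual reduction in task time, of which 56% is attributed to compute and 44% to algorithms, corresponds to about 4.5 and 3.5 percentage points per year respectively. The compounding helper below assumes the 8% applies multiplicatively year over year, which is an assumption of this sketch; it does not attempt to reproduce the roughly 20% productivity projection, whose derivation the abstract does not give.

```python
ANNUAL_TIME_REDUCTION = 0.08  # from the abstract
COMPUTE_SHARE = 0.56          # from the abstract
ALGORITHM_SHARE = 0.44        # from the abstract

# Percentage points of the annual reduction attributed to each driver.
compute_points = ANNUAL_TIME_REDUCTION * COMPUTE_SHARE
algorithm_points = ANNUAL_TIME_REDUCTION * ALGORITHM_SHARE


def remaining_time_fraction(years, annual_reduction=ANNUAL_TIME_REDUCTION):
    """Fraction of the original task time left after `years` of progress.

    Assumes the reduction compounds multiplicatively (my assumption).
    """
    return (1.0 - annual_reduction) ** years


after_five_years = remaining_time_fraction(5)
```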
中文标题/摘要
标题:经济生产力的标度律:大规模语言模型辅助咨询、数据分析和管理任务的实验证据
本文推导出“经济影响的标度律”——大规模语言模型(LLMs)的训练计算量与专业生产力之间的经验关系。在一项预先注册的实验中,超过500名咨询师、数据分析师和管理者使用13种不同的LLM完成专业任务。我们发现,每一年的AI模型进步减少了8%的任务时间,其中56%的收益来自计算量的增加,44%来自算法的进步。然而,非代理性分析任务的生产力增长显著大于需要工具使用的代理性工作流程。这些发现表明,持续的模型扩展在未来十年内可能使美国的生产力提高约20%。
Summary / 总结
This paper investigates the relationship between the training compute of Large Language Models (LLMs) and professional productivity. Through a preregistered experiment involving over 500 consultants, data analysts, and managers, the study finds that each year of AI model progress reduces task time by 8%, with 56% of the gains attributed to increased compute and 44% to algorithmic progress. Notably, productivity gains were more significant for non-agentic analytical tasks compared to agentic workflows requiring tool use, suggesting potential for substantial productivity improvements in the U.S. economy over the next decade.
该研究通过涉及超过500名咨询师、数据分析师和管理人员的预先注册实验,探讨了大型语言模型(LLMs)对专业生产力的影响。研究发现,每一年的AI模型进步可以减少8%的任务时间,其中56%的增益来自计算能力的提升,44%来自算法的进步。值得注意的是,非代理型分析任务的生产力增益比需要工具使用的代理型工作流程更大,这表明在未来十年中,美国经济可能实现约20%的生产力提升。
Learning to Solve PDEs on Neural Shape Representations
Authors: Lilian Welschinger, Yilin Liu, Zican Wang, Niloy Mitra
First: 2025-12-24T18:14:02+00:00 · Latest: 2025-12-24T18:14:02+00:00
Comments: Article webpage link: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/
Abstract
Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, mesh-free formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat equation and Poisson solve on sphere) and real neural assets across different representations, our method slightly outperforms CPM while remaining reasonably close to FEM, and, to our knowledge, delivers the first end-to-end pipeline that solves surface PDEs on both neural and classical surface representations. Code will be released on acceptance.
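For intuition about what a local update operator does, the sketch below applies a classical, hand-written explicit heat-equation step on a small neighborhood graph: each point moves toward its neighbors' values. This is a stand-in for illustration only. In the paper the update is learned and conditioned on neural shape attributes, neither of which is modeled here, and the ring of eight points and the step size are invented.

```python
def heat_step(values, neighbors, dt):
    """One explicit diffusion step: u_i += dt * sum_j (u_j - u_i)."""
    return [
        u + dt * sum(values[j] - u for j in neighbors[i])
        for i, u in enumerate(values)
    ]


# Eight sample points arranged on a closed ring (toy discretization).
n = 8
neighbors = [[(i - 1) % n, (i + 1) % n] for i in range(n)]

# Start with all the heat at one point and let it spread.
u = [0.0] * n
u[0] = 1.0
for _ in range(20):
    u = heat_step(u, neighbors, dt=0.25)
```

Because the update uses only each point's immediate neighborhood, it needs no global mesh, which is the property a learned local operator can exploit on neural surface representations.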
中文标题/摘要
标题:在神经形状表示上学习求解偏微分方程
在形状上求解偏微分方程(PDEs)是许多形状分析和工程任务的基础;然而,现有的PDE求解器通常基于多边形/三角形网格,而现代3D资产越来越多地以神经表示形式存在。这种不匹配使得没有合适的方法可以直接在神经域内求解曲面PDEs,迫使进行显式的网格提取或逐实例残差训练,阻碍了端到端的工作流程。我们提出了一种新的无网格公式,该公式学习一个基于神经(局部)形状属性的局部更新算子,使得可以直接在数据所在的曲面上求解PDEs。该算子自然地与常见的神经曲面表示相结合,只需在一个代表性形状上进行一次训练,即可在形状和拓扑变化中泛化,从而在无需显式网格化或逐实例优化的情况下实现准确且快速的推理,同时保持可微性。在分析基准(球体上的热方程和泊松求解)和不同表示的真实神经资产中,我们的方法在某些方面略优于CPM,同时保持与FEM相当的性能,并且据我们所知,首次提供了在神经和经典曲面表示上求解曲面PDEs的端到端管道。代码将在接受后发布。
Summary / 总结
The research addresses the mismatch between partial differential equation (PDE) solvers, which operate on polygonal meshes, and modern 3D assets, which are increasingly stored as neural representations. The method introduces a mesh-free formulation that learns a local update operator conditioned on neural shape attributes, allowing surface PDEs to be solved directly on neural representations. On analytic benchmarks and real neural assets, the method slightly outperforms CPM while remaining reasonably close to the Finite Element Method (FEM), and, to the authors' knowledge, it is the first end-to-end pipeline for solving surface PDEs on both neural and classical surface representations.
论文解决了在神经形状表示上求解偏微分方程(PDEs)的问题,这些表示在3D资产中越来越常用。它提出了一种无网格公式,通过条件化于神经形状属性的学习局部更新操作符,使PDEs可以直接在神经域中求解。该方法与神经表面表示集成,只需要单实例训练,并且能够跨形状和拓扑变化进行泛化,从而实现准确且快速的推理,无需显式网格化或实例优化。实验结果显示,该方法在分析基准和真实神经资产上与有限元方法(FEM)相当,并且优于最接近的竞争者。
When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
Authors: Michael H. Coen
First: 2025-12-18T21:29:43+00:00 · Latest: 2025-12-24T18:05:57+00:00
Comments: 32 pages, 4 figures. Evaluation and methodology study on dialogue topic segmentation
Abstract
Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of work, evaluation practice remains dominated by strict boundary matching and F1-based metrics. Modern large language model (LLM) based conversational systems increasingly rely on segmentation to manage conversation history beyond fixed context windows. In such systems, unstructured context accumulation degrades efficiency and coherence.
This paper introduces an evaluation framework that reports boundary density and segment alignment diagnostics (purity and coverage) alongside window-tolerant F1 (W-F1). By separating boundary scoring from boundary selection, we evaluate segmentation quality across density regimes rather than at a single operating point. Cross-dataset evaluation shows that reported performance differences often reflect annotation granularity mismatch rather than boundary placement quality alone.
We evaluate structurally distinct segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Boundary-based metrics are strongly coupled to boundary density: threshold sweeps produce larger W-F1 changes than switching between methods. These findings support viewing topic segmentation as a granularity selection problem rather than prediction of a single correct boundary set. This motivates separating boundary scoring from boundary selection for analyzing and tuning segmentation under varying annotation granularities.
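The coupling between boundary density and window-tolerant F1 can be seen in a small sketch. The greedy one-to-one matching within a fixed window is an assumed definition of W-F1 (the abstract does not give the exact formula), and the boundary positions are invented.

```python
def window_f1(predicted, reference, window):
    """F1 where a predicted boundary counts if within +/- `window` turns of an
    unmatched reference boundary (greedy one-to-one matching, assumed)."""
    unmatched = sorted(reference)
    true_pos = 0
    for b in sorted(predicted):
        match = next((r for r in unmatched if abs(r - b) <= window), None)
        if match is not None:
            unmatched.remove(match)
            true_pos += 1
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def boundary_density(boundaries, num_turns):
    """Boundaries per dialogue turn."""
    return len(boundaries) / num_turns


reference = [5, 12, 20]                # annotated topic boundaries (toy)
coarse = [6, 19]                       # low-density segmentation
dense = [2, 5, 8, 12, 15, 20, 23]      # high-density segmentation
f1_coarse = window_f1(coarse, reference, window=1)
f1_dense = window_f1(dense, reference, window=1)
density_coarse = boundary_density(coarse, num_turns=25)
density_dense = boundary_density(dense, num_turns=25)
```

The dense output finds every reference boundary yet scores lower than the coarse one, purely because it proposes more boundaries than the annotators did. Reporting density next to W-F1 makes that granularity mismatch visible.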
中文标题/摘要
标题:当F1失效时:面向粒度的对话主题分割评估
对话主题分割支持摘要、检索、记忆管理和对话连续性。尽管已有数十年的工作,评估实践仍主要依赖严格的边界匹配和F1度量。现代基于大型语言模型(LLM)的对话系统越来越多地依赖分割来管理超出固定上下文窗口的对话历史。在这种系统中,非结构化上下文积累会降低效率和连贯性。
本文引入了一种评估框架,该框架报告边界密度和分割对齐诊断(纯度和覆盖率)以及窗口容忍的F1(W-F1)。通过将边界评分与边界选择分离,我们评估了在不同密度范围内的分割质量,而不仅仅是单一的操作点。跨数据集评估表明,报告的性能差异往往反映了注释粒度不匹配,而不仅仅是边界放置质量。
我们评估了八个涵盖任务导向、开放领域、会议风格和合成交互的对话数据集中的结构上不同的分割策略。基于边界的度量与边界密度紧密相关:阈值扫描产生的W-F1变化大于方法之间的切换。这些发现支持将主题分割视为粒度选择问题,而不是预测单一正确的边界集的观点。这促使我们将边界评分与边界选择分离,以便在不同注释粒度下分析和调整分割。
Summary / 总结
This paper addresses the limitations of F1-based metrics in evaluating dialogue topic segmentation, introducing a new evaluation framework that includes boundary density, segment alignment diagnostics, and window-tolerant F1 (W-F1). The study finds that performance differences often result from annotation granularity mismatches rather than boundary placement quality. Across eight dialogue datasets, the research shows that boundary-based metrics are highly dependent on boundary density, suggesting that topic segmentation should be viewed as a granularity selection problem rather than a single correct boundary prediction task.
该论文针对基于F1的评价指标在现代大型语言模型中对话题分割评估的局限性,特别是在不结构化上下文累积影响效率和连贯性的情况下。它引入了一个评价框架,包括边界密度、段落对齐诊断和窗口容忍F1(W-F1),以在不同密度区间评估分割质量。研究结果显示,基于边界的度量标准高度依赖于边界密度,表明话题分割应被视为粒度选择问题而非单一正确边界集的预测任务。
Intrinsic Benefits of Categorical Distributional Loss: Uncertainty-aware Regularized Exploration in Reinforcement Learning
Authors: Ke Sun, Yingnan Zhao, Enze Shi, Yafei Wang, Xiaodong Yan, Bei Jiang, Linglong Kong
Venue: NeurIPS 2025
First: 2021-10-07T03:14:46+00:00 · Latest: 2025-12-24T17:53:45+00:00
Comments: NeurIPS 2025; Previous Version in ICML Workshop: Exploration in AI Today (EXAIT) 2025
Abstract
The remarkable empirical performance of distributional reinforcement learning (RL) has garnered increasing attention to understanding its theoretical advantages over classical RL. By decomposing the categorical distributional loss commonly employed in distributional RL, we find that the potential superiority of distributional RL can be attributed to a derived distribution-matching entropy regularization. This less-studied entropy regularization aims to capture additional knowledge of return distribution beyond only its expectation, contributing to an augmented reward signal in policy optimization. In contrast to the vanilla entropy regularization in MaxEnt RL, which explicitly encourages exploration by promoting diverse actions, the novel entropy regularization derived from categorical distributional loss implicitly updates policies to align the learned policy with (estimated) environmental uncertainty. Finally, extensive experiments verify the significance of this uncertainty-aware regularization from distributional RL on the empirical benefits over classical RL. Our study offers an innovative exploration perspective to explain the intrinsic benefits of distributional learning in RL.
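A small example shows why a categorical distributional loss carries information beyond the expected return. Below, three return distributions over the same atoms share an expectation of zero, yet the cross-entropy against the target differs sharply between them. This only illustrates the premise; it does not reproduce the paper's decomposition into a distribution-matching entropy regularizer, and the atoms and probabilities are invented.

```python
import math

ATOMS = [-1.0, 0.0, 1.0]  # fixed return support (toy values)


def expected_return(probs):
    return sum(p * z for p, z in zip(probs, ATOMS))


def categorical_loss(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted categorical distribution."""
    return -sum(t * math.log(max(q, eps)) for t, q in zip(target, predicted))


target = [0.5, 0.0, 0.5]     # bimodal returns, mean 0
narrow = [0.05, 0.9, 0.05]   # mean 0, but wrong shape
wide = [0.45, 0.1, 0.45]     # mean 0, close to the target shape

loss_narrow = categorical_loss(target, narrow)
loss_wide = categorical_loss(target, wide)
```

An expectation-only objective cannot tell `narrow` from `wide`, while the categorical loss strongly prefers `wide`.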
中文标题/摘要
标题:分类分布损失的内在优势:分布感知正则化探索在强化学习中的应用
分布式强化学习(RL)的卓越实证性能引起了对其与经典RL理论优势的越来越多关注。通过分解在分布式RL中常用的分类分布损失,我们发现分布式RL潜在优势可归因于一种衍生的分布匹配熵正则化。这种较少研究的熵正则化旨在捕捉回报分布的额外知识,而不仅仅是其期望值,从而为策略优化提供增强的奖励信号。与MaxEnt RL中的基本熵正则化相比,后者通过促进多样化的动作显式地鼓励探索,而从分类分布损失中推导出的新型熵正则化则隐式地更新策略,使其与(估计的)环境不确定性相一致。最后,广泛的实验验证了这种分布感知正则化在实证上对经典RL的优越性。我们的研究提供了一种创新的探索视角,以解释分布式学习在RL中的内在优势。
Summary / 总结
This study investigates the theoretical advantages of distributional reinforcement learning (RL) by decomposing the categorical distributional loss. It finds that the distribution-matching entropy regularization derived from this loss can capture additional knowledge of return distribution beyond its expectation, enhancing the reward signal in policy optimization. Unlike the explicit exploration encouragement in MaxEnt RL, this regularization implicitly aligns the learned policy with environmental uncertainty, leading to improved empirical performance over classical RL. Extensive experiments confirm the significance of this uncertainty-aware regularization in distributional RL.
研究通过分解分类分布损失,揭示了一种分布匹配熵正则化,它能够捕捉回报分布中期望值之外的信息。这种正则化通过使学习到的策略与环境不确定性对齐来隐式地促进探索,不同于MaxEnt RL中通过促进多样化动作来显式地鼓励探索。大量实验验证了这种不确定性感知正则化对分布强化学习相对于经典RL的实证优势的重要性。
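To make the phrase "knowledge of return distribution beyond only its expectation" concrete, the following is a minimal sketch, not the paper's decomposition. It assumes a hypothetical three-atom categorical return distribution and compares two distributions that share the same mean: a loss on the expectation cannot tell them apart, while a categorical cross-entropy loss can.

```python
import math

def expectation(atoms, probs):
    """Mean return of a categorical distribution over fixed atoms."""
    return sum(z * p for z, p in zip(atoms, probs))

def entropy(probs):
    """Shannon entropy in nats; zero-probability atoms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cross_entropy(target, pred, eps=1e-12):
    """Categorical cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, pred))

# Hypothetical support and distributions, chosen only for illustration.
atoms = [-1.0, 0.0, 1.0]
narrow = [0.0, 1.0, 0.0]   # deterministic return of 0
wide = [0.5, 0.0, 0.5]     # same mean, but uncertain

# A squared error on the expectations is zero, so it carries no signal here.
mean_gap = (expectation(atoms, narrow) - expectation(atoms, wide)) ** 2

# The categorical loss does separate the two distributions.
loss_mismatch = cross_entropy(wide, narrow)
loss_match = cross_entropy(wide, wide)
```

Under these assumptions `mean_gap` is zero while `loss_mismatch` is much larger than `loss_match`, which is the kind of extra signal the abstract attributes to the distributional loss.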
AndroidLens: Long-latency Evaluation with Nested Sub-targets for Android GUI Agents
Authors: Yue Cao, Yingyao Wang, Pi Bu, Jingxuan Xing, Wei Jiang, Zekun Zhu, Junpeng Ma, Sashuai Zhou, Tong Lu, Jun Song, Yu Cheng, Yuning Jiang, Bo Zheng
First: 2025-12-24T17:40:42+00:00 · Latest: 2025-12-24T17:40:42+00:00
Comments: 23 pages, 13 figures, 8 tables
Abstract
Graphical user interface (GUI) agents can substantially improve productivity by automating frequently executed long-latency tasks on mobile devices. However, existing evaluation benchmarks are still constrained to limited applications, simple tasks, and coarse-grained metrics. To address this, we introduce AndroidLens, a challenging evaluation framework for mobile GUI agents, comprising 571 long-latency tasks in both Chinese and English environments, each requiring an average of more than 26 steps to complete. The framework features: (1) tasks derived from real-world user scenarios across 38 domains, covering complex types such as multi-constraint, multi-goal, and domain-specific tasks; (2) static evaluation that preserves real-world anomalies and allows multiple valid paths to reduce bias; and (3) dynamic evaluation that employs a milestone-based scheme for fine-grained progress measurement via Average Task Progress (ATP). Our evaluation indicates that even the best models reach only a 12.7% task success rate and 50.47% ATP. We also underscore key challenges in real-world environments, including environmental anomalies, adaptive exploration, and long-term memory retention.
中文标题/摘要
标题:AndroidLens:嵌套子目标下的长延迟评估方法用于Android GUI代理
图形用户界面(GUI)代理可以通过自动化移动设备上频繁执行的长延迟任务来显著提高生产力。然而,现有的评估基准仍然局限于有限的应用程序、简单的任务和粗粒度的指标。为了解决这个问题,我们引入了AndroidLens,这是一个针对移动GUI代理的具有挑战性的评估框架,包含571个长延迟任务,涵盖中文和英文环境,每个任务平均需要超过26步才能完成。该框架的特点是:(1) 来自38个领域的真实世界用户场景的任务,涵盖多种复杂类型,如多约束、多目标和领域特定任务;(2) 静态评估保留了真实世界的异常情况,并允许多条有效路径以减少偏差;(3) 动态评估采用基于里程碑的方案,通过平均任务进度(ATP)进行细粒度的进度测量。我们的评估表明,即使是最优秀的模型也只能达到12.7%的任务成功率和50.47%的ATP。我们还强调了真实世界环境中的一些关键挑战,包括环境异常、自适应探索和长期记忆保持。
Summary / 总结
The research introduces AndroidLens, a framework for evaluating mobile GUI agents on long-latency tasks. It comprises 571 tasks across 38 domains, each requiring more than 26 steps on average. The framework combines static and dynamic evaluation to measure task success and fine-grained progress, and shows that even the best models reach only a 12.7% task success rate and 50.47% Average Task Progress (ATP). Environmental anomalies, adaptive exploration, and long-term memory retention are highlighted as key challenges in real-world environments.
研究引入了AndroidLens框架,包含571个跨38个领域的长延迟任务,每个任务平均需要超过26步。框架包括静态和动态评估来衡量任务成功率和进度。关键发现表明,即使是最优模型也只能达到12.7%的任务成功率和50.47%的平均任务进度(ATP)。面临的挑战包括环境异常、自适应探索和长期记忆保持。
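The abstract does not spell out how Average Task Progress is computed, so the sketch below rests on one assumption: a task's progress is the fraction of its milestones that the agent reached, and ATP is the mean of that fraction over tasks. The function names and the episode data are hypothetical.

```python
from typing import List, Tuple

def task_progress(milestones_reached: int, milestones_total: int) -> float:
    """Fraction of a task's milestones completed, clamped to [0, 1]."""
    if milestones_total <= 0:
        raise ValueError("a task needs at least one milestone")
    reached = max(0, min(milestones_reached, milestones_total))
    return reached / milestones_total

def average_task_progress(results: List[Tuple[int, int]]) -> float:
    """Mean per-task progress over (reached, total) pairs."""
    return sum(task_progress(r, t) for r, t in results) / len(results)

def success_rate(results: List[Tuple[int, int]]) -> float:
    """Share of tasks in which every milestone was reached."""
    return sum(1 for r, t in results if r >= t) / len(results)

# Hypothetical episodes as (milestones reached, milestones total).
episodes = [(4, 4), (2, 4), (1, 5), (0, 3)]
atp = average_task_progress(episodes)  # (1.0 + 0.5 + 0.2 + 0.0) / 4
sr = success_rate(episodes)            # only the first episode succeeds
```

A milestone-based score of this kind separates an agent that fails near the end of a long task from one that fails at the first step, which a binary success rate cannot do.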
Transcriptome-Conditioned Personalized De Novo Drug Generation for AML Using Metaheuristic Assembly and Target-Driven Filtering
Authors: Abdullah G. Elafifi, Basma Mamdouh, Mariam Hanafy, Muhammed Alaa Eldin, Yosef Khaled, Nesma Mohamed El-Gelany, Tarek H. M. Abou-El-Enien
First: 2025-12-24T17:39:37+00:00 · Latest: 2025-12-24T17:39:37+00:00
Abstract
Acute Myeloid Leukemia (AML) remains a clinical challenge due to its extreme molecular heterogeneity and high relapse rates. While precision medicine has introduced mutation-specific therapies, many patients still lack effective, personalized options. This paper presents a novel, end-to-end computational framework that bridges the gap between patient-specific transcriptomics and de novo drug discovery. By analyzing bulk RNA sequencing data from the TCGA-LAML cohort, the study utilized Weighted Gene Co-expression Network Analysis (WGCNA) to prioritize 20 high-value biomarkers, including metabolic transporters like HK3 and immune-modulatory receptors such as SIGLEC9. The physical structures of these targets were modeled using AlphaFold3, and druggable hotspots were quantitatively mapped via the DOGSiteScorer engine. The study then developed a novel, reaction-first evolutionary metaheuristic algorithm, together with multi-objective optimization programming, that assembles novel ligands from fragment libraries, guided by spatial alignment to these identified hotspots. The generative model produced structurally unique chemical entities with a strong bias toward drug-like space, as evidenced by QED scores peaking between 0.5 and 0.7. Validation through ADMET profiling and SwissDock molecular docking identified high-confidence candidates, such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker. These results demonstrate that integrating systems biology with metaheuristic molecular assembly can produce pharmacologically viable, patient-tailored leads, offering a scalable blueprint for precision oncology in AML and beyond.
中文标题/摘要
标题:基于转录组的个性化从头药物生成用于AML:使用元启发式组装和靶向筛选
急性髓系白血病(AML)由于其极端的分子异质性和高复发率,仍然是临床挑战。尽管精准医疗引入了针对突变的治疗方法,但许多患者仍然缺乏有效的个性化选择。本文提出了一种全新的端到端计算框架,将患者特异性转录组学与从头药物发现联系起来。通过分析TCGA-LAML队列的大规模RNA测序数据,研究利用加权基因共表达网络分析(WGCNA)优先筛选出20个高价值生物标志物,包括代谢转运蛋白如HK3和免疫调节受体如SIGLEC9。这些靶点的物理结构使用AlphaFold3建模,并通过DOGSiteScorer引擎定量映射可成药热点。开发了一种新的反应优先进化元启发式算法以及多目标优化程序,从片段库中组装新型配体,由这些识别的热点的空间对齐引导。生成模型产生了结构上独特的化学实体,药效团性质明显偏向药物样空间,QED评分峰值在0.5到0.7之间。通过ADMET表型分析和SwissDock分子对接验证,识别出高置信度候选物,如配体L1,其与A08A96生物标志物的结合自由能为-6.571 kcal/mol。这些结果表明,将系统生物学与元启发式分子组装相结合可以产生药理学上可行的、患者特异性的先导化合物,为AML和其他癌症的精准肿瘤学提供可扩展的蓝图
Summary / 总结
This study addresses the challenge of personalized drug discovery for Acute Myeloid Leukemia (AML) by integrating patient-specific transcriptomics with de novo drug generation. The framework uses WGCNA to prioritize 20 biomarkers, models their structures with AlphaFold3, maps druggable hotspots with DOGSiteScorer, and assembles novel ligands from fragment libraries using a metaheuristic algorithm. Key findings include the generation of drug-like chemical entities, with QED scores peaking between 0.5 and 0.7, and the identification of high-confidence candidates such as Ligand L1, which achieved a binding free energy of -6.571 kcal/mol against the A08A96 biomarker.
该研究通过将患者特异性转录组学与从头药物发现相结合,开发了一个端到端的计算框架来应对急性髓系白血病(AML)的临床挑战。该框架使用加权基因共表达网络分析(WGCNA)识别20个高价值生物标志物,使用AlphaFold3建模它们的物理结构,并使用DOGSiteScorer绘制可成药热点。一种新颖的元启发式算法从片段库中组装新型配体,通过空间对齐指导这些热点。生成的化学实体显示出强烈的药物样特性,通过ADMET表型分析和分子对接验证,识别出具有强结合亲和力的高信心候选药物,展示了个性化药物生成在AML中的潜在应用。
Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks
Authors: Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane
Venue: NeurIPS 2025
First: 2025-06-06T19:29:13+00:00 · Latest: 2025-12-24T17:26:35+00:00
Comments: 40 pages, 8 figures, NeurIPS 2025
Abstract
What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each iteration, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across several commonly studied architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.
中文标题/摘要
标题:交替梯度流:两层神经网络特征学习的理论
神经网络学习哪些特征以及如何学习仍然是一个开放的问题。本文引入了交替梯度流(AGF)算法框架,描述了从小型初始化训练的两层网络中特征学习的动力学。先前的研究表明,在这种情况下,梯度流表现出阶梯状的损失曲线,交替在神经元缓慢对齐到有用方向的平台期和神经元迅速增长的急剧下降期。AGF 将这种行为近似为交替的两步过程:在休眠神经元上最大化一个效用函数,在活跃神经元上最小化一个成本函数。AGF 从所有神经元都处于休眠状态开始。在每次迭代中,一个休眠的神经元激活,触发特征的获取和损失的下降。AGF 量化了这些下降的顺序、时间和幅度,与多个常用架构的实验结果相符。我们证明了 AGF 在完全连接的线性网络和仅注意力线性变换器中统一并扩展了现有的鞍点到鞍点分析,其中学习的特征分别是奇异模式和主成分。在对角线线性网络中,我们证明 AGF 在初始化趋于零的极限下收敛到梯度流。将 AGF 应用于训练以执行模块加法的二次网络,我们首次完整地描述了训练动力学,揭示了网络按系数大小递减顺序学习傅里叶特征。总体而言,AGF 为理解神经网络中的特征学习提供了一个有希望的步骤。
Summary / 总结
The paper introduces Alternating Gradient Flows (AGF), an algorithmic framework that describes feature learning dynamics in two-layer neural networks trained from small initialization. AGF models the process as an alternating two-step procedure: maximizing a utility function over dormant neurons, one of which then activates and acquires a feature, and minimizing a cost function over active neurons. The framework matches the order, timing, and magnitude of loss drops observed in experiments across several architectures, and it unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers. AGF also characterizes the training dynamics of quadratic networks trained for modular addition, revealing that networks learn Fourier features in decreasing order of coefficient magnitude.
论文提出了交替梯度流(AGF)算法框架,用于描述从小初始化训练的两层神经网络中的特征学习动态。AGF 将梯度流的交替行为近似为两步过程:首先最大化休眠神经元的效用函数,然后最小化活跃神经元的成本函数。关键实验发现表明,AGF 能够匹配各种架构中观察到的损失下降的顺序、时间和幅度,统一并扩展了线性网络和线性 Transformer 中的鞍点到鞍点分析,并为执行模加的二次网络提供了完整的训练动力学刻画。
Post-detection inference for sequential changepoint localization
Authors: Aytijhya Saha, Aaditya Ramdas
First: 2025-02-10T02:01:30+00:00 · Latest: 2025-12-24T17:17:29+00:00
Abstract
This paper addresses a fundamental but largely unexplored challenge in sequential changepoint analysis: conducting inference following a detected change. We develop a very general framework to construct confidence sets for the unknown changepoint using only the data observed up to a data-dependent stopping time at which an arbitrary sequential detection algorithm declares a change. Our framework is nonparametric, making no assumption on the composite post-change class, the observation space, or the sequential detection procedure used, and is non-asymptotically valid. We also extend it to handle composite pre-change classes under a suitable assumption, and also derive confidence sets for the change magnitude in parametric settings. We provide theoretical guarantees on the width of our confidence intervals. Extensive simulations demonstrate that the produced sets have reasonable size, and slightly conservative coverage. In summary, we present the first general method for sequential changepoint localization, which is theoretically sound and broadly applicable in practice.
中文标题/摘要
标题:检测后推断在序列变化点定位中的应用
本文解决了序列变化点分析中一个基本但尚未充分探索的挑战:在检测到变化后进行推断。我们开发了一个非常通用的框架,仅使用截至某个数据依赖的停止时间(即任意序列检测算法宣告发生变化的时刻)所观察到的数据,来构建未知变化点的置信集。我们的框架是非参数的,对复合变化后类、观测空间或所使用的序列检测过程不做任何假设,并且是非渐近有效的。在适当的假设下,我们还将其扩展到复合变化前类的情形,并在参数化设定下推导出变化幅度的置信集。我们为置信区间的宽度提供了理论保证。大量模拟表明,所生成的置信集大小合理,覆盖率略偏保守。总之,我们提出了第一个通用的序列变化点定位方法,该方法理论上可靠,并且在实践中广泛适用。
Summary / 总结
This paper tackles the challenge of conducting inference after a change is detected in sequential changepoint analysis. It introduces a nonparametric framework that constructs confidence sets for the unknown changepoint using data observed up to a stopping time determined by any sequential detection algorithm. The method is non-asymptotically valid and can handle composite pre-change classes under certain assumptions. Theoretical guarantees are provided for the width of the confidence intervals, and simulations show that the confidence sets are reasonably sized with slightly conservative coverage. This work offers the first general method for sequential changepoint localization that is both theoretically sound and broadly applicable in practice.
本文解决了序列变化点分析中一个基本但尚未充分探索的挑战:在检测到变化后进行推断。提出了一种通用框架,利用截至由序列检测算法确定的停止时间所观察到的数据来构建未知变化点的置信集。该方法是非参数的,并且非渐近有效,对变化后类、观测空间或检测过程不做任何假设。文中给出了置信区间宽度的理论保证,模拟结果表明所得置信集大小合理、覆盖率略偏保守。这是首个通用的序列变化点定位方法,理论上可靠且适用范围广。
DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
Authors: Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu
First: 2025-10-10T05:39:45+00:00 · Latest: 2025-12-24T17:16:37+00:00
Comments: ICASSP26 under review. Demo page: https://nju-jet.github.io/DiTSinger
Abstract
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
中文标题/摘要
标题:DiTSinger:基于扩散变换器和隐式对齐扩展歌唱语音合成
基于扩散的歌唱语音合成(SVS)的最新进展展示了强大的表现力,但仍受限于数据稀缺性和模型可扩展性。我们提出了一种两阶段管道:通过固定旋律与多样化的LLM生成歌词配对,构建一个紧凑的种子集人声录音,并训练旋律特定模型生成超过500小时的高质量中文歌唱数据。在此语料库基础上,我们提出DiTSinger,一种基于RoPE和qk-norm的扩散变换器,系统地在深度、宽度和分辨率上进行扩展以增强保真度。此外,我们设计了一种隐式对齐机制,通过在字符级别范围内约束音素到声学的注意力来消除音素级持续时间标签,从而在噪声或不确定对齐下提高鲁棒性。大量实验验证了我们的方法能够实现可扩展、无需对齐和高保真的SVS。
Summary / 总结
The research aims to enhance the scalability and expressiveness of singing voice synthesis (SVS) while addressing data scarcity. A two-stage pipeline first builds a compact seed set of human-sung recordings by pairing fixed melodies with diverse LLM-generated lyrics, then trains melody-specific models to synthesize over 500 hours of high-quality Chinese singing data. DiTSinger, a Diffusion Transformer with RoPE and qk-norm, is trained on this corpus and scaled in depth, width, and resolution to improve fidelity. An implicit alignment mechanism constrains phoneme-to-acoustic attention within character-level spans, which removes the need for phoneme-level duration labels and improves robustness under noisy or uncertain alignments. Experiments show that this approach enables scalable, alignment-free, and high-fidelity SVS.
研究旨在提升歌唱语音合成的可扩展性和表达能力,同时解决数据稀缺问题。方法包括两阶段管道,将人类演唱录音与LLM生成的歌词配对以创建紧凑的种子集,随后训练旋律特定模型生成超过500小时的高质量中文歌唱数据。提出的DiTSinger使用具有RoPE和qk-norm的扩散变换器,并在深度、宽度和分辨率上进行扩展,还包含一种隐式对齐机制,以在嘈杂或不确定对齐下提高鲁棒性。实验表明,该方法能够实现可扩展、无需对齐和高保真度的歌唱语音合成。
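The implicit alignment idea, restricting phoneme-to-acoustic attention to character-level spans, can be pictured as an attention mask. The sketch below is an illustration under assumptions rather than DiTSinger's implementation: it supposes each character comes with a known frame span `[start, end)` and a list of phonemes, and the names `build_span_mask`, `phoneme_to_char`, and `char_spans` are hypothetical.

```python
from typing import List, Tuple

def build_span_mask(
    phoneme_to_char: List[int],
    char_spans: List[Tuple[int, int]],
    num_frames: int,
) -> List[List[bool]]:
    """Return a [num_phonemes x num_frames] mask.

    mask[p][f] is True when frame f lies inside the span of the character
    that phoneme p belongs to, so attention between them is allowed.
    """
    mask = []
    for char_idx in phoneme_to_char:
        start, end = char_spans[char_idx]
        mask.append([start <= f < end for f in range(num_frames)])
    return mask

# Hypothetical example: two characters; the first has two phonemes.
phoneme_to_char = [0, 0, 1]
char_spans = [(0, 3), (3, 5)]  # frame spans per character, end exclusive
mask = build_span_mask(phoneme_to_char, char_spans, num_frames=5)
```

Within a character's span the model remains free to decide how frames are shared among that character's phonemes, which is why no phoneme-level duration labels are required in this picture.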
Model Merging via Multi-Teacher Knowledge Distillation
Authors: Seyed Arshan Dalili, Mehrdad Mahdavi
First: 2025-12-24T17:10:44+00:00 · Latest: 2025-12-24T17:10:44+00:00
Abstract
Model merging has emerged as a lightweight alternative to joint multi-task learning (MTL), yet the generalization properties of merged models remain largely unexplored. Establishing such theoretical guarantees is non-trivial, as the merging process typically forbids access to the original training data and involves combining fine-tuned models trained on fundamentally heterogeneous data distributions. Without a principled understanding of these dynamics, current methods often rely on heuristics to approximate the optimal combination of parameters. This dependence is most critical in coefficient scaling, the weighting factors that modulate the magnitude of each fine-tuned model's contribution to the shared parameter. However, without a principled objective to guide their selection, these methods lead to brittle performance and are highly sensitive to scaling initialization. We address this gap by (i) establishing a novel flatness-aware PAC-Bayes generalization bound specifically for the model merging setting. This analysis introduces a "cross-task heterogeneity" term that formally captures the mismatch between diverse fine-tuned model priors and the target multi-task distributions. Guided by this theoretical insight, (ii) we frame model merging as multi-teacher knowledge distillation on scarce, unlabeled data. We formally demonstrate that minimizing the student-teacher Kullback-Leibler divergence directly tightens the upper bound on the merged model's excess risk. Guided by the flatness-aware bound derived, (iii) we operationalize this objective via SAMerging, a method that employs Sharpness-Aware Minimization (SAM) to find flat minima. Empirically, SAMerging establishes a new state of the art across vision and NLP benchmarks, achieving remarkable performance. The code is available at https://github.com/arshandalili/SAMerging.
中文标题/摘要
标题:多教师知识蒸馏下的模型合并
模型合并已作为一种轻量级替代方案出现,以取代联合多任务学习(MTL),但合并模型的泛化特性仍鲜有研究。建立此类理论保证并不容易,因为合并过程通常禁止访问原始训练数据,并涉及结合在根本上异质数据分布下训练的微调模型。在缺乏这些动态的原理性理解时,当前方法往往依赖于启发式方法来近似参数的最佳组合。这种方法在系数缩放中最为关键,即调节每个微调模型对共享参数贡献大小的权重因子。然而,由于缺乏指导其选择的原理性目标,这些方法会导致脆弱的性能,并且高度依赖于缩放初始化。我们通过(i) 建立一种新的基于平滑度的PAC-Bayes泛化界,专门适用于模型合并场景。此分析引入了一个“跨任务异质性”项,正式捕捉了多种微调模型先验与目标多任务分布之间的不匹配。受此理论洞察的指导,(ii) 我们将模型合并视为在稀缺未标记数据上的多教师知识蒸馏。我们正式证明,最小化学生-教师Kullback-Leibler散度直接收紧了合并模型超额风险的上界。受基于平滑度的界推导的指导,(iii) 我们通过SAMerging方法实现这一目标,该方法使用尖锐度感知最小化(SAM)来寻找平滑的极小值。实验中,SAMerging在视觉和自然语言处理基准测试中建立了新的最佳状态,实现了卓越的性能。代码可在https://github.com/arshandalili/SAMerging/ 获取。
Summary / 总结
The paper addresses the challenge of model merging, which is a lightweight alternative to joint multi-task learning but lacks theoretical guarantees. It introduces a novel flatness-aware PAC-Bayes generalization bound and frames model merging as multi-teacher knowledge distillation. The method, SAMerging, uses Sharpness-Aware Minimization to find flat minima and achieves state-of-the-art performance on vision and NLP benchmarks.
论文针对模型合并(联合多任务学习的一种轻量级替代方案)缺乏泛化理论保证的问题,建立了一个考虑平坦度的PAC-Bayes泛化界。该界刻画了不同微调模型先验与目标多任务分布之间的不匹配。作者将模型合并视为多教师知识蒸馏,并提出SAMerging方法,使用尖锐度感知最小化(SAM)来寻找平坦的极小值,在视觉和自然语言处理基准测试中取得了新的最先进性能。
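The distillation objective described above can be sketched as an average KL divergence between each task's teacher and the merged student on that task's unlabeled inputs. This is a minimal illustration, not the SAMerging code: the KL direction (teacher to student), the equal task weighting, and all function names are assumptions, and the Sharpness-Aware Minimization step is omitted.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def multi_teacher_kd_loss(
    student_logits: List[List[float]],
    teacher_logits: List[List[float]],
) -> float:
    """Mean KL(teacher || student), one (student, teacher) logit pair per task."""
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(losses) / len(losses)

# Hypothetical logits for two tasks on one unlabeled example each.
teachers = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_same = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_off = [[0.0, 0.0, 0.0], [1.5, 0.1, 0.3]]

loss_same = multi_teacher_kd_loss(student_same, teachers)
loss_off = multi_teacher_kd_loss(student_off, teachers)
```

In a merging setting the student logits would come from a model whose parameters combine the fine-tuned models through the merging coefficients, and those coefficients would be what this loss is minimized over.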
SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance
Authors: Divij Dudeja, Mayukha Pal
First: 2025-12-24T16:59:04+00:00 · Latest: 2025-12-24T16:59:04+00:00
Abstract
Users of Engineering Manuals (EMs) find them difficult to read because they are long and densely formatted, combining written documents, step-by-step procedures, and standard parameter lists for engineering equipment. Off-the-shelf transformers, especially compact ones, treat this material as a flat stream of tokens. This approach leads to confident but incorrect numeric answers and forces the models to memorize separate facts inefficiently. SMART (Structured Memory and Reasoning Transformer) offers a different and practical solution to the above problem. SMART structures its processing hierarchically around three main components: (1) a syntax-aware Fact Extractor (Grammarian), a Tree LSTM that extracts facts as subject-relation-object relations from EM sentences; (2) a compact indexed memory MANN (Memory Augmented Neural Network) that indexes these Rational Subject Relation Objects as 384-dimensional vectors associated with the source of the information; and (3) a 6-layer Transformer that learns to fuse the previously retrieved facts into its generated response. The entire SMART model utilizes 45.51M parameters, which is 64% less than GPT-2 (124M) and 69% less than BERT (133M), and it achieves a 21.3% higher accuracy than GPT-2, indicating that SMART fits the data better with the least amount of processing requirements. SMART employs dual modes of inference: an indexed fast path for known documents (sub-second answer times) and an indexed dynamic path assisted by RAGs for new uploads (FAISS Top 20 results with memory severed at 64 slots). In real world deployment, this framework leads to better-supported results with fewer hallucinations than comparable small transformer models.
中文标题/摘要
标题:SMART SLM:结构化记忆与推理变换器,一种用于准确文档辅助的小型语言模型
工程手册(EM)的用户发现阅读EMs很困难,因为它们很长,格式密集,包含书面文档、逐步程序和工程设备的标准参数列表。现成的变换器,尤其是紧凑型的,将这些材料视为一个扁平的令牌流。这种方法导致了自信但错误的数字答案,并迫使模型以低效的方式记忆单独的事实。SMART(结构化记忆与推理变换器)为上述问题提供了一种不同的且实用的解决方案。SMART通过使用分层方法来结构化其处理过程,并基于三个主要工作类别:(1)语法意识事实提取器(语法学家)树LSTM,从EM句子中提取作为主语关系对象关系的事实;(2)紧凑索引记忆MANN(记忆增强神经网络),将这些理性主语关系对象索引为384维向量,与信息来源相关联;(3)6层变换器,学习将之前检索到的事实融合到其生成的响应中。整个SMART模型使用45.51M参数,比GPT-2(124M)少64%,比BERT(133M)少69%,并且其准确率比GPT-2高21.3%,表明SMART以最少的处理需求更好地拟合数据。SMART采用双模式推理,已知文档的索引快速路径(亚秒级答案时间)和新上传文件的索引动态路径(借助RAGs的FAISS前20结果,记忆限制在64个槽位)。在实际部署中,该框架比可比的小型变换器模型产生更支持的结果,减少了幻觉。
Summary / 总结
The paper addresses the challenge of accurately processing Engineering Manuals (EMs) with transformers, which often treat them as flat token streams, leading to incorrect numeric answers and inefficient memorization. SMART (Structured Memory and Reasoning Transformer) proposes a hierarchical approach with a syntax-aware fact extractor, a compact indexed memory, and a transformer that fuses the retrieved facts into its response. SMART uses 45.51M parameters, fewer than GPT-2 (124M) and BERT (133M), and achieves 21.3% higher accuracy than GPT-2. It supports dual inference modes for known and newly uploaded documents, and the authors report better-supported results with fewer hallucinations than comparable small transformer models in real-world deployment.
论文旨在解决使用小型语言模型准确处理工程手册(EM)的挑战。它引入了SMART(结构化记忆和推理变换器),采用分层方法,包括语法感知的事实提取器、紧凑的索引记忆和变换器,以提高准确性。SMART使用45.51M参数,比GPT-2高出21.3%的准确率,同时需要较少的处理。它支持已知和新上传文档的双重推理模式,从而提供更可靠的、减少幻觉的结果,优于其他小型变换器模型。
ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision
Authors: Weiqi Li, Zehao Zhang, Liang Lin, Guangrun Wang
First: 2025-12-24T16:24:18+00:00 · Latest: 2025-12-24T16:24:18+00:00
Abstract
Controllability is a fundamental requirement in video synthesis, where accurate alignment with conditioning signals is essential. Existing classifier-free guidance methods typically achieve conditioning indirectly by modeling the joint distribution of data and conditions, which often results in limited controllability over the specified conditions. Classifier-based guidance enforces conditions through an external classifier, but the model may exploit this mechanism to raise the classifier score without genuinely satisfying the intended condition, resulting in adversarial artifacts and limited effective controllability. In this paper, we propose Attention-Conditional Diffusion (ACD), a novel framework for direct conditional control in video diffusion models via attention supervision. By aligning the model's attention maps with external control signals, ACD achieves better controllability. To support this, we introduce a sparse 3D-aware object layout as an efficient conditioning signal, along with a dedicated Layout ControlNet and an automated annotation pipeline for scalable layout integration. Extensive experiments on benchmark video generation datasets demonstrate that ACD delivers superior alignment with conditioning inputs while preserving temporal coherence and visual fidelity, establishing an effective paradigm for conditional video synthesis.
中文标题/摘要
标题:ACD:通过注意力监督实现视频扩散模型的直接条件控制
在视频合成中,可控性是一个基本要求,准确对齐条件信号至关重要。现有无分类器自由引导方法通常通过建模数据和条件的联合分布间接实现条件化,这往往导致对指定条件的有限可控性。基于分类器的引导通过外部分类器强制执行条件,但模型可能会利用这种机制提高分类器分数而不真正满足预期条件,从而产生对抗性伪影并限制有效的可控性。在本文中,我们提出了注意力条件扩散(ACD),这是一种通过注意力监督实现视频扩散模型直接条件控制的新框架。通过使模型的注意力图与外部控制信号对齐,ACD 达到了更好的可控性。为此,我们引入了一种稀疏的3D感知对象布局作为高效的条件信号,以及一个专用的布局控制网和自动注释流水线以实现可扩展的布局集成。在基准视频生成数据集上的大量实验表明,ACD 在保持时间连贯性和视觉保真度的同时,实现了与条件输入的更优对齐,从而建立了条件视频合成的有效范式。
Summary / 总结
The paper proposes ACD, a novel framework for direct conditional control in video diffusion models using attention supervision. It addresses the limitations of existing methods by aligning the model's attention maps with external control signals, thereby improving controllability. ACD uses a sparse 3D-aware object layout as an efficient conditioning signal and includes a Layout ControlNet and an automated annotation pipeline. Experiments show that ACD achieves better alignment with conditioning inputs while maintaining temporal coherence and visual fidelity.
论文提出了使用注意力监督的直接条件控制方法ACD,通过使模型的注意力图与外部控制信号对齐来增强可控性。ACD 使用稀疏的3D感知对象布局作为高效的条件信号,并包含一个布局ControlNet和自动注释流水线。实验表明,ACD 在保持时间连贯性和视觉保真度的同时,能够更好地与条件输入对齐,优于现有方法在条件视频合成中的表现。
AnyAD: Unified Any-Modality Anomaly Detection in Incomplete Multi-Sequence MRI
Authors: Changwei Wu, Yifei Chen, Yuxin Du, Mingxuan Liu, Jinying Zong, Beining Wu, Jie Dong, Feiwei Qin, Yunkang Cao, Qiyuan Tian
First: 2025-12-24T16:16:09+00:00 · Latest: 2025-12-24T16:16:09+00:00
Comments: 15 pages, 8 figures
Abstract
Reliable anomaly detection in brain MRI remains challenging due to the scarcity of annotated abnormal cases and the frequent absence of key imaging modalities in real clinical workflows. Existing single-class or multi-class anomaly detection (AD) models typically rely on fixed modality configurations, require repetitive training, or fail to generalize to unseen modality combinations, limiting their clinical scalability. In this work, we present a unified Any-Modality AD framework that performs robust anomaly detection and localization under arbitrary MRI modality availability. The framework integrates a dual-pathway DINOv2 encoder with a feature distribution alignment mechanism that statistically aligns incomplete-modality features with full-modality representations, enabling stable inference even with severe modality dropout. To further enhance semantic consistency, we introduce an Intrinsic Normal Prototypes (INPs) extractor and an INP-guided decoder that reconstruct only normal anatomical patterns while naturally amplifying abnormal deviations. Through randomized modality masking and indirect feature completion during training, the model learns to adapt to all modality configurations without re-training. Extensive experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks demonstrate that our approach consistently surpasses state-of-the-art industrial and medical AD baselines across 7 modality combinations, achieving superior generalization. This study establishes a scalable paradigm for multimodal medical AD under real-world, imperfect modality conditions. Our source code is available at https://github.com/wuchangw/AnyAD.
中文标题/摘要
标题:AnyAD:统一任意模态MRI不完整多序列异常检测
由于标注异常病例稀缺和实际临床工作流程中关键成像模态的频繁缺失,脑MRI中的可靠异常检测仍然具有挑战性。现有的单类或多类异常检测(AD)模型通常依赖于固定模态配置,需要重复训练,或者无法泛化到未见过的模态组合,限制了其临床应用的扩展性。本文提出了一种统一的任意模态AD框架,能够在任意MRI模态可用性下进行稳健的异常检测和定位。该框架结合了双路径DINOv2编码器和特征分布对齐机制,统计地将不完整模态特征与完整模态表示对齐,即使在严重模态缺失的情况下也能实现稳定的推理。为了进一步增强语义一致性,我们引入了内在正常原型(INPs)提取器和INP引导解码器,仅重建正常解剖模式,自然放大异常偏差。通过训练期间的随机模态遮蔽和间接特征补全,模型学会了适应所有模态配置而无需重新训练。在BraTS2018、MU-Glioma-Post和Pretreat-MetsToBrain-Masks数据集上的广泛实验表明,我们的方法在7种模态组合中始终超越最先进的工业和医疗AD基线,实现了更好的泛化能力。本研究为在实际不完美模态条件下建立多模态医疗AD的可扩展范式奠定了基础。我们的源代码可在https://github.com/wuchangw/AnyAD获取。
Summary / 总结
This paper addresses the challenge of reliable anomaly detection in brain MRI by proposing AnyAD, a unified framework for any-modality anomaly detection. The method uses a dual-pathway DINOv2 encoder with feature distribution alignment to handle incomplete modality data and an Intrinsic Normal Prototypes (INPs) extractor to enhance semantic consistency. Experiments on BraTS2018, MU-Glioma-Post, and Pretreat-MetsToBrain-Masks show that AnyAD outperforms existing baselines across seven modality combinations, demonstrating superior generalization and scalability under real-world conditions.
该研究提出AnyAD框架,以解决脑MRI中可靠的异常检测问题。该框架采用双路径DINOv2编码器和特征分布对齐机制处理不完整模态数据,并使用内在正常原型提取器增强语义一致性。实验结果显示,AnyAD在三个数据集上的七种模态组合中均优于现有方法,展示了在真实世界条件下出色的泛化能力和可扩展性。
ReaSeq: Unleashing World Knowledge via Reasoning for Sequential Modeling
Authors: Chuan Wang, Gaoming Yang, Han Wu, Jiakai Tang, Jiahao Yu, Jian Wu, Jianwu Hu, Junjun Zheng, Shuwen Xiao, Yeqiu Yang, Yuning Jiang, Ahjol Nurlanbek, Binbin Cao, Bo Zheng, Fangmei Zhu, Gaoming Zhou, Huimin Yi, Huiping Chu, Jin Huang, Jinzhe Shan, Kenan Cui, Longbin Li, Silu Zhou, Wen Chen, Xia Ming, Xiang Gao, Xin Yao, Xingyu Wen, Yan Zhang, Yiwen Hu, Yulin Wang, Ziheng Bao, Zongyuan Wu
First: 2025-12-24T16:06:20+00:00 · Latest: 2025-12-24T16:06:20+00:00
Abstract
Industrial recommender systems face two fundamental limitations under the log-driven paradigm: (1) knowledge poverty in ID-based item representations that causes brittle interest modeling under data sparsity, and (2) systemic blindness to beyond-log user interests that constrains model performance within platform boundaries. These limitations stem from an over-reliance on shallow interaction statistics and close-looped feedback while neglecting the rich world knowledge about product semantics and cross-domain behavioral patterns that Large Language Models have learned from vast corpora.
To address these challenges, we introduce ReaSeq, a reasoning-enhanced framework that leverages world knowledge in Large Language Models to address both limitations through explicit and implicit reasoning. Specifically, ReaSeq employs explicit Chain-of-Thought reasoning via multi-agent collaboration to distill structured product knowledge into semantically enriched item representations, and latent reasoning via Diffusion Large Language Models to infer plausible beyond-log behaviors. Deployed on Taobao's ranking system serving hundreds of millions of users, ReaSeq achieves substantial gains: >6.0% in IPV and CTR, >2.9% in Orders, and >2.5% in GMV, validating the effectiveness of world-knowledge-enhanced reasoning over purely log-driven approaches.
中文标题/摘要
标题:ReaSeq:通过推理释放世界知识以增强序列建模
工业推荐系统在日志驱动范式下面临两个根本限制:(1) 基于ID的物品表示的知识贫乏,导致在数据稀疏情况下兴趣建模脆弱,(2) 对超出日志的用户兴趣的系统性盲视,限制了模型在平台边界内的性能。这些限制源于对浅层交互统计和闭环反馈的过度依赖,而忽视了大型语言模型从大量语料库中学习到的产品语义和跨域行为模式的丰富世界知识。
为了解决这些挑战,我们引入了ReaSeq,这是一种增强推理的框架,通过显式和隐式推理利用大型语言模型中的世界知识来解决这两个限制。具体而言,ReaSeq 通过多智能体协作进行显式的链式推理,将结构化的商品知识提炼为语义丰富的物品表示,并通过扩散大型语言模型进行潜在推理,以推断可能的超出日志的行为。ReaSeq 在淘宝的排名系统中部署,服务于数亿用户,实现了显著的提升:IPV和CTR超过6.0%,订单量超过2.9%,GMV超过2.5%,验证了增强世界知识推理的有效性,超越了纯粹的日志驱动方法。
Summary / 总结
ReaSeq is a reasoning-enhanced framework that addresses the limitations of industrial recommender systems by incorporating world knowledge from Large Language Models. It uses explicit Chain-of-Thought reasoning to create semantically enriched item representations and latent reasoning to infer beyond-log user behaviors. On Taobao's ranking system, ReaSeq significantly improves key metrics, including IPV and CTR by over 6.0%, orders by over 2.9%, and GMV by over 2.5%.
ReaSeq 是一个增强推理框架,利用大型语言模型中的世界知识来解决日志驱动推荐系统的局限性。它使用显式的链式推理来丰富物品表示,并使用潜在推理来推断超出日志的行为。在淘宝的排名系统中部署后,ReaSeq 在关键指标上取得了显著提升:IPV 和 CTR 提升超过 6.0%,订单量提升超过 2.9%,GMV 提升超过 2.5%。
Step-DeepResearch Technical Report
Authors: Chen Hu, Haikuo Du, Heng Wang, Lin Lin, Mingrui Chen, Peng Liu, Ruihang Miao, Tianchi Yue, Wang You, Wei Ji, Wei Yuan, Wenjin Deng, Xiaojian Yuan, Xiaoyun Zhang, Xiangyu Liu, Xikai Liu, Yanming Xu, Yicheng Cao, Yifei Zhang, Yongyao Wang, Yubo Shu, Yurong Zhang, Yuxiang Zhang, Zheng Gong, Zhichao Chang, Binyan Li, Dan Ma, Furong Jia, Hongyuan Wang, Jiayu Liu, Jing Bai, Junlan Liu, Manjiao Liu, Na Wang, Qiuping Wu, Qinxin Du, Shiwei Li, Wen Sun, Yifeng Gong, Yonglin Chen, Yuling Zhao, Yuxuan Lin, Ziqi Ren, Zixuan Wang, Aihu Zhang, Brian Li, Buyun Ma, Kang An, Li Xie, Mingliang Li, Pan Li, Shidong Yang, Xi Chen, Xiaojia Liu, Yuchu Luo, Yuan Song, YuanHao Ding, Yuanwei Liang, Zexi Li, Zhaoning Zhang, Zixin Zhang, Binxing Jiao, Daxin Jiang, Jiansheng Chen, Jing Li, Xiangyu Zhang, Yibo Zhu
First: 2025-12-23T16:32:27+00:00 · Latest: 2025-12-24T15:52:31+00:00
Abstract
As LLMs shift toward autonomous agents, Deep Research has emerged as a pivotal metric. However, existing academic benchmarks like BrowseComp often fail to meet real-world demands for open-ended research, which requires robust skills in intent recognition, long-horizon decision-making, and cross-source verification. To address this, we introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing, combined with a progressive training path from agentic mid-training to SFT and RL. Enhanced by a Checklist-style Judger, this approach significantly improves robustness. Furthermore, to bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios. Experimental results show that Step-DeepResearch (32B) scores 61.4% on Scale AI Research Rubrics. On ADR-Bench, it significantly outperforms comparable models and rivals SOTA closed-source models like OpenAI and Gemini DeepResearch. These findings prove that refined training enables medium-sized models to achieve expert-level capabilities at industry-leading cost-efficiency.
中文标题/摘要
标题:Step-DeepResearch 技术报告
随着大语言模型(LLMs)向自主代理转变,Deep Research 成为了一个关键指标。然而,现有的学术基准如 BrowseComp 通常无法满足开放研究的实际需求,这需要强大的意图识别、长期决策能力和跨源验证技能。为解决这一问题,我们引入了 Step-DeepResearch,这是一种经济高效的端到端代理。我们提出了一种基于原子能力的数据合成策略,以增强规划和报告撰写能力,并结合了从代理中期训练到SFT和RL的渐进式训练路径。通过一种清单式评判器的增强,这种方法显著提高了鲁棒性。此外,为了弥合中文领域的评估差距,我们建立了 ADR-Bench 以适应现实的深度研究场景。实验结果显示,Step-DeepResearch(32B)在 Scale AI 研究评分表上的得分为61.4%。在 ADR-Bench 上,它显著优于同类模型,并与 OpenAI 和 Gemini DeepResearch 等顶级封闭源模型相媲美。这些发现证明了精细训练使中型模型能够在行业领先的成本效益下实现专家级能力。
Summary / 总结
The research introduces Step-DeepResearch, an end-to-end agent designed to address the limitations of existing benchmarks in handling open-ended research tasks. It employs a Data Synthesis Strategy Based on Atomic Capabilities and a progressive training path from mid-training to SFT and RL, enhanced by a Checklist-style Judger. The model scores 61.4% on Scale AI Research Rubrics and outperforms comparable models on ADR-Bench, demonstrating that medium-sized models can achieve expert-level capabilities with refined training at lower costs.
研究旨在开发一个强大的自主代理以应对开放性研究任务,解决现有基准的局限性。研究引入了Step-DeepResearch,采用基于原子能力的数据合成策略和从中期训练到SFT和RL的渐进式训练路径。该代理在Scale AI研究评分表和ADR-Bench上的表现分别达到61.4%和显著优于同类模型,证明了精细训练对中型模型的有效性。
LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation
Authors: Anatoly O. Onishchenko, Alexey K. Kovalev, Aleksandr I. Panov
First: 2025-12-24T15:36:21+00:00 · Latest: 2025-12-24T15:36:21+00:00
Abstract
Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. To successfully complete tasks, the LLM must be grounded in the environment in which the robot operates. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. However, these approaches do not account for changes in the environment that may occur between the graph construction and the task execution. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors. During plan execution, LookPlanGraph continuously updates the graph with relevant objects, either by verifying existing priors or discovering new entities. This is achieved by processing the agent's egocentric camera view using a Vision Language Model. We conducted experiments with changed object positions in the VirtualHome and OmniGibson simulated environments, demonstrating that LookPlanGraph outperforms methods based on predefined static scene graphs. To demonstrate the practical applicability of our approach, we also conducted experiments in a real-world setting. Additionally, we introduce the GraSIF (Graph Scenes for Instruction Following) dataset with an automated validation framework, comprising 514 tasks drawn from SayPlan Office, BEHAVIOR-1K, and VirtualHome RobotHow. Project page available at https://lookplangraph.github.io.
中文标题/摘要
标题:LookPlanGraph:基于VLM图增强的体感指令跟随方法
使用大型语言模型(LLM)作为规划器的方法在具身指令跟随任务中变得普遍。为了成功完成任务,LLM 必须在机器人操作的环境中得到接地。一种解决方案是使用包含所有必要信息的场景图。现代方法依赖于预先构建的场景图,并假设在规划开始时所有任务相关信息都已可用。然而,这些方法没有考虑到在图构建和任务执行之间环境可能发生的变化。我们提出了 LookPlanGraph 方法,该方法利用由静态资产和对象先验组成的场景图。在计划执行过程中,LookPlanGraph 不断通过验证现有先验或发现新实体来更新图。这通过使用视觉语言模型处理代理的第一人称摄像机视图来实现。我们在对象位置发生变化的 VirtualHome 和 OmniGibson 模拟环境中进行了实验,证明了 LookPlanGraph 优于基于预定义静态场景图的方法。为了展示我们方法的实际适用性,我们还在现实世界中进行了实验。此外,我们引入了带有自动验证框架的 GraSIF(用于指令跟随的图场景)数据集,包含来自 SayPlan Office、BEHAVIOR-1K 和 VirtualHome RobotHow 的 514 个任务。项目页面可在 https://lookplangraph.github.io 查看。
Summary / 总结
LookPlanGraph is a method that enhances embodied instruction following by using a scene graph augmented with object priors and continuous updates from a Vision Language Model. It outperforms methods relying on static scene graphs by dynamically updating the graph during task execution. Experiments in simulated and real-world environments show improved performance in handling changes in object positions. The GraSIF dataset, which includes 514 tasks, supports the validation of the approach.
研究旨在通过将大型语言模型(LLMs)与动态环境相结合,提高执行指令跟随任务的表现。LookPlanGraph 使用包含物体先验信息的场景图,并在执行过程中通过视觉语言模型处理代理的主观视角来不断更新该图。实验结果表明,LookPlanGraph 在模拟和真实世界环境中优于依赖预定义静态场景图的方法,尤其是在物体位置发生变化时表现更佳。
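The graph-update idea in the LookPlanGraph abstract (verify existing priors or discover new entities while the plan executes) can be sketched roughly as follows. This is an illustrative toy and not the authors' implementation: the dictionary layout, the `update_scene_graph` function, and the assumption that a Vision Language Model has already returned a list of visible object names are all hypothetical.

```python
def update_scene_graph(graph, location, observed_objects):
    """Reconcile object priors with what is currently seen at `location`.

    graph: dict mapping object name -> {"location": str or None, "verified": bool}
    observed_objects: names reported for the current egocentric view
                      (assumed to come from a VLM; hypothetical interface).
    """
    observed = set(observed_objects)

    # A prior that points at this location but is not seen is treated as refuted.
    for name, node in graph.items():
        if node["location"] == location and not node["verified"] and name not in observed:
            node["location"] = None

    # Everything that is seen is either a confirmed prior or a newly discovered entity.
    for name in observed:
        graph[name] = {"location": location, "verified": True}

    return graph


# Hypothetical priors built before execution.
graph = {
    "mug": {"location": "kitchen_table", "verified": False},
    "book": {"location": "shelf", "verified": False},
}

update_scene_graph(graph, "kitchen_table", ["plate"])  # mug is missing, plate is new
update_scene_graph(graph, "sink", ["mug"])             # mug is found elsewhere
print(graph["mug"])  # {'location': 'sink', 'verified': True}
```

A planner built on such a graph would re-plan whenever a prior is refuted, which is the situation static scene graphs cannot represent.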
Assessing the Software Security Comprehension of Large Language Models
Authors: Mohammed Latif Siddiq, Natalie Sekerak, Antonio Karam, Maria Leal, Arvin Islam-Gomes, Joanna C. S. Santos
First: 2025-12-24T15:29:54+00:00 · Latest: 2025-12-24T15:29:54+00:00
Comments: Submitted to Empirical Software Engineering (EMSE) journal
Abstract
Large language models (LLMs) are increasingly used in software development, but their level of software security expertise remains unclear. This work systematically evaluates the security comprehension of five leading LLMs: GPT-4o-Mini, GPT-5-Mini, Gemini-2.5-Flash, Llama-3.1, and Qwen-2.5, using Bloom's Taxonomy as a framework. We assess six cognitive dimensions: remembering, understanding, applying, analyzing, evaluating, and creating. Our methodology integrates diverse datasets, including curated multiple-choice questions, vulnerable code snippets (SALLM), course assessments from an Introduction to Software Security course, real-world case studies (XBOW), and project-based creation tasks from a Secure Software Engineering course. Results show that while LLMs perform well on lower-level cognitive tasks such as recalling facts and identifying known vulnerabilities, their performance degrades significantly on higher-order tasks that require reasoning, architectural evaluation, and secure system creation. Beyond reporting aggregate accuracy, we introduce a software security knowledge boundary that identifies the highest cognitive level at which a model consistently maintains reliable performance. In addition, we identify 51 recurring misconception patterns exhibited by LLMs across Bloom's levels.
中文标题/摘要
标题:评估大型语言模型的软件安全理解能力
大型语言模型(LLMs)在软件开发中的应用越来越广泛,但它们的软件安全专业知识水平仍不清楚。本研究系统地评估了五种领先的大语言模型:GPT-4o-Mini、GPT-5-Mini、Gemini-2.5-Flash、Llama-3.1 和 Qwen-2.5 的安全理解能力,使用布卢姆分类法作为框架。我们评估了六个认知维度:记忆、理解、应用、分析、评价和创造。我们的方法论结合了多种数据集,包括精心策划的多项选择题、漏洞代码片段(SALLM)、软件安全导论课程的课程评估、真实世界的案例研究(XBOW)以及安全软件工程课程中的基于项目的创作任务。结果显示,虽然大语言模型在较低层次的认知任务,如记忆事实和识别已知漏洞方面表现良好,但在需要推理、架构评估和安全系统创建的较高层次任务上表现显著下降。除了报告总体准确性外,我们还引入了一个软件安全知识边界,以识别模型在哪个最高认知水平上能够保持可靠的表现。此外,我们确定了大语言模型在布卢姆层次上表现出的51种反复出现的认知误区模式。
Summary / 总结
This study evaluates the software security comprehension of five leading large language models (LLMs) using Bloom's Taxonomy. The models are assessed on six cognitive dimensions, from remembering to creating. While LLMs excel at recalling facts and identifying known vulnerabilities, their performance drops on higher-order tasks requiring reasoning and secure system creation. The research introduces a software security knowledge boundary to identify the highest cognitive level at which a model consistently performs reliably. Additionally, 51 recurring misconception patterns across Bloom's levels are identified.
本研究使用布卢姆分类法(Bloom's Taxonomy)评估了五种领先的大语言模型的软件安全理解能力,评估了六个认知维度。虽然大语言模型在回忆事实和识别已知漏洞方面表现出色,但在需要推理和安全系统创建的高级任务上表现不佳。研究引入了一个软件安全知识边界,以确定模型能够持续保持可靠表现的最高认知水平,并揭示了跨不同认知水平的51种反复出现的误解模式。
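One possible reading of the "knowledge boundary" described above is to walk Bloom's levels from lowest to highest and report the last level reached before performance first drops below a reliability threshold. The sketch below follows that reading only. The 0.7 threshold, the `knowledge_boundary` function, and the sample scores are assumptions made for illustration, not the paper's definition or data.

```python
# The six cognitive dimensions named in the abstract, ordered from lowest to highest.
BLOOM_LEVELS = [
    "remembering",
    "understanding",
    "applying",
    "analyzing",
    "evaluating",
    "creating",
]


def knowledge_boundary(scores, threshold=0.7):
    """Return the highest level reached before the first unreliable level.

    scores: dict mapping level name -> score in [0, 1] (hypothetical metric).
    Returns None when even the lowest level falls below the threshold.
    """
    boundary = None
    for level in BLOOM_LEVELS:
        if scores.get(level, 0.0) < threshold:
            break  # stop at the first failure; later recoveries are ignored
        boundary = level
    return boundary


# Made-up scores for one model.
example_scores = {
    "remembering": 0.92,
    "understanding": 0.85,
    "applying": 0.74,
    "analyzing": 0.58,
    "evaluating": 0.71,
    "creating": 0.40,
}
print(knowledge_boundary(example_scores))  # applying
```

Stopping at the first failure, rather than taking the highest level that passes, reflects the word "consistently": the isolated 0.71 on "evaluating" does not move the boundary.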
SegMo: Segment-aligned Text to 3D Human Motion Generation
Authors: Bowen Dang, Lin Wu, Xiaohang Yang, Zheng Yuan, Zhixiang Chen
First: 2025-12-24T15:26:11+00:00 · Latest: 2025-12-24T15:26:11+00:00
Comments: The IEEE/CVF Winter Conference on Applications of Computer Vision 2026
Abstract
Generating 3D human motions from textual descriptions is an important research problem with broad applications in video games, virtual reality, and augmented reality. Recent methods align the textual description with human motion at the sequence level, neglecting the internal semantic structure of modalities. However, both motion descriptions and motion sequences can be naturally decomposed into smaller and semantically coherent segments, which can serve as atomic alignment units to achieve finer-grained correspondence. Motivated by this, we propose SegMo, a novel Segment-aligned text-conditioned human Motion generation framework to achieve fine-grained text-motion alignment. Our framework consists of three modules: (1) Text Segment Extraction, which decomposes complex textual descriptions into temporally ordered phrases, each representing a simple atomic action; (2) Motion Segment Extraction, which partitions complete motion sequences into corresponding motion segments; and (3) Fine-grained Text-Motion Alignment, which aligns text and motion segments with contrastive learning. Extensive experiments demonstrate that SegMo improves the strong baseline on two widely used datasets, achieving an improved TOP 1 score of 0.553 on the HumanML3D test set. Moreover, thanks to the learned shared embedding space for text and motion segments, SegMo can also be applied to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
中文标题/摘要
标题:SegMo: 与片段对齐的文本到3D人体动作生成
从文本描述生成3D人体动作是一个重要的研究问题,在视频游戏、虚拟现实和增强现实等领域有着广泛的应用。最近的方法在序列级别上将文本描述与人体动作对齐,忽略了模态的内部语义结构。然而,动作描述和动作序列可以自然地分解为更小且语义上更连贯的片段,这些片段可以作为原子对齐单元以实现更精细的对应。受此启发,我们提出了一种新颖的SegMo框架,以实现细粒度的文本-动作对齐。我们的框架由三个模块组成:(1) 文本片段提取,将复杂的文本描述分解为按时间顺序排列的短语,每个短语代表一个简单的原子动作;(2) 动作片段提取,将完整的动作序列分割为相应的动作片段;(3) 细粒度文本-动作对齐,通过对比学习对齐文本和动作片段。广泛的实验表明,SegMo在两个广泛使用的数据集上提高了强基线,HumanML3D测试集上的TOP 1得分为0.553。此外,由于学习到的文本和动作片段共享嵌入空间,SegMo还可以应用于检索任务,如动作定位和动作到文本检索。
Summary / 总结
SegMo is a novel framework for generating 3D human motions from text, addressing the limitation of previous methods by aligning text and motion at the segment level. It consists of three modules: Text Segment Extraction, Motion Segment Extraction, and Fine-grained Text-Motion Alignment. SegMo improves on a strong baseline on two widely used datasets, reaching a TOP 1 score of 0.553 on the HumanML3D test set, and its shared embedding space for text and motion segments also makes it applicable to retrieval-style tasks such as motion grounding and motion-to-text retrieval.
SegMo 是一种新颖的框架,用于从文本生成 3D 人体动作,通过在片段级别对齐文本和动作来解决先前方法的局限性。它包括三个模块:文本片段提取、动作片段提取和细粒度文本-动作对齐。SegMo 在 HumanML3D 测试集上优于强基线,达到 TOP 1 分数 0.553,并在动作定位和动作到文本检索任务中显示出有效性。
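The abstract states that text and motion segments are aligned with contrastive learning but does not give the loss. A common choice for this kind of alignment is an InfoNCE-style objective, sketched below in plain Python for the text-to-motion direction only. The `info_nce` function, the temperature of 0.1, and the two-dimensional toy embeddings are assumptions for illustration. SegMo's actual loss may differ.

```python
import math


def info_nce(text_embs, motion_embs, temperature=0.1):
    """Mean InfoNCE loss where text segment i should match motion segment i."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        norm = math.sqrt(dot(v, v))
        return [x / norm for x in v]

    text = [normalize(v) for v in text_embs]
    motion = [normalize(v) for v in motion_embs]

    total = 0.0
    for i, t in enumerate(text):
        # Cosine similarity of text segment i to every motion segment.
        logits = [dot(t, m) / temperature for m in motion]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of the matching segment.
        total += log_denominator - logits[i]
    return total / len(text)


# Toy embeddings: correctly paired segments versus the same segments shuffled.
text_segments = [[1.0, 0.0], [0.0, 1.0]]
aligned_motion = [[1.0, 0.0], [0.0, 1.0]]
shuffled_motion = [[0.0, 1.0], [1.0, 0.0]]

print(info_nce(text_segments, aligned_motion) < info_nce(text_segments, shuffled_motion))  # True
```

Minimizing this loss pulls each text segment toward its own motion segment and away from the others in the batch, which is what produces a shared embedding space usable for retrieval.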
Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking
Authors: Yifan Huang, Xiaojun Jia, Wenbo Guo, Yuqiang Sun, Yihao Huang, Chong Wang, Yang Liu
First: 2025-12-24T15:25:31+00:00 · Latest: 2025-12-24T15:25:31+00:00
Comments: Accepted to FSE 2026
Abstract
Large language models (LLMs) have revolutionized software development through AI-assisted coding tools, enabling developers with limited programming expertise to create sophisticated applications. However, this accessibility extends to malicious actors who may exploit these powerful tools to generate harmful software. Existing jailbreaking research primarily focuses on general attack scenarios against LLMs, with limited exploration of malicious code generation as a jailbreak target. To address this gap, we propose SPELL, a comprehensive testing framework specifically designed to evaluate the weakness of security alignment in malicious code generation. Our framework employs a time-division selection strategy that systematically constructs jailbreaking prompts by intelligently combining sentences from a prior knowledge dataset, balancing exploration of novel attack patterns with exploitation of successful techniques. Extensive evaluation across three advanced code models (GPT-4.1, Claude-3.5, and Qwen2.5-Coder) demonstrates SPELL's effectiveness, achieving attack success rates of 83.75%, 19.38%, and 68.12% respectively across eight malicious code categories. The generated prompts successfully produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%. These findings reveal significant security gaps in current LLM implementations and provide valuable insights for improving AI safety alignment in code generation applications.
中文标题/摘要
标题:Casting a SPELL: 句子配对探索以突破LLM限制
大规模语言模型(LLMs)通过AI辅助编程工具革新了软件开发,使具有有限编程经验的开发人员能够创建复杂的应用程序。然而,这种易用性也扩展到了恶意行为者,他们可能利用这些强大的工具生成有害软件。现有的越狱研究主要集中在针对LLMs的一般攻击场景上,对恶意代码生成作为越狱目标的探索有限。为了解决这一缺口,我们提出了SPELL,一个全面的测试框架,专门用于评估恶意代码生成中的安全对齐弱点。该框架采用时间分割选择策略,通过智能组合先前知识数据集中的句子系统地构建越狱提示,平衡探索新颖攻击模式与利用成功技术之间的关系。在三个高级代码模型(GPT-4.1、Claude-3.5和Qwen2.5-Coder)上的广泛评估表明,SPELL的有效性,分别在八类恶意代码中实现了83.75%、19.38%和68.12%的攻击成功率。生成的提示在实际的AI开发工具(如Cursor)中成功生成了恶意代码,输出被最先进的检测系统确认为恶意的比例超过73%。这些发现揭示了当前LLM实现中的重大安全漏洞,并为提高代码生成应用中的AI安全对齐提供了宝贵的见解。
Summary / 总结
The research aims to address the security risks associated with large language models (LLMs) by developing a comprehensive testing framework called SPELL. SPELL uses a time-division selection strategy to construct jailbreaking prompts by combining sentences from a prior knowledge dataset, evaluating weaknesses of security alignment in malicious code generation. Across eight malicious code categories, the framework reaches attack success rates of 83.75%, 19.38%, and 68.12% on GPT-4.1, Claude-3.5, and Qwen2.5-Coder respectively. Generated prompts also produce malicious code in real-world AI development tools such as Cursor, with outputs confirmed as malicious by state-of-the-art detection systems at rates exceeding 73%, highlighting significant security gaps in LLMs.
研究旨在通过开发名为SPELL的全面测试框架来解决大型语言模型(LLMs)在AI辅助编码工具中的安全风险问题。SPELL使用时间分割选择策略,通过结合先前知识数据集中的句子来构建越狱提示,评估恶意代码生成中安全对齐的弱点。在八个恶意代码类别上,该框架在GPT-4.1、Claude-3.5和Qwen2.5-Coder三个先进代码模型上的攻击成功率分别为83.75%、19.38%和68.12%。生成的提示在Cursor等实际AI开发工具中同样能产生恶意代码,输出被最先进的检测系统确认为恶意的比例超过73%。这项工作揭示了当前LLM实现中的重大安全漏洞,并为提高代码生成应用中的AI安全对齐提供了宝贵的见解。
Knowledge Augmentation via Synthetic Data: A Framework for Real-World ECG Image Classification
Authors: Xiaoyu Wang, Ramesh Nadarajah, Zhiqiang Zhang, David Wong
First: 2025-07-29T16:16:17+00:00 · Latest: 2025-12-24T15:24:31+00:00
Comments: 10 pages, 6 figures
Abstract
In real-world clinical practice, electrocardiograms (ECGs) are often captured and shared as photographs. However, publicly available ECG data, and thus most related research, relies on digital signals. This has led to a disconnect in which computer-assisted interpretation of ECGs cannot easily be applied to ECG images. The emergence of high-fidelity synthetic data generators has introduced practical alternatives by producing realistic, photo-like ECG images derived from the digital signal that could help narrow this divide.
To address this, we propose a novel knowledge augmentation framework that uses synthetic data generated from multiple sources to provide generalisable and accurate interpretation of ECG photographs. Our framework features two key contributions. First, we introduce a robust pre-processing pipeline designed to remove background artifacts and reduce visual differences between images. Second, we implement a two-stage training strategy: a Morphology Learning Stage, where the model captures broad morphological features from visually different, scan-like synthetic data, followed by a Task-Specific Adaptation Stage, where the model is fine-tuned on the photo-like target data.
We tested the model on the British Heart Foundation Challenge dataset to classify five common ECG findings: myocardial infarction (MI), atrial fibrillation, hypertrophy, conduction disturbance, and ST/T changes. Our approach, built upon the ConvNeXt backbone, outperforms a single-source training baseline and achieved 1st place in the challenge with a macro-AUROC of 0.9677. These results suggest that incorporating morphology learning from heterogeneous sources offers a more robust and generalizable paradigm than conventional single-source training.
中文标题/摘要
标题:通过合成数据增强知识:一种用于心电图图像分类的框架
在实际临床实践中,心电图(ECGs)通常被拍摄并共享为照片。然而,公开可用的ECG数据,以及相关的大多数研究,依赖于数字信号。这导致了一个断层,即计算机辅助的ECG解释难以应用于ECG图像。高保真合成数据生成器的出现为解决这一问题提供了实际的替代方案,通过从数字信号中生成逼真、照片般的ECG图像来缩小这一差距。
为此,我们提出了一种新颖的知识增强框架,该框架使用来自多个来源生成的合成数据,以提供ECG照片的通用和准确解释。我们的框架有两个关键贡献。首先,我们引入了一个稳健的预处理管道,旨在去除背景伪影并减少图像之间的视觉差异。其次,我们实施了一种两阶段训练策略:形态学习阶段,模型从视觉上不同的扫描样式的合成数据中捕获广泛的形态特征,随后是任务特定适应阶段,模型在照片般的目标数据上进行微调。
我们在英国心脏基金会挑战数据集上测试了该模型,用于分类五种常见的心电图发现:心肌梗死(MI)、心房颤动、肥厚、传导障碍和ST/T变化。我们的方法基于ConvNeXt骨干网络,优于单一来源训练基线,并在挑战中获得了第1名,宏AUROC为0.9677。这些结果表明,从异质来源引入形态学习提供了一种比传统单一来源训练更稳健和通用的范式。
Summary / 总结
This paper addresses the challenge of applying computer-assisted ECG interpretation to images rather than digital signals. It proposes a knowledge augmentation framework using synthetic data to bridge this gap. The framework includes a pre-processing pipeline to remove artifacts and a two-stage training strategy: morphology learning from synthetic data and task-specific adaptation on photo-like images. The model, based on ConvNeXt, achieved a macro-AUROC of 0.9677 and won the British Heart Foundation Challenge, outperforming a single-source training baseline.
论文解决了将计算机辅助的心电图(ECG)解释应用于图像而非数字信号的挑战。它提出了一种使用合成数据的知识增强框架来弥合这一差距。该框架包括一个预处理管道以去除伪影和两阶段训练策略:从合成数据中学习形态学特征和在照片般的目标数据上进行微调。基于ConvNeXt的模型在英国心脏基金会挑战赛中获得第1名,宏AUROC为0.9677,优于单源训练基线。
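The headline metric above, macro-AUROC, is the unweighted mean of the per-class AUROC values. The sketch below computes it in plain Python using the pairwise-ranking definition of AUROC. It assumes each ECG finding is scored as its own binary problem. The function names and the toy labels and scores are illustrative, not the challenge's evaluation code.

```python
def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def macro_auroc(label_rows, score_rows):
    """Average the per-class AUROC with equal weight for every class."""
    n_classes = len(label_rows[0])
    per_class = [
        auroc([row[c] for row in label_rows], [row[c] for row in score_rows])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


# Toy data: 4 images, 2 findings (rows are images, columns are findings).
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.4, 0.5]]
print(macro_auroc(labels, scores))  # 0.875
```

Because every class contributes equally, a rare finding influences the macro average as much as a common one, which is why the metric is often preferred on imbalanced clinical data.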
Rethinking Memory in LLM based Agents: Representations, Operations, and Emerging Topics
Authors: Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan
First: 2025-05-01T17:31:33+00:00 · Latest: 2025-12-24T15:24:04+00:00
Abstract
Memory is fundamental to large language model (LLM)-based agents, but existing surveys emphasize application-level use (e.g., personalized dialogue), while overlooking the atomic operations governing memory dynamics. This work categorizes memory into parametric (implicit in model weights) and contextual (explicit external data, structured/unstructured) forms, and defines six core operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Condensation. Mapping these dimensions reveals four key research topics: long-term, long-context, parametric modification, and multi-source memory. The taxonomy provides a structured view of memory-related research, benchmarks, and tools, clarifying functional interactions in LLM-based agents and guiding future advancements. The datasets, papers, and tools are publicly available at https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI.
中文标题/摘要
标题:基于LLM的代理重新思考记忆:表示、操作及新兴主题
记忆是基于大型语言模型(LLM)的代理的基础,但现有的综述侧重于应用层面的使用(例如,个性化对话),而忽视了管理记忆动态的基本操作。本文将记忆分为参数化(隐含在模型权重中)和上下文化(显式的外部数据,结构化/非结构化)两种形式,并定义了六种核心操作:巩固、更新、索引、遗忘、检索和凝练。将这些维度映射出来揭示了四个关键研究主题:长期记忆、长上下文记忆、参数化修改和多源记忆。该分类提供了关于记忆相关研究、基准和工具的结构化视角,澄清了LLM代理中的功能交互,并指导未来的发展。相关数据集、论文和工具可在https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI 获取。
Summary / 总结
This work addresses the need for a deeper understanding of memory in large language model (LLM)-based agents by categorizing memory into parametric and contextual forms and defining six core operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Condensation. The research identifies four key research topics: long-term memory, long-context memory, parametric modification, and multi-source memory. A taxonomy is provided to clarify functional interactions in LLM-based agents, supporting future advancements in the field. The datasets, papers, and tools are publicly available.
本文通过将记忆分为参数性和上下文性两种形式,并定义了六种核心操作:巩固、更新、索引、遗忘、检索和凝练,来解决对大型语言模型(LLM)基础代理中的记忆理解不足的问题。研究确定了四个关键研究主题:长期、长上下文、参数修改和多源记忆。该分类提供了记忆相关研究、基准和工具的结构化视图,增强了对LLM基础代理中功能交互的理解,并指导未来的发展。相关数据集、论文和工具可在https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI 获取。
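To make the six operations named in the survey abstract more concrete, the toy class below attaches one method to each of them for a purely contextual (external, unstructured) memory. The abstract does not define the operations, so the behaviour chosen here is an assumed, simplified interpretation: the `ToyContextualMemory` class, its keyword index, and the truncation used for condensation are all hypothetical.

```python
class ToyContextualMemory:
    """Keyword-indexed text store illustrating six memory operations."""

    def __init__(self):
        self.entries = {}  # entry id -> text
        self.index = {}    # lowercase word -> set of entry ids
        self.next_id = 0

    def consolidate(self, text):
        """Consolidation: turn a new experience into a stored entry."""
        entry_id = self.next_id
        self.next_id += 1
        self.entries[entry_id] = text
        self._index(entry_id)
        return entry_id

    def _index(self, entry_id):
        """Indexing: make an entry reachable by its words."""
        for word in self.entries[entry_id].lower().split():
            self.index.setdefault(word, set()).add(entry_id)

    def _unindex(self, entry_id):
        for word in self.entries[entry_id].lower().split():
            ids = self.index.get(word)
            if ids is not None:
                ids.discard(entry_id)
                if not ids:
                    del self.index[word]

    def update(self, entry_id, new_text):
        """Updating: replace outdated content and refresh the index."""
        self._unindex(entry_id)
        self.entries[entry_id] = new_text
        self._index(entry_id)

    def forget(self, entry_id):
        """Forgetting: remove an entry entirely."""
        self._unindex(entry_id)
        del self.entries[entry_id]

    def retrieve(self, query):
        """Retrieval: return entries sharing at least one word with the query."""
        ids = set()
        for word in query.lower().split():
            ids |= self.index.get(word, set())
        return [self.entries[i] for i in sorted(ids)]

    def condense(self, max_words=5):
        """Condensation: shorten entries (here by naive truncation)."""
        return {i: " ".join(t.split()[:max_words]) for i, t in self.entries.items()}


memory = ToyContextualMemory()
tea = memory.consolidate("User likes green tea")
memory.consolidate("User lives in Paris")
memory.update(tea, "User likes black coffee")
print(memory.retrieve("coffee"))  # ['User likes black coffee']
```

Parametric memory, the other form in the taxonomy, lives in model weights and has no counterpart in this sketch.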
MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models
Authors: Andres M Bran, Tong Xie, Shai Pranesh, Jeffrey Meng, Xuan Vu Nguyen, Jeremy Goumaz, David Ming Segura, Ruizhi Xu, Dongzhan Zhou, Wenjie Zhang, Bram Hoex, Philippe Schwaller
First: 2025-12-24T15:15:18+00:00 · Latest: 2025-12-24T15:15:18+00:00
Abstract
Large Language Models can develop reasoning capabilities through online fine-tuning with rule-based rewards. However, recent studies reveal a critical constraint: reinforcement learning succeeds only when the base model already assigns non-negligible probability to correct answers -- a property we term 'latent solvability'. This work investigates the emergence of chemical reasoning capabilities and what these prerequisites mean for chemistry. We identify two necessary conditions for RL-based chemical reasoning: 1) Symbolic competence, and 2) Latent chemical knowledge. We propose mid-stage scientific training (MiST): a set of mid-stage training techniques to satisfy these, including data-mixing with SMILES/CIF-aware pre-processing, continued pre-training on 2.9B tokens, and supervised fine-tuning on 1B tokens. These steps raise the latent-solvability score on 3B and 7B models by up to 1.8x, and enable RL to lift top-1 accuracy from 10.9 to 63.9% on organic reaction naming, and from 40.6 to 67.4% on inorganic material generation. Similar results are observed for other challenging chemical tasks, while producing interpretable reasoning traces. Our results define clear prerequisites for chemical reasoning training and highlight the broader role of mid-stage training in unlocking reasoning capabilities.
中文标题/摘要
标题:MiST:理解中期科学训练在发展化学推理模型中的作用
大型语言模型可以通过基于规则的奖励进行在线微调来发展推理能力。然而,最近的研究揭示了一个关键限制:强化学习仅在基础模型已对正确答案赋予不可忽略的概率时才能成功,我们称这一特性为“潜在可解性”。本研究探讨了化学推理能力的出现及其先决条件对化学领域意味着什么。我们确定了基于强化学习的化学推理的两个必要条件:1)符号能力,2)潜在化学知识。我们提出了中期科学训练(MiST):一系列中期训练技术以满足这些条件,包括结合SMILES/CIF感知预处理的数据混合、在29亿个标记上继续预训练以及在10亿个标记上进行监督微调。这些步骤将3B和7B模型的潜在可解性得分最多提高至1.8倍,并使强化学习在有机反应命名中的top-1准确率从10.9%提升至63.9%,在无机材料生成中的top-1准确率从40.6%提升至67.4%。对于其他具有挑战性的化学任务,也观察到了类似的结果,同时生成了可解释的推理痕迹。我们的研究结果定义了化学推理训练的明确先决条件,并突显了中期训练在解锁推理能力中的更广泛作用。
Summary / 总结
This study explores the role of mid-stage scientific training (MiST) in developing chemical reasoning capabilities in large language models. It identifies two prerequisites: symbolic competence and latent chemical knowledge. The researchers propose MiST techniques such as data-mixing with SMILES/CIF-aware pre-processing, continued pre-training, and supervised fine-tuning, which raise latent-solvability scores on 3B and 7B models by up to 1.8x. With these in place, RL lifts top-1 accuracy from 10.9% to 63.9% on organic reaction naming and from 40.6% to 67.4% on inorganic material generation.
本研究探讨了通过中期科学训练(MiST)在大型语言模型中发展化学推理能力的方法。研究确定了两个先决条件:符号能力与潜在的化学知识。MiST 包括数据混合、SMILES/CIF 意识预处理、持续预训练和监督微调等技术。这些方法显著提高了模型的潜在可解性,使有机反应命名和无机材料生成任务的 top-1 准确率分别提高到 63.9% 和 67.4%。
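The abstract describes latent solvability as the base model already assigning non-negligible probability to correct answers, without giving the scoring formula. One simple proxy, shown below, is the fraction of problems for which at least one of several sampled answers is correct. This proxy, the `latent_solvability` function, and the toy data are assumptions for illustration and may not match the paper's score.

```python
def latent_solvability(samples_correct):
    """Fraction of problems with at least one correct sampled answer.

    samples_correct: one inner list per problem, holding a boolean per
    sampled answer (True when that sample was graded correct).
    """
    if not samples_correct:
        raise ValueError("need at least one problem")
    solved = sum(1 for attempts in samples_correct if any(attempts))
    return solved / len(samples_correct)


# Toy grading results: 4 problems, 3 sampled answers each.
graded = [
    [False, True, False],   # solvable: one sample is correct
    [False, False, False],  # never solved: RL would see zero reward here
    [True, True, False],
    [False, False, False],
]
print(latent_solvability(graded))  # 0.5
```

The second and fourth problems show why the property matters for rule-based RL: when no sampled answer is ever rewarded, there is no signal to reinforce, so training stalls on those problems.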