arXiv Paper Digest

2025-12-30 03:24
Snapshot: 20251230_0324
See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning
Authors: Shuoshuo Zhang, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Yujiu Yang, Rui Wang
First: 2025-12-26T18:59:47+00:00 · Latest: 2025-12-26T18:59:47+00:00
Abstract
Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.
中文标题/摘要
标题:见少而明:双向感知塑造用于多模态推理
大型视觉-语言模型(VLMs)通常从中间视觉提示中受益,这些提示要么通过外部工具注入,要么在推理过程中作为潜在视觉标记生成,但这些机制仍然忽略了细微的视觉证据(例如图表中的多边形线),跨领域泛化能力差,并且在推理时间成本高。在本文中,我们提出了双向感知塑造(BiPS),它将问题条件下的掩码视图转换为双向的看哪里信号,在训练过程中塑造感知。BiPS 首先在原始图像和保留仅与问题相关区域的证据保留视图之间施加KL一致性约束,鼓励粗略但完整的支持像素覆盖。然后在原始图像和一个关键像素被遮蔽的证据消除视图之间施加KL分离约束,该视图不再支持原始答案,从而避免仅从文本回答(即,仅从文本回答)并强制细粒度的视觉依赖。在八个基准测试中,BiPS 将 Qwen2.5-VL-7B 的性能平均提升 8.2%,并在未见过的数据集和图像类型上展示了强大的跨域泛化能力。
Summary / 总结
The research aims to improve the fine-grained visual evidence utilization and generalization of large vision-language models. Bi-directional Perceptual Shaping (BiPS) is proposed to transform question-conditioned masked views into bidirectional where-to-look signals, shaping perception during training. This method enhances the model's ability to focus on relevant visual regions and avoid text-only shortcuts. Across eight benchmarks, BiPS improves Qwen2.5-VL-7B by 8.2% on average and demonstrates strong out-of-domain generalization to unseen datasets and image types.
研究旨在提高大型视觉-语言模型对细粒度视觉证据的利用和泛化能力。提出了双向感知塑造(BiPS)方法,将问题条件下的遮罩视图转换为双向的注视信号,在训练过程中塑造感知。该方法增强了模型聚焦相关视觉区域的能力,并避免了仅依赖文本的捷径。在八个基准测试中,BiPS将Qwen2.5-VL-7B的性能平均提升了8.2%,并在未见过的数据集和图像类型上展示了强大的跨域泛化能力。
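The two BiPS constraints described in the abstract above can be sketched on toy answer distributions. This is a minimal illustration, not the authors' implementation: the direction of the KL terms, the hinge form of the separation term, and the `margin` parameter are all assumptions, and the three logit vectors stand in for model outputs on the original, evidence-preserving, and evidence-ablated views.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def bips_style_loss(logits_orig, logits_preserved, logits_ablated, margin=1.0):
    """Hypothetical combination of a consistency and a separation term.

    consistency: pull the evidence-preserving view toward the original.
    separation:  push the evidence-ablated view at least `margin` nats away
                 (hinge form and margin are assumptions, not from the paper).
    """
    p_orig = softmax(logits_orig)
    p_keep = softmax(logits_preserved)
    p_abl = softmax(logits_ablated)
    consistency = kl_divergence(p_orig, p_keep)
    separation = max(0.0, margin - kl_divergence(p_orig, p_abl))
    return consistency + separation, consistency, separation
```

When the preserved view reproduces the original prediction and the ablated view flips it, both terms vanish; when the ablated view still yields the original answer, the separation term equals the margin, which is the text-only shortcut the abstract says should be discouraged.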
ProEdit: Inversion-based Editing From Prompts Done Right
Authors: Zhi Ouyang, Dian Zheng, Xiao-Ming Wu, Jian-Jian Jiang, Kun-Yu Lin, Jingke Meng, Wei-Shi Zheng
First: 2025-12-26T18:59:14+00:00 · Latest: 2025-12-26T18:59:14+00:00
Comments: Equal contributions from first two authors. Project page: https://isee-laboratory.github.io/ProEdit/ Code: https://github.com/iSEE-Laboratory/ProEdit
Abstract
Inversion-based visual editing provides an effective and training-free way to edit an image or a video based on user instructions. Existing methods typically inject source image information during the sampling process to maintain editing consistency. However, this sampling strategy overly relies on source information, which negatively affects the edits in the target image (e.g., failing to change the subject's attributes like pose, number, or color as instructed). In this work, we propose ProEdit to address this issue in both the attention and the latent aspects. In the attention aspect, we introduce KV-mix, which mixes KV features of the source and the target in the edited region, mitigating the influence of the source image on the editing region while maintaining background consistency. In the latent aspect, we propose Latents-Shift, which perturbs the edited region of the source latent, eliminating the influence of the inverted latent on the sampling. Extensive experiments on several image and video editing benchmarks demonstrate that our method achieves SOTA performance. In addition, our design is plug-and-play and can be seamlessly integrated into existing inversion and editing methods, such as RF-Solver, FireFlow and UniEdit.
中文标题/摘要
标题:ProEdit:基于反转的编辑方法实现精准编辑
基于反转的视觉编辑提供了一种有效且无需训练的方法,可以根据用户指令编辑图像或视频。现有方法通常在采样过程中注入源图像信息以保持编辑一致性。然而,这种采样策略过度依赖源信息,这会负面影响目标图像中的编辑效果(例如,无法按照指令改变主体的姿态、数量或颜色)。在本文中,我们提出ProEdit以在注意力和潜在方面解决这一问题。在注意力方面,我们引入了KV-mix,它在编辑区域混合源和目标的KV特征,减轻了源图像对编辑区域的影响,同时保持背景一致性。在潜在方面,我们提出了Latents-Shift,它扰动源潜在的编辑区域,消除了反转潜在对采样的影响。在几个图像和视频编辑基准上的广泛实验表明,我们的方法达到了SOTA性能。此外,我们的设计是即插即用的,可以无缝集成到现有的反转和编辑方法中,如RF-Solver、FireFlow和UniEdit。
Summary / 总结
ProEdit addresses the issue of overly relying on source image information in inversion-based visual editing, which negatively affects the edits in the target image. It introduces KV-mix to mix KV features of the source and target in the edited region, and Latents-Shift to perturb the edited region of the source latent, maintaining background consistency. Experiments show that ProEdit achieves state-of-the-art performance and can be easily integrated into existing methods like RF-Solver, FireFlow, and UniEdit.
ProEdit 解决了基于反转的视觉编辑中过度依赖源图像信息的问题,这会负面影响目标图像中的编辑效果。它通过引入 KV-mix 混合编辑区域中源图像和目标图像的 KV 特征,以及 Latents-Shift 扰动源图像的编辑区域的潜在特征,来保持背景一致性。实验表明,ProEdit 达到了最先进的性能,并且可以无缝集成到现有的方法如 RF-Solver、FireFlow 和 UniEdit 中。
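The KV-mix idea in the ProEdit entry above (blend source and target key/value features inside the edited region, keep the source elsewhere) can be sketched with plain lists. The linear blend and the `alpha` weight are assumptions made for illustration; the abstract does not state the actual mixing rule, and `kv_mix` is a hypothetical name.

```python
def kv_mix(source_kv, target_kv, edit_mask, alpha=0.5):
    """Blend per-token feature vectors inside the edit mask.

    source_kv, target_kv: lists of per-token feature vectors (lists of floats).
    edit_mask: list of booleans, True where the token lies in the edited region.
    alpha: hypothetical weight given to the target features inside the mask.
    Tokens outside the mask keep the source features (background consistency).
    """
    mixed = []
    for src, tgt, in_edit in zip(source_kv, target_kv, edit_mask):
        if in_edit:
            mixed.append([alpha * t + (1.0 - alpha) * s for s, t in zip(src, tgt)])
        else:
            mixed.append(list(src))
    return mixed
```

Setting `alpha=1.0` would drop the source entirely inside the edited region, while `alpha=0.0` reduces to plain source injection, the behaviour the abstract identifies as over-reliant on the source.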
Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications
Authors: Shengkun Cui, Rahul Krishna, Saurabh Jha, Ravishankar K. Iyer
First: 2025-12-26T18:56:18+00:00 · Latest: 2025-12-26T18:56:18+00:00
Abstract
Cloud incidents pose major operational challenges in production, with unresolved production cloud incidents costing on average over $2M per hour. Prior research identifies code- and configuration-related issues as the predominant category of root causes in cloud incidents. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graphs: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Together, these graphs encode microservice- and code-level dependencies, and the LLM acts as a traversal policy over these graphs, moving between services and code dependencies to localize and explain failures. Compared to state-of-the-art ReAct baselines, PRAXIS improves RCA accuracy by up to 3.1x while reducing token consumption by 3.8x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.
中文标题/摘要
标题:代理结构化图遍历在云应用代码相关事故根本原因分析中的应用
云事故在生产中带来了重大的运营挑战,未解决的生产云事故平均每小时成本超过200万美元。先前的研究指出,代码和配置相关问题是云事故中主要的根本原因类别。本文介绍了PRAXIS,一种管理并部署代理工作流以诊断代码和配置引起的云事故的协调器。PRAXIS利用LLM驱动的结构化遍历两种类型的图:(1)服务依赖图(SDG),捕捉微服务级别的依赖关系;(2) hammock-block程序依赖图(PDG),捕捉每个微服务的代码级别依赖关系。这些图共同编码了微服务和代码级别的依赖关系,而LLM作为这些图上的遍历策略,通过在服务和代码依赖关系之间移动来定位和解释故障。与最先进的ReAct基线相比,PRAXIS将根本原因分析的准确性提高了3.1倍,同时将令牌消耗减少了3.8倍。PRAXIS已在30个全面的现实世界事故上进行了演示,这些事故正在被编译成根本原因分析基准。
Summary / 总结
This paper addresses the challenge of diagnosing cloud incidents caused by code and configuration issues, which are costly and frequent. It presents PRAXIS, which uses an LLM-driven structured traversal over service dependency and program dependence graphs to diagnose these issues. PRAXIS improves root cause analysis accuracy by up to 3.1 times and reduces token consumption by 3.8 times compared to existing methods. The system is tested on 30 real-world incidents, showing its effectiveness in diagnosing cloud application failures.
本文针对由代码和配置问题引起的云故障进行诊断的挑战,这些问题既频繁又昂贵。文中提出了一种名为PRAXIS的协调器,利用LLM驱动的结构化遍历服务依赖图和 hammock-block 程序依赖图来诊断和解释故障。与现有方法相比,PRAXIS在根因分析准确性上提高了3.1倍,同时使用了3.8倍少的令牌。该方法在30个真实世界案例上进行了验证,展示了其在云应用程序根因分析中的有效性。
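The two-level traversal in the PRAXIS entry above can be sketched as a generic graph walk driven by a pluggable policy. In this sketch the LLM is replaced by a hypothetical score-based stub, the graphs are plain adjacency dictionaries, and the `"entry"` node name for each PDG is an assumption; none of this is the authors' code.

```python
def traverse(graph, start, policy, max_steps=10):
    """Walk `graph` from `start`, asking `policy` which neighbor to visit next."""
    path = [start]
    current = start
    for _ in range(max_steps):
        neighbors = graph.get(current, [])
        if not neighbors:
            break
        nxt = policy(current, neighbors, path)
        if nxt is None:  # the policy decides to stop here
            break
        path.append(nxt)
        current = nxt
    return path

def make_score_policy(scores):
    """Stub standing in for the LLM: follow the most suspicious unvisited neighbor."""
    def policy(current, neighbors, path):
        candidates = [n for n in neighbors if n not in path]
        if not candidates:
            return None
        best = max(candidates, key=lambda n: scores.get(n, 0.0))
        return best if scores.get(best, 0.0) > 0.0 else None
    return policy

def diagnose(sdg, pdgs, entry_service, service_policy, code_policy):
    """First localize a suspect service on the SDG, then walk that service's PDG."""
    suspect = traverse(sdg, entry_service, service_policy)[-1]
    pdg = pdgs.get(suspect, {})
    code_path = traverse(pdg, "entry", code_policy) if pdg else []
    return suspect, code_path
```

A real system would replace `make_score_policy` with a call that shows the model the current node, its neighbors, and the path so far; the stub only makes the control flow testable.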
Explainable Multimodal Regression via Information Decomposition
Authors: Zhaozhao Ma, Shujian Yu
First: 2025-12-26T18:07:18+00:00 · Latest: 2025-12-26T18:07:18+00:00
Comments: Project Page: https://github.com/zhaozhaoma/PIDReg
Abstract
Multimodal regression aims to predict a continuous target from heterogeneous input sources and typically relies on fusion strategies such as early or late fusion. However, existing methods lack principled tools to disentangle and quantify the individual contributions of each modality and their interactions, limiting the interpretability of multimodal fusion. We propose a novel multimodal regression framework grounded in Partial Information Decomposition (PID), which decomposes modality-specific representations into unique, redundant, and synergistic components. The basic PID framework is inherently underdetermined. To resolve this, we introduce inductive bias by enforcing Gaussianity in the joint distribution of latent representations and the transformed response variable (after inverse normal transformation), thereby enabling analytical computation of the PID terms. Additionally, we derive a closed-form conditional independence regularizer to promote the isolation of unique information within each modality. Experiments on six real-world datasets, including a case study on large-scale brain age prediction from multimodal neuroimaging data, demonstrate that our framework outperforms state-of-the-art methods in both predictive accuracy and interpretability, while also enabling informed modality selection for efficient inference. Implementation is available at https://github.com/zhaozhaoma/PIDReg.
中文标题/摘要
标题:基于信息分解的可解释多模态回归
多模态回归旨在从异构输入源中预测连续目标,通常依赖于早期或晚期融合策略。然而,现有方法缺乏将每种模态及其相互作用的个体贡献分离和量化的原则性工具,限制了多模态融合的可解释性。我们提出了一种基于部分信息分解(PID)的新颖多模态回归框架,该框架将模态特定的表示分解为独特的、冗余的和协同的成分。基本的PID框架本质上是欠定的。为了解决这个问题,我们通过在潜在表示和转换后的响应变量(经过逆正态变换)的联合分布中引入高斯性诱导偏置,从而使得PID项的解析计算成为可能。此外,我们推导出一个封闭形式的条件独立性正则化项,以促进每个模态中独特信息的隔离。在六个真实世界数据集上的实验,包括大规模脑年龄预测的多模态神经影像学数据案例研究,表明我们的框架在预测准确性和可解释性方面均优于最先进的方法,同时还能实现高效的模态选择。项目实现可从https://github.com/zhaozhaoma/PIDReg获取。
Summary / 总结
The paper addresses the need for interpretable multimodal regression by proposing a framework based on Partial Information Decomposition (PID). It decomposes modality-specific representations into unique, redundant, and synergistic components and introduces inductive bias to resolve the underdetermined nature of PID. Experiments on six real-world datasets show that the proposed method outperforms existing methods in both predictive accuracy and interpretability, and it also facilitates informed modality selection for efficient inference.
该论文提出了一种基于部分信息分解(PID)的新型多模态回归框架,通过将模态特定表示分解为独特的、冗余的和协同的组件来提高可解释性。通过在潜在表示和转换后的响应变量的联合分布中强制高斯性,该框架能够进行PID项的解析计算,并促进每个模态内独特信息的隔离。实验结果显示,该方法在预测准确性和可解释性方面均优于现有方法,并有助于高效推理中的模态选择。
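To illustrate why a Gaussian assumption makes PID terms analytically computable, here is a small sketch for two scalar sources and a scalar target that are jointly Gaussian. It uses the minimum-mutual-information (MMI) redundancy, which is one common PID definition for Gaussians and is swapped in here as an assumption; the PIDReg abstract does not say which redundancy measure the framework uses. All quantities are in nats.

```python
import math

def gaussian_mi(r_squared):
    """Mutual information (nats) between jointly Gaussian variables with squared correlation r^2."""
    return -0.5 * math.log(1.0 - r_squared)

def mmi_pid(r1, r2, r12):
    """PID of I(X1, X2; Y) under the MMI redundancy (an assumed definition).

    r1, r2: correlations of X1 and X2 with the target Y.
    r12:    correlation between X1 and X2 (must satisfy |r12| < 1).
    """
    i1 = gaussian_mi(r1 ** 2)
    i2 = gaussian_mi(r2 ** 2)
    # Squared multiple correlation of Y on (X1, X2).
    joint_r2 = (r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * r12) / (1.0 - r12 ** 2)
    i12 = gaussian_mi(joint_r2)
    redundancy = min(i1, i2)
    return {
        "redundant": redundancy,
        "unique_1": i1 - redundancy,
        "unique_2": i2 - redundancy,
        "synergy": i12 - i1 - i2 + redundancy,
        "total": i12,
    }
```

By construction the four components sum to the joint mutual information, which is the bookkeeping that lets each modality's contribution be read off directly.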
Rewards-based image analysis in microscopy
Authors: Kamyar Barakati, Yu Liu, Utkarsh Pratiush, Boris N. Slautin, Sergei V. Kalinin
First: 2025-02-23T19:19:38+00:00 · Latest: 2025-12-26T18:04:07+00:00
Comments: 41 pages, 11 figures
Abstract
Imaging and hyperspectral data analysis is central to progress across biology, medicine, chemistry, and physics. The core challenge lies in converting high-resolution or high-dimensional datasets into interpretable representations that enable insight into the underlying physical or chemical properties of a system. Traditional analysis relies on expert-designed, multistep workflows, such as denoising, feature extraction, clustering, dimensionality reduction, and physics-based deconvolution, or on machine learning (ML) methods that accelerate individual steps. Both approaches, however, typically demand significant human intervention, including hyperparameter tuning and data labeling. Achieving the next level of autonomy in scientific imaging requires designing effective reward-based workflows that guide algorithms toward the best data representation for human or automated decision-making. Here, we discuss recent advances in reward-based workflows for image analysis, which capture key elements of human reasoning and exhibit strong transferability across various tasks. We highlight how reward-driven approaches enable a shift from supervised black-box models toward explainable, unsupervised optimization, using Scanning Probe and Electron Microscopies as examples. Such reward-based frameworks are promising for a broad range of applications, including classification, regression, structure-property mapping, and general hyperspectral data processing.
中文标题/摘要
标题:基于奖励的显微镜图像分析
成像和超光谱数据分析是生物学、医学、化学和物理学进步的核心。核心挑战在于将高分辨率或高维数据集转换为可解释的表示形式,以揭示系统中潜在的物理或化学性质。传统分析依赖于专家设计的多步骤工作流,如去噪、特征提取、聚类、降维和基于物理的反卷积,或者依赖于加速各个步骤的机器学习(ML)方法。然而,这两种方法通常都需要大量的人工干预,包括超参数调整和数据标注。在科学成像中实现更高水平的自主性需要设计有效的奖励驱动工作流,以引导算法向最佳数据表示方向发展,以供人类或自动化决策使用。在这里,我们讨论了图像分析中奖励驱动工作流的最新进展,这些工作流捕捉了人类推理的关键元素,并在各种任务之间表现出强大的可转移性。我们强调奖励驱动方法如何从监督的黑盒模型转向可解释的、无监督的优化,例如扫描探针显微镜和电子显微镜。这样的奖励驱动框架对于广泛的应用具有前景,包括分类、回归、结构-性质映射和一般超光谱数据处理。
Summary / 总结
The research aims to enhance autonomous scientific imaging by developing reward-based workflows that reduce human intervention in image analysis. The method involves designing algorithms that optimize data representation through reward signals, enabling transferability across different tasks. Key findings show that these reward-driven approaches improve the explainability and transferability of image analysis in Scanning Probe and Electron Microscopies, facilitating tasks such as classification and structure-property mapping without extensive human supervision.
论文探讨了将高分辨率或高维成像和超光谱数据转换为可解释表示以进行科学研究的挑战。它介绍了基于奖励的工作流,可以引导算法向最佳数据表示方向发展,减少人类干预的需求。主要发现包括这些基于奖励的方法能够实现可解释的无监督优化,并在扫描探针显微镜和电子显微镜等任务中的分类、回归和结构-性质映射等方面表现出强大的跨任务转移能力。
A2P-Vis: an Analyzer-to-Presenter Agentic Pipeline for Visual Insights Generation and Reporting
Authors: Shuyu Gan, Renxiang Wang, James Mooney, Dongyeop Kang
Venue: 1st Workshop on GenAI, Agents, and the Future of VIS (VIS x GenAI), November 2025, Vienna, Austria
First: 2025-12-26T18:02:12+00:00 · Latest: 2025-12-26T18:02:12+00:00
Comments: 3 pages, 3 figures; Accepted by the 1st Workshop on GenAI, Agents and the Future of VIS as a Mini-challenge paper and won the Honorable Mention award. Submission number is 7597 and the paper is archived on the workshop website: https://visxgenai.github.io/subs-2025/7597/7597-doc.pdf
Abstract
Automating end-to-end data science pipelines with AI agents still stalls on two gaps: generating insightful, diverse visual evidence and assembling it into a coherent, professional report. We present A2P-Vis, a two-part, multi-agent pipeline that turns raw datasets into a high-quality data-visualization report. The Data Analyzer orchestrates profiling, proposes diverse visualization directions, generates and executes plotting code, filters low-quality figures with a legibility checker, and elicits candidate insights that are automatically scored for depth, correctness, specificity, and actionability. The Presenter then orders topics, composes chart-grounded narratives from the top-ranked insights, writes justified transitions, and revises the document for clarity and consistency, yielding a coherent, publication-ready report. Together, these agents convert raw data into curated materials (charts + vetted insights) and into a readable narrative without manual glue work. We claim that by coupling a quality-assured Analyzer with a narrative Presenter, A2P-Vis operationalizes co-analysis end-to-end, improving the real-world usefulness of automated data analysis for practitioners. For the complete dataset report, please see: https://www.visagent.org/api/output/f2a3486d-2c3b-4825-98d4-5af25a819f56.
中文标题/摘要
标题:A2P-Vis:分析师到展示者自主管道,用于生成和报告视觉洞察
尽管使用AI代理自动化端到端的数据科学管道仍然存在两个瓶颈:生成有洞察力且多样的视觉证据以及将其组织成连贯且专业的报告。我们提出了A2P-Vis,这是一种两部分的多代理管道,能够将原始数据集转化为高质量的数据可视化报告。数据分析师协调数据概要分析,提出多样化的可视化方向,生成并执行绘图代码,使用可读性检查器过滤低质量的图表,并激发候选洞察,这些洞察将自动评分以评估其深度、正确性、具体性、深度和可操作性。展示者随后按主题排序,从排名最高的洞察中构建基于图表的故事叙述,撰写有说服力的过渡,并修订文档以提高清晰度和一致性,从而生成一个连贯且适合出版的报告。这些代理共同将原始数据转化为精心策划的材料(图表+验证过的洞察)和可读的故事叙述,无需手动粘合工作。我们声称,通过结合质量保证的分析师和叙述性的展示者,A2P-Vis 实现了从头到尾的协同分析,提高了自动化数据分析在实际应用中的实用性。完整的数据报告请参见:https://www.visagent.org/api/output/f2a3486d-2c3b-4825-98d4-5af25a819f56。
Summary / 总结
A2P-Vis is a two-part pipeline that automates the generation of high-quality data-visualization reports by using an Analyzer to propose diverse visualizations, generate plotting code, and score insights, followed by a Presenter to order topics, compose narratives, and revise the document for clarity. The key findings show that A2P-Vis effectively converts raw datasets into coherent, publication-ready reports without manual intervention, enhancing the real-world usefulness of automated data analysis for practitioners.
A2P-Vis 是一个两部分的自动化管道,用于生成高质量的数据可视化报告。它包括一个数据分析器,负责数据概览、提出可视化方向、生成图表、过滤低质量图表并评分以确保质量。然后,呈现器组织主题,编写基于图表的故事,并修订文档以提高清晰度。最终结果是一个无需人工干预的连贯且可发表的报告。该系统旨在端到端地实现协同分析,增强自动化数据分析的实际应用价值。
Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis
Authors: Duygu Altinok
First: 2025-12-26T18:02:09+00:00 · Latest: 2025-12-26T18:02:09+00:00
Comments: under review by Springer
Abstract
Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.
中文标题/摘要
标题:介绍TrGLUE和SentiTurca:面向土耳其语通用语言理解和情感分析的综合基准
评估各种模型架构(如变换器、大型语言模型(LLMs)和其他NLP系统)的表现需要跨多个维度进行全面基准测试,其中自然语言理解(NLU)的评估尤为重要,因为它是评估模型能力的基本标准。因此,建立能够从多角度进行彻底评估和分析的基准测试是必要的。虽然GLUE基准为英语NLU评估设定了标准,但其他语言也开发了类似的基准,如CLUE(中文)、FLUE(法语)和JGLUE(日语),但目前尚无适用于土耳其语的类似基准。为填补这一空白,我们引入了TrGLUE,这是一个涵盖多种土耳其语NLU任务的综合基准。此外,我们还提出了SentiTurca,一个专门用于情感分析的基准。为了支持研究人员,我们还提供了针对变换器模型的微调和评估代码,便于有效使用这些基准。TrGLUE包含经过精心策划的土耳其本土语料库,旨在模仿GLUE风格评估的领域和任务形式,标签通过结合强LLM注释、跨模型一致性检查和后续的人工验证的半自动化管道获得。这种设计优先考虑语言自然性,减少直接翻译的痕迹,并提供可扩展、可重复的工作流程。通过TrGLUE,我们的目标是建立一个稳健的土耳其语NLU评估框架,为研究人员提供有价值的资源,并提供生成高质量半自动化数据集的见解。
Summary / 总结
The paper introduces TrGLUE and SentiTurca, benchmarks for evaluating natural language understanding and sentiment analysis in Turkish. TrGLUE includes a variety of NLU tasks with Turkish-native corpora, while SentiTurca focuses on sentiment analysis. The benchmarks are designed to mirror the GLUE benchmark for English, using a semi-automated pipeline for annotation to ensure linguistic naturalness and reproducibility. Key findings include the successful creation of these benchmarks, which will enable thorough evaluation of NLU models in Turkish and support research in this area.
该论文介绍了针对土耳其语自然语言理解和情感分析的综合基准TrGLUE和SentiTurca。受建立标准化评估框架的需求驱动,作者开发了TrGLUE,其中包括多种NLU任务,并使用半自动化标注过程确保语言自然性。主要发现包括成功创建了一个可扩展且可重复的基准,其风格类似于GLUE评估,能够对土耳其语NLU能力进行全面分析。SentiTurca是一个专门的情感分析基准,旨在支持研究人员评估情感分析模型。作者还提供了针对基于变换器模型的微调和评估代码,以促进这些基准的使用。
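The cross-model agreement check mentioned in the TrGLUE entry above can be sketched as a simple routing rule. The threshold, the status strings, and the function name `route_label` are hypothetical; the abstract only says that LLM annotation, agreement checks, and human validation are combined, not how items are routed.

```python
from collections import Counter

def route_label(model_labels, min_agreement=1.0):
    """Pick the majority label and flag low-agreement items for closer review.

    model_labels: non-empty list of labels proposed by different annotator models.
    min_agreement: hypothetical fraction of models that must agree.
    """
    label, votes = Counter(model_labels).most_common(1)[0]
    agreement = votes / len(model_labels)
    status = "auto_accepted" if agreement >= min_agreement else "needs_human_review"
    return label, status
```

In a pipeline like the one described, items in either status could still be sampled for human validation; the rule only decides which ones are prioritized.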
Yume-1.5: A Text-Controlled Interactive World Generation Model
Authors: Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang
First: 2025-12-26T17:52:49+00:00 · Latest: 2025-12-26T17:52:49+00:00
Abstract
Recent approaches have demonstrated the promise of using diffusion models to generate interactive and explorable worlds. However, most of these methods face critical challenges such as excessively large parameter sizes, reliance on lengthy inference steps, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt. Yume-1.5 achieves this through a carefully designed framework that supports keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; (3) a text-controlled method for generating world events. We have provided the codebase in the supplementary material.
中文标题/摘要
标题:Yume-1.5:一种文本控制的交互世界生成模型
近期的方法表明,使用扩散模型生成交互和可探索的世界具有很大的潜力。然而,这些方法大多面临着参数量过大、依赖于长时间推理步骤以及历史上下文迅速增长等关键挑战,这严重限制了实时性能,并缺乏文本控制生成能力。为了解决这些挑战,我们提出了一种名为Yume-1.5的新框架,该框架旨在从单张图片或文本提示生成逼真、交互和连续的世界。Yume-1.5通过一个精心设计的框架实现这一点,该框架支持基于键盘的生成世界探索。该框架包括三个核心组件:(1)结合统一上下文压缩和线性注意力的长视频生成框架;(2)由双向注意力蒸馏和增强的文本嵌入方案驱动的实时流式加速策略;(3)一种用于生成世界事件的文本控制方法。我们已在附录中提供了代码库。
Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity
Authors: Cong Qi, Hanzhang Fang, Tianxing Hu, Siqi Jiang, Wei Zhi
First: 2025-04-22T20:34:47+00:00 · Latest: 2025-12-26T17:42:50+00:00
Abstract
Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
中文标题/摘要
标题:单细胞数据的双向Mamba:具有生物忠实性的高效上下文学习
单细胞RNA测序(scRNA-seq)能够实现细胞异质性的高分辨率分析,但其复杂性,表现为高维度、稀疏性和批次效应,这提出了重大的计算挑战。基于变换器的模型在这一领域取得了显著进展,但往往受限于其二次复杂性和对长距离依赖的次优处理。在本工作中,我们引入了GeneMamba,这是一种基于状态空间建模的可扩展且高效的单细胞转录组学基础模型。利用双向Mamba架构,GeneMamba以线性时间复杂度捕获双向基因上下文,相比变换器基线模型提供了显著的计算增益。该模型在近3000万个细胞上进行了预训练,并结合了生物启发的目标,包括路径感知对比损失和基于排名的基因编码。我们在多种任务上评估了GeneMamba,包括多批次整合、细胞类型注释和基因-基因相关性,展示了其强大的性能、可解释性和鲁棒性。这些结果将GeneMamba定位为变换器基线方法的实用且强大的替代方案,推动了生物基础的、可扩展的大型单细胞数据分析工具的发展。
Summary / 总结
GeneMamba is a scalable foundation model for single-cell transcriptomics that uses a Bi-Mamba architecture to capture bidirectional gene context with linear-time complexity, addressing the computational challenges of high-dimensional and sparse scRNA-seq data. Pretrained on nearly 30 million cells, GeneMamba incorporates biologically informed objectives and demonstrates strong performance in tasks such as multi-batch integration, cell type annotation, and gene-gene correlation, showing interpretability and robustness compared to transformer-based methods.
GeneMamba 是一种用于单细胞转录组学的可扩展基础模型,使用 Bi-Mamba 架构捕获双向基因上下文,以线性时间复杂度解决高维和稀疏 scRNA-seq 数据的计算挑战。GeneMamba 在近 3000 万个细胞上进行预训练,并结合生物启发的目标,展示了在多批次整合、细胞类型注释和基因-基因相关性等任务中的强大性能,表现出可解释性和鲁棒性,相比基于变换器的方法更具优势。
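The rank-based gene encoding named in the GeneMamba entry above can be illustrated by turning one cell's expression profile into an ordered token sequence. Dropping unexpressed genes, breaking ties alphabetically, and the optional `max_len` truncation are assumptions made for this sketch; the abstract does not describe the exact scheme.

```python
def rank_encode(expression, max_len=None):
    """Order a cell's genes from most to least expressed and return the gene names.

    expression: dict mapping gene symbol -> expression value for a single cell.
    max_len: optional cap on the sequence length (hypothetical parameter).
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    # Sort by descending expression; ties fall back to the gene name for determinism.
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    tokens = [gene for gene, _ in expressed]
    return tokens if max_len is None else tokens[:max_len]
```

Because only the ordering is kept, the resulting sequence is insensitive to per-cell scaling of the counts, which is one reason rank-style encodings are attractive for data with batch effects.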
Unifying Learning Dynamics and Generalization in Transformers Scaling Law
Authors: Chiwun Yang
First: 2025-12-26T17:20:09+00:00 · Latest: 2025-12-26T17:20:09+00:00
Abstract
The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish a theoretical upper bound on excess risk characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost $\mathsf{C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/6})$. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the upper bounds of generalization.
Summary / 总结
This work aims to understand the theoretical underpinnings of the scaling law in Large Language Models (LLMs) by formalizing the learning dynamics of transformer-based models as an ODE system and approximating it to kernel behaviors. The study rigorously analyzes stochastic gradient descent training for multi-layer transformers on sequence-to-sequence data, showing that the excess risk decays exponentially in the initial optimization phase but transitions to a power-law decay once a specific resource allocation threshold is crossed. The theory establishes distinct scaling laws for model size, training time, and dataset size, providing insights into how each variable affects the upper bounds of generalization.
该研究旨在通过将变压器模型的学习动力学形式化为常微分方程系统并近似为核行为,来理解大规模语言模型(LLM)的缩放定律的理论基础。研究对多层变压器在序列到序列数据上的随机梯度下降训练进行了严格的分析,表明在初始优化阶段,过拟合风险呈指数衰减,但一旦达到特定的资源分配阈值,系统进入统计阶段,过拟合误差遵循幂律衰减。理论还为模型规模、训练时间和数据集大小建立了独立的缩放定律,揭示了每个变量如何影响泛化能力的上限。
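A short piece of arithmetic shows what the statistical-phase rate in the entry above implies. If the bound scales as C^(-1/6), shrinking it by a factor k requires multiplying compute by k^6. The function name is hypothetical, and the calculation describes how the stated upper bound scales, not measured model loss.

```python
def compute_multiplier(error_reduction_factor, decay_exponent=1.0 / 6.0):
    """Compute multiplier needed to shrink a C^(-decay_exponent) bound by the given factor.

    Solves (m * C) ** (-a) = C ** (-a) / k for m, giving m = k ** (1 / a).
    """
    return error_reduction_factor ** (1.0 / decay_exponent)
```

For example, halving the bound in this regime corresponds to about 64 times more compute, in contrast to the exponential decay of the earlier optimization phase.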
Context as a Tool: Context Management for Long-Horizon SWE-Agents
Authors: Shukai Liu, Jian Yang, Bo Jiang, Yizhi Li, Jinyang Guo, Xianglong Liu, Bryan Dai
First: 2025-12-26T17:15:47+00:00 · Latest: 2025-12-26T17:15:47+00:00
Abstract
Agents based on large language models have recently shown strong potential on real-world software engineering (SWE) tasks that require long-horizon interaction with repository-scale codebases. However, most existing agents rely on append-only context maintenance or passively triggered compression heuristics, which often lead to context explosion, semantic drift, and degraded reasoning in long-running interactions. We propose CAT, a new context management paradigm that elevates context maintenance to a callable tool integrated into the decision-making process of agents. CAT formalizes a structured context workspace consisting of stable task semantics, condensed long-term memory, and high-fidelity short-term interactions, and enables agents to proactively compress historical trajectories into actionable summaries at appropriate milestones. To support context management for SWE-agents, we propose a trajectory-level supervision framework, CAT-GENERATOR, based on an offline data construction pipeline that injects context-management actions into complete interaction trajectories. Using this framework, we train a context-aware model, SWE-Compressor. Experiments on SWE-Bench-Verified demonstrate that SWE-Compressor reaches a 57.6% solved rate and significantly outperforms ReAct-based agents and static compression baselines, while maintaining stable and scalable long-horizon reasoning under a bounded context budget.
中文标题/摘要
标题:上下文作为工具:面向长期视角的SWE代理上下文管理
基于大型语言模型的代理最近在需要与仓库规模代码库进行长期交互的真实世界软件工程(SWE)任务中展现了强大的潜力。然而,大多数现有代理依赖于追加式的上下文维护或被动触发的压缩启发式方法,这通常会导致上下文膨胀、语义漂移和长期交互中的推理退化。我们提出了一种新的上下文管理范式CAT,将上下文维护提升为集成到代理决策过程中的可调用工具。CAT形式化了一个结构化的上下文工作空间,包括稳定的任务语义、浓缩的长期记忆和高保真的短期交互,并使代理能够在适当的时间节点主动压缩历史轨迹为可操作的摘要。为了支持SWE代理的上下文管理,我们基于离线数据构建管道提出了一个轨迹级监督框架CAT-GENERATOR,该框架将上下文管理动作注入完整的交互轨迹中。使用该框架,我们训练了一个上下文感知模型SWE-Compressor。在SWE-Bench-Verified上的实验表明,SWE-Compressor达到了57.6%的解决率,并显著优于基于ReAct的代理和静态压缩基线,同时在上下文预算受限的情况下保持了稳定和可扩展的长期推理。
Summary / 总结
The paper addresses the challenge of context management for long-horizon software engineering (SWE) agents based on large language models. It introduces CAT, a new context management paradigm that integrates context maintenance into the decision-making process. CAT enables agents to proactively compress historical interactions into actionable summaries, improving reasoning stability and scalability. Experiments show that the context-aware model SWE-Compressor, trained using a trajectory-level supervision framework, outperforms existing methods on SWE-Bench-Verified, achieving a 57.6% solved rate and maintaining stable long-horizon reasoning within a bounded context budget.
论文针对大型语言模型代理在与代码库长时间交互中出现的上下文膨胀和语义漂移问题,提出了CAT,一种将上下文管理集成到决策过程中的新范式,并提出了基于离线数据构建管道的CAT-GENERATOR框架,用于训练上下文感知模型SWE-Compressor。实验结果显示,SWE-Compressor的解决率为57.6%,并显著优于基于ReAct的代理和静态压缩基线,同时在限定的上下文预算下保持稳定的推理能力。
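The structured context workspace described in the CAT entry above (stable task semantics, condensed long-term memory, high-fidelity short-term interactions) can be sketched as a small data structure with a callable compression step. The class, its field names, the `keep_last` parameter, and the pluggable `summarize` callback are all hypothetical; in the paper's setting the agent itself decides when to invoke compression and produces the summary.

```python
from dataclasses import dataclass, field

@dataclass
class ContextWorkspace:
    task: str                                              # stable task semantics
    long_term_memory: list = field(default_factory=list)   # condensed summaries
    short_term: list = field(default_factory=list)         # recent raw interactions

    def record(self, step):
        """Append one raw interaction step to the short-term buffer."""
        self.short_term.append(step)

    def compress(self, summarize, keep_last=2):
        """Fold older steps into one summary, keeping the last `keep_last` verbatim."""
        cut = max(len(self.short_term) - keep_last, 0)
        to_fold = self.short_term[:cut]
        if not to_fold:
            return None
        summary = summarize(to_fold)
        self.long_term_memory.append(summary)
        self.short_term = self.short_term[cut:]
        return summary

    def render(self):
        """Assemble the prompt context from the three parts."""
        lines = ["TASK: " + self.task]
        lines += ["MEMORY: " + m for m in self.long_term_memory]
        lines += ["RECENT: " + s for s in self.short_term]
        return "\n".join(lines)
```

Exposing `compress` as a tool the agent can call at milestones, rather than triggering it on a token threshold, is the design point the abstract emphasizes.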
Recursive Training Loops in LLMs: How training data properties modulate distribution shift in generated data?
Authors: Grgur Kovač, Jérémy Perez, Rémy Portelas, Peter Ford Dominey, Pierre-Yves Oudeyer
Venue: EMNLP 2025 Oral
First: 2025-04-04T14:41:41+00:00 · Latest: 2025-12-26T17:12:34+00:00
Comments: Accepted to EMNLP 2025 (Oral), Source Code: https://github.com/flowersteam/ce_llms
Abstract
Large language models (LLMs) are increasingly used in the creation of online content, creating feedback loops as subsequent generations of models will be trained on this synthetic data. Such loops were shown to lead to distribution shifts - models misrepresenting the true underlying distributions of human data (also called model collapse). However, how human data properties affect such shifts remains poorly understood. In this paper, we provide the first empirical examination of the effect of such properties on the outcome of recursive training. We first confirm that using different human datasets leads to distribution shifts of different magnitudes. Through exhaustive manipulation of dataset properties combined with regression analyses, we then identify a set of properties predicting distribution shift magnitudes. Lexical diversity is found to amplify these shifts, while semantic diversity and data quality mitigate them. Furthermore, we find that these influences are highly modular: data scraped from a given internet domain has little influence on the content generated for another domain. Finally, experiments on political bias reveal that human data properties affect whether the initial bias will be amplified or reduced. Overall, our results portray a novel view, where different parts of the internet may undergo different types of distribution shift.
中文标题/摘要
标题:LLMs中的递归训练循环:人类数据属性如何影响生成数据的分布偏移?
大型语言模型(LLMs)越来越多地用于在线内容的创作,随着模型的迭代,后续版本将基于这些合成数据进行训练。研究表明,这种循环会导致分布偏移——模型错误地代表了人类数据的真实分布(也称为模型崩溃)。然而,人类数据属性如何影响这种偏移仍然知之甚少。在本文中,我们首次对这些属性对递归训练结果的影响进行了实证研究。我们首先确认使用不同的人类数据集会导致不同幅度的分布偏移。通过详尽地操纵数据集属性并结合回归分析,我们确定了一组预测分布偏移幅度的属性。词汇多样性被发现会放大这些偏移,而语义多样性和数据质量则会减轻它们。此外,我们发现这些影响是高度模块化的:从给定互联网域抓取的数据对另一个域生成的内容几乎没有影响。最后,关于政治偏见的实验揭示了人类数据属性如何影响初始偏见是被放大还是被减轻。总体而言,我们的结果描绘了一种新的视角,即互联网的不同部分可能会经历不同类型的分布偏移。
Summary / 总结
This paper investigates how properties of human training data affect distribution shifts in large language models (LLMs) during recursive training loops. By manipulating dataset properties and using regression analyses, the study identifies that lexical diversity amplifies distribution shifts, while semantic diversity and data quality mitigate them. The research also finds that these influences are modular, with data from a specific internet domain having little impact on content generated for another domain. Additionally, experiments on political bias show that human data properties determine whether initial bias is amplified or reduced.
该研究探讨了人类训练数据的特性如何影响大型语言模型(LLMs)在递归训练过程中产生的分布偏移。通过操控数据集的特性并进行回归分析,研究发现词汇多样性会放大这些偏移,而语义多样性和数据质量则会减轻它们。这些影响是模块化的,一个互联网领域的数据对另一个领域的内容生成几乎没有影响。实验还表明,人类数据的特性会影响初始偏见的放大或减弱。总体而言,研究揭示了递归训练循环中复杂动态的新视角。
Improving Multi-turn Task Completion in Task-Oriented Dialog Systems via Prompt Chaining and Fine-Grained Feedback
Authors: Moghis Fereidouni, Md Sajid Ahmed, Adib Mosharrof, A. B. Siddique
First: 2025-02-18T21:36:19+00:00 · Latest: 2025-12-26T16:51:41+00:00
Comments: 7 pages
Abstract
Task-oriented dialog (TOD) systems facilitate users in accomplishing complex, multi-turn tasks through natural language. While instruction-tuned large language models (LLMs) have demonstrated strong performance on a range of single-turn NLP tasks, they often struggle with reliable multi-turn task completion in TOD settings, particularly when generating API calls required to interact with external systems. To address this, we introduce RealTOD, a novel framework that improves LLM-based TOD systems through (1) prompt chaining and (2) fine-grained feedback. Prompt chaining enables zero-shot generalization to new domains by automatically synthesizing a schema-aligned in-context example for the target task. Fine-grained feedback verifies each generated API call against the domain schema, identifies specific errors, and provides targeted correction prompts. To evaluate task completion reliability, we introduce full API Call Accuracy as a robust metric, along with detailed sub-metrics to capture common failure modes. We conduct extensive experiments on the SGD and BiTOD benchmarks using four LLMs. RealTOD improves Full API accuracy, surpassing state-of-the-art AutoTOD by 37.10% on SGD and supervised learning-based baseline SimpleTOD by 10.32% on BiTOD. Human evaluations further confirm that LLMs integrated with RealTOD achieve superior task completion, fluency, and informativeness compared to existing methods.
中文标题/摘要
标题:通过提示链和细粒度反馈提高面向任务对话系统多轮任务完成能力
面向任务的对话(TOD)系统通过自然语言帮助用户完成复杂的多轮任务。虽然指令调优的大语言模型(LLMs)在多种单轮NLP任务上表现出色,但在TOD场景中可靠地完成多轮任务时,特别是在生成与外部系统交互所需的API调用时,它们常常表现不佳。为了解决这一问题,我们提出了RealTOD,这是一种通过(1)提示链和(2)细粒度反馈来改进基于LLM的TOD系统的新型框架。提示链通过自动合成与目标任务对齐的上下文示例,实现对新领域的零样本泛化。细粒度反馈验证每个生成的API调用是否符合领域模式,识别具体错误并提供针对性的纠正提示。为了评估任务完成的可靠性,我们引入了全API调用准确率作为稳健的度量标准,并提供了详细的子度量来捕捉常见的失败模式。我们在SGD和BiTOD基准上使用四种LLM进行了广泛的实验。RealTOD提高了全API准确率,比AutoTOD在SGD上的表现高出37.10%,比监督学习基线SimpleTOD在BiTOD上的表现高出10.32%。进一步的人类评估表明,与现有方法相比,集成RealTOD的LLM在任务完成、流畅性和信息量方面表现更优。
Summary / 总结
This paper addresses the challenge of reliable multi-turn task completion in task-oriented dialog systems by proposing RealTOD, which uses prompt chaining and fine-grained feedback. Prompt chaining helps LLMs generalize to new domains by synthesizing a schema-aligned in-context example, while fine-grained feedback checks each generated API call against the domain schema and prompts targeted corrections. On the SGD and BiTOD benchmarks, RealTOD improves full API call accuracy, surpassing AutoTOD by 37.10% on SGD and the supervised SimpleTOD baseline by 10.32% on BiTOD.
研究旨在通过解决指令调优的大语言模型(LLMs)在生成准确API调用方面的局限性,来提升任务导向对话系统中的多轮任务完成能力。研究引入了RealTOD框架,该框架通过生成模式对齐的上下文示例来实现零样本泛化,并通过细粒度反馈验证和纠正API调用。在SGD和BiTOD基准上的实验表明,RealTOD显著提高了全API调用准确性,分别在SGD和BiTOD上优于现有方法37.10%和10.32%。
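The fine-grained feedback step described in the abstract above (verify each generated API call against the domain schema, identify specific errors, return a targeted correction prompt) can be sketched roughly as follows. This is only an illustration under stated assumptions, not RealTOD's implementation: the schema layout, the method and parameter names, the helpers `check_api_call` and `correction_prompt`, and the prompt wording are all hypothetical.

```python
# Hypothetical schema layout: method name -> required and optional parameter names.
SCHEMA = {
    "ReserveRestaurant": {
        "required": {"restaurant_name", "time", "date"},
        "optional": {"party_size"},
    }
}


def check_api_call(method, params, schema=SCHEMA):
    """Return a list of specific error messages (empty if the call matches the schema)."""
    if method not in schema:
        return [f"unknown method '{method}'"]
    spec = schema[method]
    given = set(params)
    allowed = spec["required"] | spec["optional"]
    errors = []
    # Report every missing required parameter by name.
    for name in sorted(spec["required"] - given):
        errors.append(f"missing required parameter '{name}'")
    # Report every parameter the schema does not define.
    for name in sorted(given - allowed):
        errors.append(f"unexpected parameter '{name}'")
    return errors


def correction_prompt(method, errors):
    """Turn the error list into a targeted follow-up prompt (empty string if no errors)."""
    if not errors:
        return ""
    return f"Your call to {method} has errors: " + "; ".join(errors) + ". Please regenerate it."
```

In a dialog loop, a non-empty prompt from `correction_prompt` would be sent back to the model and the regenerated call checked again, up to some retry limit chosen by the implementer.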
fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding
Authors: Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun
First: 2025-11-24T20:26:59+00:00 · Latest: 2025-12-26T15:54:52+00:00
Comments: Code is available at https://github.com/yuxiangwei0808/fMRI-LM
Abstract
Recent advances in multimodal large language models (LLMs) have enabled unified reasoning across images, audio, and video, but extending such capability to brain imaging remains largely unexplored. Bridging this gap is essential to link neural activity with semantic cognition and to develop cross-modal brain representations. To this end, we present fMRI-LM, a foundational model that bridges functional MRI (fMRI) and language through a three-stage framework. In Stage 1, we learn a neural tokenizer that maps fMRI into discrete tokens embedded in a language-consistent space. In Stage 2, a pretrained LLM is adapted to jointly model fMRI tokens and text, treating brain activity as a sequence that can be temporally predicted and linguistically described. To overcome the lack of natural fMRI-text pairs, we construct a large descriptive corpus that translates diverse imaging-based features into structured textual descriptors, capturing the low-level organization of fMRI signals. In Stage 3, we perform multi-task, multi-paradigm instruction tuning to endow fMRI-LM with high-level semantic understanding, supporting diverse downstream applications. Across various benchmarks, fMRI-LM achieves strong zero-shot and few-shot performance, and adapts efficiently with parameter-efficient tuning (LoRA), establishing a scalable pathway toward a language-aligned, universal model for structural and semantic understanding of fMRI.
中文标题/摘要
标题:fMRI-LM:迈向语言对齐的fMRI理解通用基础模型
近年来,多模态大型语言模型(LLMs)的发展使图像、音频和视频之间的统一推理成为可能,但将这种能力扩展到脑成像领域仍处于探索阶段。弥合这一差距对于将神经活动与语义认知联系起来以及开发跨模态脑表示至关重要。为此,我们提出了fMRI-LM,这是一种通过三阶段框架将功能性磁共振成像(fMRI)与语言连接起来的基础模型。在第一阶段,我们学习了一个神经分词器,将fMRI映射到嵌入语言一致空间的离散标记中。在第二阶段,我们对预训练的LLM进行调整,使其能够同时建模fMRI标记和文本,将脑活动视为可以进行时间预测和语言描述的序列。为了解决自然fMRI-文本对的缺乏,我们构建了一个大型描述性语料库,将基于成像的各种特征翻译成结构化的文本描述,捕捉fMRI信号的低级组织。在第三阶段,我们进行多任务、多范式指令微调,赋予fMRI-LM高层次的语义理解,支持多种下游应用。在各种基准测试中,fMRI-LM实现了强大的零样本和少样本性能,并通过参数高效微调(LoRA)高效适应,建立了语言对齐的通用模型的可扩展途径,用于结构和语义理解fMRI。
Summary / 总结
fMRI-LM is a foundational model that bridges fMRI and language using a three-stage framework. It first learns a neural tokenizer to map fMRI into discrete tokens, then adapts a pretrained LLM to model fMRI tokens and text, and finally performs multi-task instruction tuning for high-level semantic understanding. fMRI-LM demonstrates strong zero-shot and few-shot performance across various benchmarks and can be efficiently fine-tuned with parameter-efficient methods.
fMRI-LM 是一个通过三阶段框架将 fMRI 与语言连接起来的基础模型。它首先学习一个神经分词器将 fMRI 映射为离散的令牌,然后将预训练的 LLM 调整为同时建模 fMRI 令牌和文本,最后进行多任务指令调优以实现高层次的语义理解。fMRI-LM 在各种基准测试中展示了强大的零样本和少量样本性能,并且可以通过参数高效的方法进行有效微调。
Scaling Adversarial Training via Data Selection
Authors: Youran Ye, Dejin Wang, Ajinkya Bhandare
First: 2025-12-26T15:50:33+00:00 · Latest: 2025-12-26T15:50:33+00:00
Comments: 6 pages. Conference workshop paper
Abstract
Projected Gradient Descent (PGD) is a strong and widely used first-order adversarial attack, yet its computational cost scales poorly, as all training samples undergo identical iterative inner-loop optimization despite contributing unequally to robustness. Motivated by this inefficiency, we propose Selective Adversarial Training, which perturbs only a subset of critical samples in each minibatch. Specifically, we introduce two principled selection criteria: (1) margin-based sampling, which prioritizes samples near the decision boundary, and (2) gradient-matching sampling, which selects samples whose gradients align with the dominant batch optimization direction. Adversarial examples are generated only for the selected subset, while the remaining samples are trained cleanly using a mixed objective. Experiments on MNIST and CIFAR-10 show that the proposed methods achieve robustness comparable to, or even exceeding, full PGD adversarial training, while reducing adversarial computation by up to 50%, demonstrating that informed sample selection is sufficient for scalable adversarial robustness.
中文标题/摘要
标题:通过数据选择扩展对抗训练
投影梯度下降(PGD)是一种强大且广泛使用的对抗攻击方法,尽管所有训练样本在每次迭代中都经历了相同的内循环优化,但它们对鲁棒性贡献不均,导致其计算成本高。受此低效性启发,我们提出了选择性对抗训练,该方法仅在每个小批量中扰动一部分关键样本。具体而言,我们引入了两种原则性的选择标准:(1)基于边界的采样,优先选择靠近决策边界的样本;(2)梯度匹配采样,选择梯度与批量优化方向对齐的样本。仅对选定的子集生成对抗样本,而其余样本则使用混合目标进行干净训练。在MNIST和CIFAR-10上的实验表明,所提出的方法在鲁棒性方面与完整的PGD对抗训练相当,甚至超过后者,同时将对抗计算量减少高达50%,证明了有信息量的样本选择足以实现可扩展的对抗鲁棒性。
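The margin-based selection criterion from the abstract above can be illustrated with a small sketch. The abstract does not define the margin, so this example assumes the common definition (true-class logit minus the largest competing logit) and ranks samples by its absolute value; the function names and the `fraction` parameter are hypothetical.

```python
def margin(logits, label):
    """True-class logit minus the best competing logit (negative if misclassified)."""
    best_other = max(v for i, v in enumerate(logits) if i != label)
    return logits[label] - best_other


def select_near_boundary(batch_logits, labels, fraction=0.5):
    """Return sorted indices of the `fraction` of samples with the smallest |margin|."""
    k = max(1, int(len(labels) * fraction))
    ranked = sorted(
        range(len(labels)),
        key=lambda i: abs(margin(batch_logits[i], labels[i])),
    )
    return sorted(ranked[:k])


# Toy minibatch: samples 1 and 3 sit closest to the decision boundary.
batch_logits = [[2.0, 0.1, 0.0], [0.6, 0.5, 0.0], [0.0, 3.0, 0.2], [0.3, 0.2, 0.4]]
labels = [0, 0, 1, 2]
selected = select_near_boundary(batch_logits, labels, fraction=0.5)
```

Under this scheme only the samples in `selected` would go through the iterative PGD inner loop, while the remaining samples contribute a clean loss term to the mixed objective.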
Periodic Asynchrony: An Effective Method for Accelerating Reinforcement Learning for Large Language Models
Authors: Jian Lu
First: 2025-11-24T08:22:50+00:00 · Latest: 2025-12-26T15:48:38+00:00
Abstract
Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention, with growing efforts to reproduce and apply it. However, training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are typically deployed on the same devices. While this approach reduces costs through resource consolidation, its synchronous execution imposes a computational coupling that prevents concurrent inference and training. In this study, we return to the strategy of deploying inference and training separately. By improving the data loader, we transform the conventional synchronous architecture into a periodically asynchronous framework that allows demand-driven, independent, and elastic scaling of each component. The algorithm's accuracy remains completely equivalent to that of the synchronous method, and both are on-policy. We also apply a unified tri-model architecture in the training phase and propose a shared-prompt attention mask to reduce repetitive computation. In practice, these changes achieve at least a threefold overall performance improvement in RL training on NPU platforms, indicating the approach's potential for widespread application.
中文标题/摘要
标题:周期异步性:一种加速大型语言模型强化学习的有效方法
自GRPO算法问世以来,强化学习(RL)引起了越来越多的关注,人们不断尝试重现和应用它。然而,训练效率仍然是一个关键挑战。在主流的RL框架中,推理和训练通常部署在同一设备上。虽然这种方法通过资源整合降低了成本,但其同步执行方式导致了计算耦合,阻碍了推理和训练的同时进行。在本研究中,我们重新采用了分离推理和训练部署的策略,并通过改进数据加载器,将传统的同步架构转变为周期异步框架,从而实现了需求驱动、独立和弹性扩展每个组件的能力,同时算法的准确性与同步方法完全等价,两者均属于在线策略。值得一提的是,在训练阶段我们应用了一致的三模型架构,并提出了共享提示注意掩码以减少重复计算。在实践中,这些工作在NPU平台上实现了至少三倍的整体训练性能提升,表明其具有广泛的应用潜力。
Summary / 总结
This study addresses the training efficiency challenge in reinforcement learning for large language models by proposing a periodically asynchronous framework. The method separates inference and training deployment and improves the data loader so that each component can scale independently and elastically, while remaining on-policy and equivalent in accuracy to the synchronous method. The key experimental finding is at least a threefold improvement in overall RL training performance on NPU platforms, demonstrating the method's potential for widespread application in this domain.
该研究通过提出一个周期性异步框架来解决大型语言模型在强化学习中的训练效率问题,该方法将推理和训练部署分离,允许每个组件独立和弹性扩展。该方法保持了策略一致性策略的准确性,同时在NPU平台上实现了至少三倍的整体性能提升。还引入了一种统一的三模型架构和共享提示注意掩码以减少重复计算。
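The abstract above mentions a shared-prompt attention mask but does not describe its construction. One plausible reading, sketched here purely as an assumption, is that a prompt is packed once in front of several sampled responses, and each response token attends causally to the prompt and to its own response but never to sibling responses. The function name and the packed layout are hypothetical.

```python
def shared_prompt_mask(prompt_len, response_lens):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumed packed layout: [prompt | response_0 | response_1 | ...].
    """
    total = prompt_len + sum(response_lens)
    # Segment id per token: -1 for prompt tokens, r for tokens of response r.
    seg = [-1] * prompt_len
    for r, n in enumerate(response_lens):
        seg += [r] * n
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only the current and earlier positions
            # Visible if j is a prompt token or belongs to the same response as i.
            if seg[j] == -1 or seg[j] == seg[i]:
                mask[i][j] = True
    return mask


# A 2-token prompt shared by two responses of lengths 2 and 1.
m = shared_prompt_mask(2, [2, 1])
```

With such a layout the prompt's keys and values are computed once per group rather than once per response, which is one way repetitive computation could be reduced. A real implementation would also have to handle position ids for each response, which this sketch leaves out.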
Prefill vs. Decode Bottlenecks: SRAM-Frequency Tradeoffs and the Memory-Bandwidth Ceiling
Authors: Hannah Atmer, Yuan Yao, Thiemo Voigt, Stefanos Kaxiras
First: 2025-12-26T15:42:29+00:00 · Latest: 2025-12-26T15:42:29+00:00
Abstract
Energy consumption dictates the cost and environmental impact of deploying Large Language Models. This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of LLM inference, focusing on the distinct behaviors of the compute-bound prefill and memory-bound decode phases. Our simulation methodology combines OpenRAM for energy modeling, LLMCompass for latency simulation, and ScaleSIM for systolic array operational intensity. Our findings show that total energy use is predominantly determined by SRAM size in both phases, with larger buffers significantly increasing static energy due to leakage, which is not offset by corresponding latency benefits. We quantitatively explore the memory-bandwidth bottleneck, demonstrating that while high operating frequencies reduce prefill latency, their positive impact on memory-bound decode latency is capped by the external memory bandwidth. Counter-intuitively, high compute frequency can reduce total energy by reducing execution time and consequently decreasing static energy consumption more than the resulting dynamic power increase. We identify an optimal hardware configuration for the simulated workload: high operating frequencies (1200MHz-1400MHz) and a small local buffer size of 32KB to 64KB. This combination achieves the best energy-delay product, balancing low latency with high energy efficiency. Furthermore, we demonstrate how memory bandwidth acts as a performance ceiling, and that increasing compute frequency only yields performance gains up to the point where the workload becomes memory-bound. This analysis provides concrete architectural insights for designing energy-efficient LLM accelerators, especially for datacenters aiming to minimize their energy overhead.
中文标题/摘要
标题:预填充 vs. 解码瓶颈:SRAM-频率权衡与内存带宽上限
能耗决定了部署大型语言模型的成本和环境影响。本文研究了片上SRAM大小和工作频率对LLM推理的能量效率和性能的影响,重点关注计算受限的预填充阶段和内存受限的解码阶段的差异行为。我们的仿真方法结合了OpenRAM进行能量建模、LLMCompass进行延迟仿真和ScaleSIM进行阵列操作强度仿真。我们的研究结果表明,总能耗主要由两个阶段的SRAM大小决定,较大的缓冲区显著增加了由于泄漏导致的静态能耗,而这种增加并未因相应的延迟减少而得到补偿。我们定量探讨了内存带宽瓶颈,表明虽然高工作频率可以减少预填充延迟,但其对内存受限解码延迟的积极影响受到外部内存带宽的限制。出乎意料的是,高计算频率可以通过减少执行时间从而降低静态能耗,进而减少动态功率增加,从而降低总能耗。我们确定了模拟工作负载的最佳硬件配置:高工作频率(1200MHz-1400MHz)和较小的本地缓冲区大小(32KB到64KB)。这种组合实现了最佳的能量延迟积,平衡了低延迟与高能量效率。此外,我们展示了内存带宽作为性能天花板的作用,并表明增加计算频率只能在工作负载变为内存受限之前提供性能增益。此分析为设计节能LLM加速器提供了具体的架构见解,特别是对于旨在最小化其能耗的数据中心而言。
Summary / 总结
This paper investigates the impact of on-chip SRAM size and operating frequency on the energy efficiency and performance of Large Language Model inference. The study uses a combination of OpenRAM, LLMCompass, and ScaleSIM for simulation. Key findings include that total energy use is largely determined by SRAM size, with larger buffers increasing static energy due to leakage. High operating frequencies reduce prefill latency but have a limited impact on memory-bound decode latency due to external memory bandwidth constraints. Surprisingly, high compute frequency can reduce total energy by decreasing static energy consumption more than dynamic power increases. The optimal configuration is found to be high operating frequencies (1200MHz-1400MHz) and a small local buffer size of 32KB to 64KB, which balances low latency with high energy efficiency.
该研究探讨了片上SRAM大小和操作频率对大型语言模型推理的能效和性能的影响。研究使用OpenRAM、LLMCompass和ScaleSIM进行仿真。主要发现包括总能耗主要由SRAM大小决定,较大的缓冲区会因泄漏增加静态能耗。高操作频率可以减少预填充延迟,但对外部内存带宽的限制使得其对内存受限的解码延迟影响有限。令人意外的是,高计算频率可以通过减少静态能耗来降低总能耗,超过动态功率增加的影响。最优配置为高操作频率(1200MHz-1400MHz)和小本地缓冲区大小(32KB到64KB),这可以平衡低延迟和高能效。
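The memory-bandwidth ceiling and the frequency versus static-energy tradeoff described above can be illustrated with a toy roofline-style model. This is not the paper's simulation methodology (which combines OpenRAM, LLMCompass, and ScaleSIM); the functions and every number below are hypothetical, chosen only to show the qualitative behavior.

```python
def latency_s(flops, bytes_moved, freq_hz, flops_per_cycle, mem_bw_bytes_s):
    """Roofline-style latency: the slower of compute time and memory transfer time."""
    compute_time = flops / (freq_hz * flops_per_cycle)
    memory_time = bytes_moved / mem_bw_bytes_s
    return max(compute_time, memory_time)


def energy_j(latency, static_power_w, dynamic_energy_j):
    """Total energy = leakage accumulated over the run time + dynamic (switching) energy."""
    return static_power_w * latency + dynamic_energy_j


# Hypothetical accelerator: 1000 FLOPs per cycle, 100 GB/s external memory bandwidth.
FPC, BW = 1000, 1e11

# Compute-bound "prefill"-like workload: doubling the frequency halves the latency.
prefill_slow = latency_s(1e12, 1e9, 700e6, FPC, BW)
prefill_fast = latency_s(1e12, 1e9, 1400e6, FPC, BW)

# Memory-bound "decode"-like workload: latency is pinned by bandwidth at both frequencies.
decode_slow = latency_s(1e9, 1e10, 700e6, FPC, BW)
decode_fast = latency_s(1e9, 1e10, 1400e6, FPC, BW)
```

In this toy model, a shorter run time lowers the `static_power_w * latency` term, so a higher frequency reduces total energy whenever that saving exceeds any increase in `dynamic_energy_j`, which mirrors the counter-intuitive result reported in the abstract.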
Real-Time Streamable Generative Speech Restoration with Flow Matching
Authors: Simon Welker, Bunlong Lay, Maris Hillemann, Tal Peer, Timo Gerkmann
First: 2025-12-22T14:41:17+00:00 · Latest: 2025-12-26T15:39:59+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Diffusion-based generative models have greatly impacted the speech processing field in recent years, exhibiting high speech naturalness and spawning a new research direction. Their application in real-time communication is, however, still lagging behind due to their computation-heavy nature involving multiple calls of large DNNs. Here, we present Stream$.$FM, a frame-causal flow-based generative model with an algorithmic latency of 32 milliseconds (ms) and a total latency of 48 ms, paving the way for generative speech processing in real-time communication. We propose a buffered streaming inference scheme and an optimized DNN architecture, show how learned few-step numerical solvers can boost output quality at a fixed compute budget, explore model weight compression to find favorable points along a compute/quality tradeoff, and contribute a model variant with 24 ms total latency for the speech enhancement task. Our work looks beyond theoretical latencies, showing that high-quality streaming generative speech processing can be realized on consumer GPUs available today. Stream$.$FM can solve a variety of speech processing tasks in a streaming fashion: speech enhancement, dereverberation, codec post-filtering, bandwidth extension, STFT phase retrieval, and Mel vocoding. As we verify through comprehensive evaluations and a MUSHRA listening test, Stream$.$FM establishes a state-of-the-art for generative streaming speech restoration, exhibits only a reasonable reduction in quality compared to a non-streaming variant, and outperforms our recent work (Diffusion Buffer) on generative streaming speech enhancement while operating at a lower latency.
中文标题/摘要
标题:实时流式生成语音恢复与流匹配
基于扩散的生成模型近年来在语音处理领域产生了重大影响,表现出高度的语音自然度,并开辟了新的研究方向。然而,由于其计算密集型特性,涉及多次调用大型DNN,它们在实时通信中的应用仍然滞后。 在这里,我们提出了Stream$.$FM,这是一种具有32毫秒(ms)算法延迟和48 ms总延迟的基于流的生成模型,为实时通信中的生成语音处理铺平了道路。我们提出了一种缓冲流式推理方案和优化的DNN架构,展示了如何在固定计算预算下通过学习的几步数值求解器提升输出质量,探索了模型权重压缩以在计算/质量权衡中找到有利点,并贡献了一个总延迟为24 ms的模型变体用于语音增强任务。 本研究超越了理论延迟,展示了当前可用的消费级GPU上可以实现高质量的流式生成语音处理。Stream$.$FM 可以以流式方式解决各种语音处理任务:语音增强、去混响、编解码后滤波、带宽扩展、STFT相位恢复和梅尔声码。通过全面评估和MUSHRA听觉测试,Stream$.$FM 在生成流式语音恢复方面达到了最先进的水平,与非流式变体相比仅表现出合理的质量下降,并在生成流式语音增强方面优于我们最近的工作(扩散缓冲),同时具有更低的延迟。
Summary / 总结
The research aims to make diffusion-based generative models practical for real-time speech processing by reducing latency. Stream$.$FM, a frame-causal flow-based generative model, achieves an algorithmic latency of 32 milliseconds and a total latency of 48 milliseconds, with a 24 ms variant for speech enhancement. The model uses a buffered streaming inference scheme and an optimized DNN architecture, and handles tasks such as enhancement, dereverberation, and bandwidth extension in a streaming fashion, showing only a reasonable quality reduction relative to a non-streaming variant and outperforming the authors' previous work (Diffusion Buffer) on generative streaming speech enhancement at lower latency.
研究旨在解决在实时语音处理中应用基于扩散的生成模型的计算挑战。Stream$.$FM 是一种具有 48 毫秒总延迟的帧因果流基模型,通过缓冲流式推理方案和优化的 DNN 架构实现。关键发现包括使用学习的数值求解器提高输出质量、通过模型权重压缩实现计算/质量权衡,并且有一个 24 毫秒延迟的变体用于语音增强。综合评估和 MUSHRA 测试证实,Stream$.$FM 在生成流式语音恢复方面达到了最先进的性能,与非流式变体相比,只有适度的质量降低。
Toward Secure and Compliant AI: Organizational Standards and Protocols for NLP Model Lifecycle Management
Authors: Sunil Arora, John Hastings
First: 2025-12-26T15:28:20+00:00 · Latest: 2025-12-26T15:28:20+00:00
Comments: 9 pages, 2 tables, 1 figure
Abstract
Natural Language Processing (NLP) systems are increasingly used in sensitive domains such as healthcare, finance, and government, where they handle large volumes of personal and regulated data. However, these systems introduce distinct risks related to security, privacy, and regulatory compliance that are not fully addressed by existing AI governance frameworks. This paper introduces the Secure and Compliant NLP Lifecycle Management Framework (SC-NLP-LMF), a comprehensive six-phase model designed to ensure the secure operation of NLP systems from development to retirement. The framework, developed through a systematic PRISMA-based review of 45 peer-reviewed and regulatory sources, aligns with leading standards, including NIST AI RMF, ISO/IEC 42001:2023, the EU AI Act, and MITRE ATLAS. It integrates established methods for bias detection, privacy protection (differential privacy, federated learning), secure deployment, explainability, and secure model decommissioning. A healthcare case study illustrates how SC-NLP-LMF detects emerging terminology drift (e.g., COVID-related language) and guides compliant model updates. The framework offers organizations a practical, lifecycle-wide structure for developing, deploying, and maintaining secure and accountable NLP systems in high-risk environments.
中文标题/摘要
标题:向安全合规的人工智能迈进:NLP模型生命周期管理的组织标准与协议
自然语言处理(NLP)系统在医疗保健、金融和政府等敏感领域中越来越广泛地使用,处理大量个人和受监管的数据。然而,这些系统引入了与安全、隐私和监管合规相关的独特风险,现有的AI治理框架并未充分解决这些问题。本文介绍了安全合规NLP生命周期管理框架(SC-NLP-LMF),这是一种全面的六阶段模型,旨在确保从开发到退役的NLP系统的安全运行。该框架通过系统性的PRISMA为基础的45篇同行评审和监管来源的回顾开发而成,与NIST AI RMF、ISO/IEC 42001:2023、欧盟AI法案和MITRE ATLAS等领先标准保持一致。该框架整合了现有的偏见检测、隐私保护(差分隐私、联邦学习)、安全部署、可解释性和安全模型退役等方法。一个医疗保健案例研究展示了SC-NLP-LMF如何检测新兴术语漂移(例如,与COVID相关的语言),并指导合规模型更新。该框架为组织提供了一种实用的、生命周期范围内的结构,用于在高风险环境中开发、部署和维护安全和可问责的NLP系统。
Summary / 总结
This paper addresses the security, privacy, and regulatory compliance risks associated with NLP systems in sensitive domains. It introduces the Secure and Compliant NLP Lifecycle Management Framework (SC-NLP-LMF), a six-phase model based on a systematic review of 45 sources. The framework integrates methods for bias detection, privacy protection, secure deployment, explainability, and secure decommissioning. A healthcare case study demonstrates how SC-NLP-LMF detects and guides updates for emerging terminology drift, ensuring compliance and security throughout the NLP system lifecycle.
本文针对在敏感领域使用的NLP系统所面临的安全、隐私和合规风险,提出了Secure and Compliant NLP Lifecycle Management Framework (SC-NLP-LMF)六阶段模型。该框架通过PRISMA方法审查了45个来源后开发而成,集成了偏见检测、隐私保护、安全部署、可解释性和安全弃用的方法。一个医疗案例研究展示了其在检测和指导合规更新新兴术语漂移方面的有效性。
MAI-UI Technical Report: Real-World Centric Foundation GUI Agents
Authors: Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi
First: 2025-12-26T14:51:52+00:00 · Latest: 2025-12-26T14:51:52+00:00
Abstract
The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B, and 235B-A22B variants. We identify four key challenges to realistic deployment: the lack of native agent-user interaction, the limits of UI-only operation, the absence of a practical deployment architecture, and brittleness in dynamic environments. MAI-UI addresses these issues with a unified methodology: a self-evolving data pipeline that expands the navigation data to include user interaction and MCP tool calls, a native device-cloud collaboration system that routes execution by task state, and an online RL framework with advanced optimizations to scale parallel environments and context length. MAI-UI establishes a new state of the art across GUI grounding and mobile navigation. On grounding benchmarks, it reaches 73.5% on ScreenSpot-Pro, 91.3% on MMBench GUI L2, 70.9% on OSWorld-G, and 49.2% on UI-Vision, surpassing Gemini-3-Pro and Seed1.8 on ScreenSpot-Pro. On mobile GUI navigation, it sets a new SOTA of 76.7% on AndroidWorld, surpassing UI-Tars-2, Gemini-2.5-Pro and Seed1.8. On MobileWorld, MAI-UI obtains a 41.7% success rate, significantly outperforming end-to-end GUI models and competitive with Gemini-3-Pro based agentic frameworks. Our online RL experiments show significant gains from scaling parallel environments from 32 to 512 (+5.2 points) and increasing the environment step budget from 15 to 50 (+4.3 points). Finally, the native device-cloud collaboration system improves on-device performance by 33%, reduces cloud model calls by over 40%, and preserves user privacy.
中文标题/摘要
标题:MAI-UI 技术报告:以现实为中心的 GUI 基础智能代理
GUI 代理的发展有可能革新下一代人机交互。受此愿景的驱动,我们提出了 MAI-UI,这是一个覆盖从 2B 到 235B-A22B 的全尺寸基础 GUI 代理家族。我们确定了现实部署中的四个关键挑战:缺乏原生代理-用户交互、UI 仅操作的限制、缺乏实用部署架构以及动态环境中的脆弱性。MAI-UI 通过统一的方法解决了这些问题:一个自进化数据管道,扩展导航数据以包括用户交互和 MCP 工具调用,一个原生设备-云协作系统根据任务状态路由执行,以及一个具有高级优化的在线 RL 框架,以扩展并行环境和上下文长度。MAI-UI 在 GUI 地基和移动导航方面建立了新的最先进的技术。在地基基准测试中,它在 ScreenSpot-Pro 达到 73.5%,在 MMBench GUI L2 达到 91.3%,在 OSWorld-G 达到 70.9%,在 UI-Vision 达到 49.2%,超过了 Gemini-3-Pro 和 Seed1.8 在 ScreenSpot-Pro 上的表现。在移动 GUI 导航方面,它在 AndroidWorld 达到了新的 SOTA 76.7%,超过了 UI-Tars-2、Gemini-2.5-Pro 和 Seed1.8。在 MobileWorld,MAI-UI 获得了 41.7% 的成功率,显著优于端到端 GUI 模型,并与基于代理框架的 Gemini-3-Pro 相当。我们的在线 RL 实验表明,从 32 扩展到 512 的并行环境规模提高了 5.2 个百分点,环境步长预算从 15 增加到 50 提高了 4.3 个百分点。最后,原生设备-云协作系统提高了设备端性能 33%,减少了超过 40% 的云模型调用,并保留了用户隐私。
Summary / 总结
The paper presents MAI-UI, a family of foundation GUI agents addressing key challenges for realistic deployment, such as native agent-user interaction and dynamic environment brittleness. It uses a unified methodology involving a self-evolving data pipeline, a native device-cloud collaboration system, and an online RL framework. MAI-UI achieves new state-of-the-art results on GUI grounding and mobile navigation benchmarks, surpassing previous models and setting new SOTA on AndroidWorld and ScreenSpot-Pro. Online RL experiments show improvements from scaling parallel environments and increasing environment step budgets. The native device-cloud collaboration system enhances on-device performance and reduces cloud model calls while preserving user privacy.
研究旨在通过GUI代理增强人机交互。MAI-UI通过集成用户交互、云协作和高级RL优化来应对关键挑战。它在GUI定位和移动导航方面取得了最先进的成果,超越了之前的模型,如ScreenSpot-Pro和AndroidWorld的基准测试。在线RL实验显示,通过扩大环境规模和增加步骤预算,可以取得显著改进。本地设备-云协作系统提升了设备端的性能,减少了云模型调用次数,并保护了用户隐私。
Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
Authors: Zongmin Zhang, Zhen Sun, Yifan Liao, Wenhan Dong, Xinlei He, Xingshuo Han, Shengmin Xu, Xinyi Huang
First: 2025-12-26T14:48:58+00:00 · Latest: 2025-12-26T14:48:58+00:00
Abstract
Prompt-driven Video Segmentation Foundation Models (VSFMs) such as SAM2 are increasingly deployed in applications like autonomous driving and digital pathology, raising concerns about backdoor threats. Surprisingly, we find that directly transferring classic backdoor attacks (e.g., BadNet) to VSFMs is almost ineffective, with ASR below 5%. To understand this, we study encoder gradients and attention maps and observe that conventional training keeps gradients for clean and triggered samples largely aligned, while attention still focuses on the true object, preventing the encoder from learning a distinct trigger-related representation. To address this challenge, we propose BadVSFM, the first backdoor framework tailored to prompt-driven VSFMs. BadVSFM uses a two-stage strategy: (1) steer the image encoder so triggered frames map to a designated target embedding while clean frames remain aligned with a clean reference encoder; (2) train the mask decoder so that, across prompt types, triggered frame-prompt pairs produce a shared target mask, while clean outputs stay close to a reference decoder. Extensive experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects under diverse triggers and prompts while preserving clean segmentation quality. Ablations over losses, stages, targets, trigger settings, and poisoning rates demonstrate robustness to reasonable hyperparameter changes and confirm the necessity of the two-stage design. Finally, gradient-conflict analysis and attention visualizations show that BadVSFM separates triggered and clean representations and shifts attention to trigger regions, while four representative defenses remain largely ineffective, revealing an underexplored vulnerability in current VSFMs.
中文标题/摘要
标题:面向提示驱动视频分割基础模型的后门攻击
提示驱动视频分割基础模型(VSFMs)如SAM2在自动驾驶和数字病理学等应用中越来越广泛部署,引发了后门威胁的担忧。令人惊讶的是,我们发现直接将经典后门攻击(如BadNet)转移到VSFMs几乎无效,ASR低于5%。为了理解这一现象,我们研究了编码器梯度和注意力图,并观察到常规训练保持干净样本和触发样本的梯度几乎对齐,同时注意力仍然集中在真实目标上,防止编码器学习与触发相关的独特表示。为应对这一挑战,我们提出了BadVSFM,这是第一个针对提示驱动VSFMs的后门框架。BadVSFM采用两阶段策略:(1)引导图像编码器,使触发帧映射到指定的目标嵌入,同时保持干净帧与干净参考编码器对齐;(2)训练掩码解码器,使不同提示类型下的触发帧-提示对生成共享的目标掩码,而干净输出保持接近参考解码器。在两个数据集和五种VSFMs上的广泛实验表明,BadVSFM在多种触发和提示下实现了强大的可控后门效果,同时保持了干净分割的质量。损失、阶段、目标、触发设置和污染率的消融实验表明,该框架对合理的超参数变化具有鲁棒性,并证实了两阶段设计的必要性。最后,梯度冲突分析和注意力可视化表明,BadVSFM将触发和干净表示分离,并将注意力转移到触发区域,而四种代表性防御措施基本无效,揭示了当前VSFMs中未被充分探索的漏洞。
Summary / 总结
The paper addresses the vulnerability of Prompt-driven Video Segmentation Foundation Models (VSFMs) to backdoor attacks by proposing BadVSFM, a two-stage backdoor framework. Motivated by the increasing deployment of VSFMs in critical applications, the study finds that directly transferring classic backdoor attacks is almost ineffective, with ASR below 5%. BadVSFM steers the image encoder to map triggered frames to a target embedding while keeping clean frames aligned with a clean reference encoder, and trains the mask decoder to produce a shared target mask for triggered frames. Experiments on two datasets and five VSFMs show that BadVSFM achieves strong, controllable backdoor effects while maintaining clean segmentation quality. Ablations show robustness to reasonable hyperparameter changes and confirm the necessity of the two-stage design, while four representative defenses remain largely ineffective.
本文探讨了Prompt驱动的视频分割基础模型(VSFMs)如SAM2在关键应用如自动驾驶和数字病理中的后门攻击漏洞。作者发现传统的后门攻击在VSFMs上无效。他们提出了BadVSFM,这是一种两阶段框架,通过引导图像编码器和训练掩码解码器来实现强可控的后门效果,同时保持干净的分割质量。实验表明,BadVSFM在不同触发器和提示下有效工作,并且对超参数变化具有鲁棒性。
MAD: Multi-Alignment MEG-to-Text Decoding
Authors: Yiqian Yang, Hyejeong Jo, Yiqun Duan, Qiang Zhang, Jinni Zhou, Xuming Hu, Won Hee Lee, Renjing Xu, Hui Xiong
First: 2024-06-03T16:43:10+00:00 · Latest: 2025-12-26T14:41:38+00:00
Abstract
Deciphering language from brain activity is a crucial task in brain-computer interface (BCI) research. Non-invasive cerebral signaling techniques including electroencephalography (EEG) and magnetoencephalography (MEG) are becoming increasingly popular due to their safety and practicality, avoiding invasive electrode implantation. However, current work has under-investigated three points: 1) a predominant focus on EEG with limited exploration of MEG, which provides superior signal quality; 2) poor performance on unseen text, indicating the need for models that can better generalize to diverse linguistic contexts; 3) insufficient integration of information from other modalities, which could potentially constrain our capacity to comprehensively understand the intricate dynamics of brain activity. This study presents a novel approach for translating MEG signals into text using a speech-decoding framework with multiple alignments. Our method is the first to introduce an end-to-end multi-alignment framework for totally unseen text generation directly from MEG signals. On the GWilliams dataset, we improve the BLEU-1 score from the baseline's 5.49 to 6.86. This improvement demonstrates the advancement of our model towards real-world applications and underscores its potential in advancing BCI research. Code is available at https://github.com/NeuSpeech/MAD-MEG2text.
Summary / 总结
This study addresses the challenge of translating MEG signals into text, focusing on improving the performance on unseen text and integrating information from other modalities. The proposed method, MAD, introduces an end-to-end multi-alignment framework for generating text directly from MEG signals, achieving a BLEU-1 score of 6.86, which is a significant improvement over the baseline score of 5.49 on the GWilliams dataset, highlighting its potential in advancing BCI research.
该研究通过引入MAD,一种新的多对齐框架,将MEG信号转换为文本。该方法在GWilliams数据集上将BLEU-1分数从5.49提高到6.86,展示了更好的对未见过文本的泛化能力和在实际应用中的潜力。
AI Urban Scientist: Multi-Agent Collaborative Automation for Urban Research
Authors: Tong Xia, Jiankun Zhang, Ruiwen You, Ao Xu, Linghao Zhang, Tengyao Tu, Jingzhi Wang, Jinghua Piao, Yunke Zhang, Fengli Xu, Yong Li
First: 2025-11-26T01:17:35+00:00 · Latest: 2025-12-26T14:38:22+00:00
Abstract
Urban research aims to understand how cities operate and evolve as complex adaptive systems. With the rapid growth of urban data and analytical methodologies, the central challenge of the field has shifted from data availability to the integration of heterogeneous data into coherent, verifiable urban knowledge through multidisciplinary approaches. Recent advances in AI, particularly the emergence of large language models (LLMs), have enabled the development of AI scientists capable of autonomous reasoning, hypothesis generation, and data-driven experimentation, demonstrating substantial potential for autonomous urban research. However, most general-purpose AI systems remain misaligned with the domain-specific knowledge, methodological conventions, and inferential standards required in urban studies. Here, we introduce the AI Urban Scientist, a knowledge-driven multi-agent framework designed to support autonomous urban research. Grounded in hypotheses, peer-review feedback, datasets, and research methodologies distilled from large-scale prior studies, the system constructs structured domain knowledge that guides LLM-based agents to automatically generate hypotheses, identify and integrate multi-source urban datasets, conduct empirical analyses and simulations, and iteratively refine analytical methods. Through this process, the framework synthesizes new insights in urban science and accelerates the urban research lifecycle.
中文标题/摘要
标题:AI城市科学家:多智能体协作自动化城市研究
城市研究旨在理解城市作为复杂自适应系统的运作和演变。随着城市数据和分析方法的迅速增长,该领域的核心挑战已从数据可用性转变为通过多学科方法将异构数据整合为一致且可验证的城市知识。近年来,特别是大型语言模型(LLMs)的出现,使能够自主推理、假设生成和数据驱动实验的AI科学家得以发展,显示出在自主城市研究方面的巨大潜力。然而,大多数通用AI系统仍与城市研究所需的领域特定知识、方法论惯例和推理标准不一致。在此,我们介绍了AI城市科学家,这是一种基于知识的多智能体框架,旨在支持自主城市研究。该系统基于从大规模前期研究中提炼出的假设、同行评审反馈、数据集和研究方法,构建结构化的领域知识,指导基于LLM的智能体自动生成假设、识别和整合多源城市数据集、进行实证分析和模拟,并迭代优化分析方法。通过这一过程,该框架在城市科学中综合了新的见解,并加速了城市研究生命周期。
Summary / 总结
The AI Urban Scientist is a knowledge-driven multi-agent framework designed to support autonomous urban research. Motivated by the need to integrate heterogeneous urban data into coherent, verifiable knowledge, the system uses large language models to generate hypotheses, identify and integrate datasets, and conduct empirical analyses and simulations. Structured domain knowledge distilled from prior studies guides the agents, which iteratively refine their analytical methods and thereby accelerate the urban research lifecycle.
AI城市科学家是一个基于知识的多智能体框架,旨在支持自主的城市研究。该系统针对将异构城市数据整合为一致且可验证知识的需求,利用大型语言模型生成假设、识别并整合数据集、进行实证分析和模拟。从前期研究中提炼的结构化领域知识引导各智能体迭代改进分析方法,从而加速城市研究的生命周期。
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive and MCP-Augmented Environments
Authors: Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, Yue Wang
First: 2025-12-22T14:31:28+00:00 · Latest: 2025-12-26T14:36:50+00:00
Abstract
Among existing online mobile-use benchmarks, AndroidWorld has emerged as the dominant benchmark due to its reproducible environment and deterministic evaluation; however, recent agents achieving over 90% success rates indicate its saturation and motivate the need for a more challenging benchmark. In addition, its environment lacks key application categories, such as e-commerce and enterprise communication, and does not reflect realistic mobile-use scenarios characterized by vague user instructions and hybrid tool usage. We introduce MobileWorld, a substantially more challenging benchmark designed to reflect real-world usage through 201 tasks across 20 applications. MobileWorld derives its difficulty from an emphasis on long-horizon, cross-application workflows, requiring nearly twice as many completion steps on average (27.8 vs. 14.3) and featuring a significantly higher proportion of multi-app tasks (62.2% vs. 9.5%) than AndroidWorld. To overcome the limitations of existing environments, MobileWorld achieves a balance between production-grade utility and reproducible evaluation by utilizing open-source alternatives to industry standards (e.g., Mattermost for Slack). This approach enables a fully observable and controlled environment through source code modification and direct backend database access for precise verification. MobileWorld also introduces novel task categories, including agent-user interaction and Model Context Protocol (MCP)-augmented tasks, for evaluating agents in user-aware, hybrid-tool scenarios. To facilitate evaluation, we develop a planner-executor agentic framework with extended action spaces to support user interactions and MCP calls. Our results reveal a sharp performance drop compared to AndroidWorld, with the best agentic framework and end-to-end model achieving 51.7% and 20.9% success rates, respectively, highlighting ample headroom for future research.
中文标题/摘要
标题:移动世界:在代理-用户互动和MCP增强环境中基准测试自主移动代理
在现有的在线移动使用基准中,由于其可重复的环境和确定性的评估,AndroidWorld已成为主导基准;然而,最近实现超过90%成功率的代理表明其饱和度,并促使需要更具挑战性的基准。此外,其环境缺乏诸如电子商务和企业通信等关键应用类别,也不反映由模糊用户指令和混合工具使用特征的现实移动使用场景。我们引入了MobileWorld,这是一个更具挑战性的基准,旨在通过20个应用程序中的201个任务来反映真实世界的使用情况。MobileWorld的难度在于强调跨应用的长期工作流程,平均需要完成的步骤几乎是AndroidWorld的两倍(27.8 vs. 14.3),并且多应用任务的比例也远高于AndroidWorld(62.2% vs. 9.5%)。为了克服现有环境的局限性,MobileWorld通过利用行业标准的开源替代品(例如,使用Mattermost代替Slack)实现了生产级实用性和可重复评估之间的平衡。这种方法通过源代码修改和直接访问后端数据库实现了一个完全可观察和可控的环境,以进行精确验证。MobileWorld还引入了新的任务类别,包括代理-用户交互和模型上下文协议(MCP)增强任务,以评估代理在用户感知和混合工具场景中的表现。为了便于评估,我们开发了一个扩展动作空间的规划执行代理框架,以支持用户交互和MCP调用。我们的结果显示与AndroidWorld相比,性能急剧下降,最佳代理框架和端到端模型的成功率分别为51.7%和20.9%,这表明未来研究有巨大的改进空间。
Summary / 总结
MobileWorld is a new benchmark designed to challenge autonomous mobile agents by incorporating real-world usage scenarios with 201 tasks across 20 applications, emphasizing long-horizon workflows and multi-app tasks. It uses open-source alternatives to industry standards for a fully observable environment and introduces novel task categories. Experimental results show a significant performance drop compared to AndroidWorld, with success rates of 51.7% and 20.9% for the best agentic framework and end-to-end model, respectively.
MobileWorld 是一个新的基准,旨在通过包含 20 个应用中的 201 个任务来挑战自主移动代理,这些任务反映了真实世界的使用场景。它强调长周期、跨应用的工作流程,每个任务的平均步骤数为 27.8,而 AndroidWorld 中为 14.3。MobileWorld 还引入了新的任务类别,如代理-用户交互和 Model Context Protocol (MCP) 增强任务,以评估代理在混合工具场景中的表现。评估结果显示,最佳代理框架和端到端模型的成功率分别为 51.7% 和 20.9%。
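The abstract mentions verifying task outcomes through direct backend database access. A minimal sketch of that idea follows, assuming a purely hypothetical `posts` table in an in-memory SQLite database; the real backends and schemas used by MobileWorld (for example Mattermost's) are not described in the source and are not reproduced here.

```python
import sqlite3


def make_demo_backend():
    """Create an in-memory database with a hypothetical message table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (channel TEXT, author TEXT, message TEXT)")
    return conn


def task_succeeded(conn, channel, author, expected_text):
    """Return True if the backend holds a matching post.

    The check reads the application state directly instead of inspecting
    the screen, which is what makes the verdict deterministic.
    """
    row = conn.execute(
        "SELECT COUNT(*) FROM posts "
        "WHERE channel = ? AND author = ? AND message LIKE ?",
        (channel, author, f"%{expected_text}%"),
    ).fetchone()
    return row[0] > 0
```

A checker of this shape would run after the agent finishes an episode, for instance to confirm that a requested message really reached the intended channel.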
Unveiling the Learning Mind of Language Models: A Cognitive Framework and Empirical Study
Authors: Zhengyu Hu, Jianxun Lian, Zheyuan Xiao, Seraphina Zhang, Tianfu Wang, Nicholas Jing Yuan, Xing Xie, Hui Xiong
First: 2025-06-16T13:24:50+00:00 · Latest: 2025-12-26T14:33:32+00:00
Abstract
Large language models (LLMs) have shown impressive capabilities across tasks such as mathematics, coding, and reasoning, yet their learning ability, which is crucial for adapting to dynamic environments and acquiring new knowledge, remains underexplored. In this work, we address this gap by introducing a framework inspired by cognitive psychology and education. Specifically, we decompose general learning ability into three distinct, complementary dimensions: Learning from Instructor (acquiring knowledge via explicit guidance), Learning from Concept (internalizing abstract structures and generalizing to new contexts), and Learning from Experience (adapting through accumulated exploration and feedback). We conduct a comprehensive empirical study across the three learning dimensions and identify several insightful findings, such as (i) interaction improves learning; (ii) conceptual understanding is scale-emergent and benefits larger models; and (iii) LLMs are effective few-shot learners but not many-shot learners. Based on our framework and empirical findings, we introduce a benchmark that provides a unified and realistic evaluation of LLMs' general learning abilities across three learning cognition dimensions. It enables diagnostic insights and supports evaluation and development of more adaptive and human-like models.
中文标题/摘要
标题:揭开语言模型的学习心智:认知框架与实证研究
大型语言模型(LLMs)在数学、编程和推理等任务上展现了令人印象深刻的性能,然而它们的学习能力——这对于适应动态环境和获取新知识至关重要——仍然未被充分探索。在本研究中,我们通过引入受认知心理学和教育启发的框架来填补这一空白。具体而言,我们将一般的学习能力分解为三个相互补充的维度:从导师学习(通过明确指导获取知识)、从概念学习(内化抽象结构并在新情境中泛化)和从经验学习(通过累积探索和反馈进行适应)。我们对这三个学习维度进行了全面的实证研究,并发现了几个有价值的发现,例如(i)互动可以提高学习效果;(ii)概念理解是规模涌现的,并且有利于更大的模型;(iii)LLMs 是有效的少样本学习者但不是多样本学习者。基于我们的框架和实证发现,我们引入了一个基准,该基准提供了对LLMs在三个认知学习维度上一般学习能力的统一和现实的评估。它提供了诊断性的见解,并支持对更适应性和类人模型的评估和开发。
Summary / 总结
This study aims to explore the learning mechanisms of large language models (LLMs) by introducing a framework inspired by cognitive psychology. The framework decomposes learning into three dimensions: Learning from Instructor, Learning from Concept, and Learning from Experience. The empirical study across these dimensions reveals that interaction enhances learning, conceptual understanding scales with model size, and LLMs excel in few-shot learning but struggle with many-shot learning. The research provides a benchmark for evaluating LLMs' general learning abilities and supports the development of more adaptive models.
该研究通过借鉴认知心理学,提出了一个框架来探讨大型语言模型(LLMs)的学习能力。该框架将学习分解为三个维度:从指导者学习、从概念学习和从经验学习。跨这些维度的实证研究发现,互动可以提高学习效果,概念理解随模型规模增加而增强,LLMs 在少量示例学习方面表现出色但在大量示例学习方面表现不佳。研究引入了一个基准,用于全面评估LLMs 的一般学习能力。
Degradation-Aware All-in-One Image Restoration via Latent Prior Encoding
Authors: S M A Sharif, Abdur Rehman, Fayaz Ali Dharejo, Radu Timofte, Rizwan Ali Naqvi
First: 2025-09-22T13:51:09+00:00 · Latest: 2025-12-26T14:28:16+00:00
Abstract
Real-world images often suffer from spatially diverse degradations such as haze, rain, snow, and low-light, significantly impacting visual quality and downstream vision tasks. Existing all-in-one restoration (AIR) approaches either depend on external text prompts or embed hand-crafted architectural priors (e.g., frequency heuristics); both impose discrete, brittle assumptions that weaken generalization to unseen or mixed degradations. To address this limitation, we propose to reframe AIR as learned latent prior inference, where degradation-aware representations are automatically inferred from the input without explicit task cues. Based on latent priors, we formulate AIR as a structured reasoning paradigm: (1) which features to route (adaptive feature selection), (2) where to restore (spatial localization), and (3) what to restore (degradation semantics). We design a lightweight decoding module that efficiently leverages these latent encoded cues for spatially-adaptive restoration. Extensive experiments across six common degradation tasks, five compound settings, and previously unseen degradations demonstrate that our method outperforms state-of-the-art (SOTA) approaches, achieving an average PSNR improvement of 1.68 dB while being three times more efficient.
中文标题/摘要
标题:基于潜在先验编码的综合去退化图像恢复
现实世界中的图像往往遭受空间多样化的退化,如雾霾、雨、雪和低光照,严重影响了视觉质量和下游视觉任务。现有的综合去退化(AIR)方法要么依赖外部文本提示,要么嵌入手工构建的架构先验(例如,频率启发式);这两种方法都施加了离散且脆弱的假设,削弱了对未见过或混合退化的泛化能力。为了解决这一局限性,我们提出将AIR重新定义为学习潜在先验推理,其中退化感知的表示可以从输入中自动推断,无需显式的任务提示。基于潜在先验,我们将AIR形式化为一种结构化推理范式:(1)哪些特征进行路由(自适应特征选择),(2)在哪里恢复(空间定位),(3)恢复什么(退化语义)。我们设计了一个轻量级的解码模块,有效地利用这些潜在编码线索进行空间自适应恢复。在六种常见退化任务、五种复合设置以及未见过的退化中进行的广泛实验表明,我们的方法优于最先进的(SOTA)方法,平均PSNR提高了1.68 dB,同时效率提高了三倍。
Summary / 总结
The paper addresses the challenge of restoring images with various degradations such as haze, rain, and low-light conditions. It proposes a degradation-aware all-in-one image restoration method that infers latent priors from the input image without external prompts or hand-crafted priors. The method formulates the restoration process as a structured reasoning paradigm, focusing on adaptive feature selection, spatial localization, and degradation semantics. Experimental results show that the proposed method outperforms existing approaches, with an average PSNR improvement of 1.68 dB and higher efficiency.
论文针对具有多种退化(如雾、雨、低光照等)的现实世界图像恢复问题。提出了一种新颖的一站式图像恢复方法,该方法直接从输入图像中学习潜在先验,无需外部提示或手工制作的先验。该方法将恢复过程建模为一个结构化的推理任务,关注自适应特征选择、空间定位和退化语义。实验结果表明,所提出的方法优于现有最先进的技术,平均PSNR提高了1.68 dB,并且效率提高了三倍。
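For readers unfamiliar with the metric, the sketch below shows the standard PSNR definition and what a gain of 1.68 dB means in terms of mean squared error for a single image. The helper names and the 8-bit peak value of 255 are assumptions for illustration; the paper's evaluation protocol is not described in this digest.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel sequences."""
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)
```

Under this definition, `mse_ratio_for_gain(1.68)` is roughly 0.68, so a 1.68 dB gain on one image corresponds to about a one-third reduction in mean squared error. An average over many images does not translate this exactly.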
Advancing Multimodal Teacher Sentiment Analysis: The Large-Scale T-MED Dataset & The Effective AAM-TSA Model
Authors: Zhiyi Duan, Xiangren Wang, Hongyu Yuan, Qianli Xing
First: 2025-12-23T17:42:16+00:00 · Latest: 2025-12-26T14:11:13+00:00
Abstract
Teachers' emotional states are critical in educational scenarios, profoundly impacting teaching efficacy, student engagement, and learning achievements. However, existing studies often fail to accurately capture teachers' emotions due to the performative nature and overlook the critical impact of instructional information on emotional expression. In this paper, we systematically investigate teacher sentiment analysis by building both the dataset and the model accordingly. We construct the first large-scale teacher multimodal sentiment analysis dataset, T-MED. To ensure labeling accuracy and efficiency, we employ a human-machine collaborative labeling process. The T-MED dataset includes 14,938 instances of teacher emotional data from 250 real classrooms across 11 subjects ranging from K-12 to higher education, integrating multimodal text, audio, video, and instructional information. Furthermore, we propose a novel asymmetric attention-based multimodal teacher sentiment analysis model, AAM-TSA. AAM-TSA introduces an asymmetric attention mechanism and hierarchical gating unit to enable differentiated cross-modal feature fusion and precise emotional classification. Experimental results demonstrate that AAM-TSA significantly outperforms existing state-of-the-art methods in terms of accuracy and interpretability on the T-MED dataset.
中文标题/摘要
标题:推进多模态教师情感分析:T-MED数据集与有效的AAM-TSA模型
教师的情感状态在教育场景中至关重要,深刻影响着教学效果、学生参与度和学习成果。然而,现有研究往往由于表演性因素而未能准确捕捉教师的情感,并且忽视了教学信息对情感表达的关键影响。本文系统地研究了教师情感分析,构建了相应的数据集和模型。我们构建了首个大规模教师多模态情感分析数据集T-MED。为了确保标注准确性和效率,我们采用了人机协作标注过程。T-MED数据集包含来自11个学科的250个真实教室的14,938个教师情感数据实例,范围从K-12到高等教育,整合了多模态文本、音频、视频和教学信息。此外,我们提出了一种新颖的非对称注意力机制多模态教师情感分析模型AAM-TSA。AAM-TSA引入了非对称注意力机制和分层门控单元,以实现不同模态特征的差异化融合和精确的情感分类。实验结果表明,AAM-TSA在T-MED数据集上的准确性和可解释性显著优于现有最先进的方法。
Summary / 总结
This paper addresses the importance of teachers' emotional states in education by developing the T-MED dataset and the AAM-TSA model. T-MED is a large-scale multimodal dataset that includes 14,938 instances of teacher emotional data from 250 classrooms, integrating text, audio, video, and instructional information. The AAM-TSA model uses an asymmetric attention mechanism and hierarchical gating unit to improve cross-modal feature fusion and emotional classification accuracy. Experimental results show that AAM-TSA outperforms existing methods in terms of accuracy and interpretability on the T-MED dataset.
本文通过开发大规模多模态数据集T-MED和有效的不对称注意力机制模型AAM-TSA,关注教师情绪状态在教育中的重要性。T-MED包含来自250个教室的14,938个教师情绪数据实例,整合了文本、音频、视频和教学信息。AAM-TSA利用不对称注意力机制和分层门控单元,提高跨模态特征融合和情绪分类的准确性,在T-MED数据集上优于现有方法。
Universal Reasoning Model
Authors: Zitian Gao, Lynx Chen, Yihao Xiao, He Xing, Ran Tao, Haoming Luo, Joey Zhou, Bryan Dai
First: 2025-12-16T18:58:45+00:00 · Latest: 2025-12-26T13:44:48+00:00
Abstract
Universal transformers (UTs) have been widely used for complex reasoning tasks such as ARC-AGI and Sudoku, yet the specific sources of their performance gains remain underexplored. In this work, we systematically analyze UT variants and show that improvements on ARC-AGI primarily arise from the recurrent inductive bias and strong nonlinear components of the Transformer, rather than from elaborate architectural designs. Motivated by this finding, we propose the Universal Reasoning Model (URM), which enhances the UT with short convolution and truncated backpropagation. Our approach substantially improves reasoning performance, achieving state-of-the-art 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2. Our code is available at https://github.com/UbiquantAI/URM.
中文标题/摘要
标题:通用推理模型
通用变换器(UTs)已被广泛用于复杂的推理任务,如ARC-AGI和数独,但其性能提升的具体来源尚未充分探索。在本文中,我们系统地分析了UTs的各种变体,并表明ARC-AGI上的改进主要来自Transformer的递归归纳偏见和强大的非线性组件,而不是复杂的架构设计。受这一发现的启发,我们提出了通用推理模型(URM),该模型通过短卷积和截断反向传播增强了UT。我们的方法显著提高了推理性能,在ARC-AGI 1上达到了最先进的53.8% pass@1,在ARC-AGI 2上达到了16.0% pass@1。我们的代码可在https://github.com/UbiquantAI/URM 获取。
Summary / 总结
This study investigates the performance gains of universal transformers (UTs) in complex reasoning tasks like ARC-AGI and Sudoku. It finds that UTs' improvements mainly come from their recurrent inductive bias and strong nonlinear components. Motivated by this, the authors propose the Universal Reasoning Model (URM), which adds short convolution and truncated backpropagation to UTs. The URM significantly enhances reasoning performance, achieving the best results of 53.8% pass@1 on ARC-AGI 1 and 16.0% pass@1 on ARC-AGI 2.
该研究探讨了通用变换器(UTs)在复杂推理任务如ARC-AGI和数独中的性能提升原因,发现UTs的改进主要来自其递归归纳偏见和强大的非线性组件。基于这一发现,作者提出了通用推理模型(URM),该模型在UTs中增加了短卷积和截断反向传播。URM显著提升了推理性能,在ARC-AGI 1中达到最佳结果53.8% pass@1,在ARC-AGI 2中达到16.0% pass@1。
Accelerating Training Speed of Tiny Recursive Models with Curriculum Guided Adaptive Recursion
Authors: Kaleem Ullah Qasim, Jiashu Zhang
First: 2025-11-11T08:17:23+00:00 · Latest: 2025-12-26T13:40:53+00:00
Abstract
Background: Recursive reasoning models achieve strong performance through iterative refinement, allowing small networks to match large language models. However, training is computationally expensive, often requiring 36 GPU-hours for Sudoku-Extreme. Existing models use fixed recursion depth and uniform supervision weighting, leading to inefficient training. Objectives: We propose CGAR (Curriculum-Guided Adaptive Recursion), applying curriculum learning to architectural depth. CGAR introduces Progressive Depth Curriculum (PDC) to dynamically adjust recursion depth and Hierarchical Supervision Weighting (HSW) to apply exponentially decaying importance to supervision steps. Methods: PDC implements a three-stage schedule transitioning from shallow (2, 1) to full depth (6, 3) configurations, providing 41.4% FLOPs reduction. HSW applies exponential decay to supervision steps, achieving 40% gradient variance reduction and accelerated convergence. Results: On Sudoku-Extreme, CGAR achieves 1.71x training speedup (10.93 to 6.38 hours) with only a 0.63% accuracy drop (86.65% to 86.02%). PDC alone achieves 2.26x speedup with 85.47% accuracy, showing a Pareto improvement in efficiency and quality. HSW provides 1.61x speedup. CGAR-trained models show superior inference efficiency with 100% halting accuracy and 11% fewer reasoning steps. Conclusions: CGAR enables efficient training of recursive models on modest hardware. By treating depth as a scheduled parameter, it achieves substantial savings and prevents overfitting, making these models practical for neurosymbolic AI and program synthesis. Resources: https://github.com/Kaleemullahqasim/CGAR and huggingface.co/Kaleemullah/trm-cgar-sudoku.
中文标题/摘要
标题:使用课程引导自适应递归加速小型递归模型的训练速度
背景:递归推理模型通过迭代细化实现强大的性能,使小型网络能够匹敌大型语言模型。然而,训练计算成本高昂,通常需要36个GPU小时来完成数独极限任务。现有模型使用固定的递归深度和均匀的监督权重,导致训练效率低下。目标:我们提出了CGAR(课程引导自适应递归),将课程学习应用于架构深度。CGAR引入了渐进深度课程(PDC)来动态调整递归深度,并引入了层次监督权重(HSW)来对监督步骤应用指数衰减的重要性。方法:PDC实现了一个三阶段计划,从浅层(2, 1)过渡到全深度(6, 3)配置,提供了41.4%的FLOPs减少。HSW对监督步骤应用指数衰减,实现了40%的梯度方差减少和加速收敛。结果:在数独极限任务上,CGAR实现了1.71倍的训练加速(从10.93小时到6.38小时),准确率仅下降0.63%(从86.65%到86.02%)。PDC单独实现了2.26倍的加速,准确率为85.47%,显示了效率和质量的帕累托改进。HSW提供了1.61倍的加速。CGAR训练的模型在推理效率上表现出色,具有100%的停止准确率和11%更少的推理步骤。结论:CGAR使递归模型在有限硬件上高效训练成为可能。通过将深度视为计划参数,它实现了显著的节省并防止过拟合,使这些模型适用于神经符号AI和程序合成。https://github.com/Kaleemullahqasim/CGAR 和 huggingface.co/Kaleemullah/trm-cgar-sudoku。
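The two ingredients named in the abstract, a staged depth schedule and exponentially decaying supervision weights, can be sketched as follows. Only the shallow (2, 1) and full (6, 3) configurations come from the abstract. The intermediate (4, 2) stage, the 30% and 60% stage boundaries, the decay rate of 0.7, and all function names are illustrative assumptions, not values confirmed by the source.

```python
def depth_schedule(progress):
    """Return a recursion configuration for training progress in [0, 1].

    Three stages, shallow to full depth. The middle stage and the
    boundaries between stages are assumptions made for illustration.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if progress < 0.3:
        return (2, 1)  # shallow configuration (from the abstract)
    if progress < 0.6:
        return (4, 2)  # hypothetical intermediate configuration
    return (6, 3)      # full-depth configuration (from the abstract)


def supervision_weights(num_steps, decay=0.7):
    """Exponentially decaying weights over supervision steps, summing to 1."""
    raw = [decay ** step for step in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_loss(step_losses, decay=0.7):
    """Combine per-step losses using the decaying weights."""
    weights = supervision_weights(len(step_losses), decay)
    return sum(w * loss for w, loss in zip(weights, step_losses))
```

As a check on the reported figures, 10.93 hours divided by 6.38 hours is about 1.71, which matches the stated training speedup.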
Meta-Learning-Based Handover Management in NextG O-RAN
Authors: Michail Kalntis, George Iosifidis, José Suárez-Varela, Andra Lutu, Fernando A. Kuipers
First: 2025-12-26T13:01:46+00:00 · Latest: 2025-12-26T13:01:46+00:00
Abstract
While traditional handovers (THOs) have served as a backbone for mobile connectivity, they increasingly suffer from failures and delays, especially in dense deployments and high-frequency bands. To address these limitations, 3GPP introduced Conditional Handovers (CHOs) that enable proactive cell reservations and user-driven execution. However, both handover (HO) types present intricate trade-offs in signaling, resource usage, and reliability. This paper presents unique, countrywide mobility management datasets from a top-tier mobile network operator (MNO) that offer fresh insights into these issues and call for adaptive and robust HO control in next-generation networks. Motivated by these findings, we propose CONTRA, a framework that, for the first time, jointly optimizes THOs and CHOs within the O-RAN architecture. We study two variants of CONTRA: one where users are a priori assigned to one of the HO types, reflecting distinct service or user-specific requirements, as well as a more dynamic formulation where the controller decides on-the-fly the HO type, based on system conditions and needs. To this end, it relies on a practical meta-learning algorithm that adapts to runtime observations and guarantees performance comparable to an oracle with perfect future information (universal no-regret). CONTRA is specifically designed for near-real-time deployment as an O-RAN xApp and aligns with the 6G goals of flexible and intelligent control. Extensive evaluations leveraging crowdsourced datasets show that CONTRA improves user throughput and reduces both THO and CHO switching costs, outperforming 3GPP-compliant and Reinforcement Learning (RL) baselines in dynamic and real-world scenarios.
Summary / 总结
This paper addresses the trade-offs between traditional handovers (THOs), which increasingly suffer from failures and delays in dense deployments and high-frequency bands, and the Conditional Handovers (CHOs) introduced by 3GPP. Motivated by countrywide operator datasets that call for adaptive and robust HO control, the authors propose CONTRA, a framework that jointly optimizes THOs and CHOs within the O-RAN architecture. CONTRA uses a meta-learning algorithm that adapts to runtime observations and outperforms 3GPP-compliant and RL baselines, improving user throughput and reducing switching costs.
本文针对传统切换(THOs)和条件切换(CHOs)在下一代网络中的局限性,提出了CONTRA框架,该框架在O-RAN架构中联合优化THOs和CHOs。CONTRA使用元学习算法适应运行时观察,并实现与具有完美未来信息的Oracle相当的性能。评估结果显示,CONTRA提高了用户吞吐量并减少了切换成本,在动态场景中优于3GPP合规和强化学习(RL)基线。
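The abstract does not specify CONTRA's meta-learning algorithm. As a stand-in, the sketch below uses the standard Hedge (exponential weights) learner, a textbook online method with low regret against the best fixed expert, to choose between two hypothetical policies, one per handover type. The per-round costs are made-up numbers, and nothing here should be read as the authors' method.

```python
import math


class Hedge:
    """Exponential-weights learner over a fixed set of experts."""

    def __init__(self, num_experts, learning_rate=0.5):
        self.weights = [1.0] * num_experts
        self.learning_rate = learning_rate

    def probabilities(self):
        """Current probability of following each expert."""
        total = sum(self.weights)
        return [w / total for w in self.weights]

    def update(self, losses):
        """Down-weight each expert by its observed loss in [0, 1]."""
        self.weights = [
            w * math.exp(-self.learning_rate * loss)
            for w, loss in zip(self.weights, losses)
        ]


# Experts: index 0 = hypothetical THO policy, index 1 = hypothetical CHO policy.
learner = Hedge(num_experts=2)
for _ in range(20):
    learner.update([0.8, 0.2])  # made-up per-round costs
final_probs = learner.probabilities()
```

With the second policy consistently cheaper, the learner shifts almost all probability toward it. In a changing environment the costs would vary per round and the weights would track whichever policy has performed better so far.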