arXiv 论文速递

2025-12-16 03:25
Snapshot: 20251216_0325
Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation
Authors: Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu
First: 2025-12-12T18:56:35+00:00 · Latest: 2025-12-12T18:56:35+00:00
Comments: Project Website: https://sam2videox.github.io/
Abstract
Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/.
中文标题/摘要
标题:从跟踪中推断结构:提炼保结构运动以生成视频
现实是刚性约束与可变形结构之间的舞蹈。对于视频模型来说,这意味着生成既保持保真度又保持结构的运动。尽管在扩散模型方面取得了进展,但生成现实的保结构运动仍然具有挑战性,尤其是对于如人类和动物等具有关节和可变形物体。迄今为止,仅扩大训练数据尚未解决物理上不合理的过渡问题。现有方法依赖于使用噪声运动表示(如光学流或外部不完美模型提取的骨架)进行条件化。为了解决这些挑战,我们提出了一种算法,将来自自回归视频跟踪模型(SAM2)的保结构运动先验提炼到双向视频扩散模型(CogVideoX)中。通过我们的方法,我们训练了SAM2VideoX,其中包含两项创新:(1)双向特征融合模块,从类似于SAM2的递归模型中提取全局保结构运动先验;(2)局部格拉姆流损失,使局部特征的移动方式保持一致。在VBench上的实验和人类研究中,SAM2VideoX在VBench上实现了95.51%,超越了REPA(92.91%)2.60%,并将FVD降低到360.57,分别比REPA-微调和LoRA-微调提高了21.20%和22.46%。项目网站可访问 https://sam2videox.github.io/ 。
Summary / 总结
This research aims to generate realistic, structure-preserving motion in videos, targeting the physically implausible transitions that current models produce for articulated and deformable objects. The method distills structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). Experiments show that SAM2VideoX outperforms prior baselines: it scores 95.51% on VBench, 2.60 points above REPA (92.91%), lowers FVD by 21-22% relative to REPA- and LoRA-finetuning, and obtains 71.4% human preference.
研究旨在通过解决现有扩散模型的局限性,生成真实且结构保持的视频运动。方法是使用自回归视频跟踪模型(SAM2)提取结构保持的运动先验,然后将其蒸馏到双向视频扩散模型(CogVideoX)中。实验表明,SAM2VideoX 优于先前的基线方法:VBench 得分提高 2.60 个百分点,FVD 降低 21-22%,并在人类评测中获得 71.4% 的偏好率。
Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective
Authors: Etienne Boursier, Claire Boyer
First: 2025-12-12T18:54:52+00:00 · Latest: 2025-12-12T18:54:52+00:00
Abstract
Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.
中文标题/摘要
标题:Softmax作为大提示下的线性注意力:基于测度的观点
Softmax注意力是Transformer架构中的核心组件,但其非线性结构给理论分析带来了重大挑战。我们开发了一个统一的基于测度的框架,用于研究有限和无限提示下的单层softmax注意力。对于独立同分布的高斯输入,我们利用这样一个事实:softmax算子在无限提示极限下收敛于作用在底层输入词元测度上的线性算子。基于这一洞察,我们建立了softmax注意力输出和梯度的非渐近集中界,量化了有限提示模型以多快的速度逼近其无限提示对应物,并证明在具有次高斯词元的一般上下文学习设置中,这种集中性在整个训练轨迹上保持稳定。在上下文线性回归的情况下,我们利用可处理的无限提示动力学来分析有限提示长度下的训练。我们的结果使得针对线性注意力发展的优化分析可以在提示足够长时直接迁移到softmax注意力,表明大提示下的softmax注意力继承了其线性对应物的分析结构。这反过来为研究softmax注意力层在大提示情形下的训练动力学和统计行为提供了一个有原则且广泛适用的工具箱。
Summary / 总结
The paper develops a measure-based framework to study softmax attention in transformers, focusing on both finite and infinite prompt regimes. By leveraging the infinite-prompt limit where softmax converges to a linear operator, the authors establish concentration bounds for the output and gradients, showing that finite-prompt models approach their infinite-prompt counterparts rapidly and stably during training. This allows optimization analyses for linear attention to be applied to softmax attention in large-prompt settings, providing insights into the training dynamics and statistical behavior of softmax layers.
论文开发了一种基于测度的方法来分析transformer中的softmax注意力机制,特别是在大提示长度的情况下。通过利用softmax在无限提示极限下收敛到线性操作符的事实,作者建立了输出和梯度的集中性边界,表明有限提示模型能够快速且稳定地接近其无限提示的对应物。这种分析允许在提示足够长时将线性注意力的优化技术直接应用于softmax注意力,提供了关于softmax层在大提示设置下的训练动态和统计行为的见解。
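The large-prompt limit described in this entry can be checked numerically in a deliberately simplified setting. The sketch below is not the paper's construction: it assumes i.i.d. standard Gaussian tokens that act as both keys and values, with no learned weight matrices. In that setting the softmax-weighted average of the tokens converges to the query itself, a standard fact about exponentially tilted Gaussians, so the limiting map from query to output is linear. The function names and the example query are illustrative.

```python
import math
import random


def softmax_attention(query, tokens):
    """Single-query softmax attention; each token is used as both key and value."""
    scores = [sum(q * x for q, x in zip(query, tok)) for tok in tokens]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(query)
    return [sum(w * tok[j] for w, tok in zip(weights, tokens)) / total
            for j in range(dim)]


def gap_to_linear_limit(query, n_tokens, rng):
    """Euclidean distance between the finite-prompt output and the limit (the query)."""
    tokens = [[rng.gauss(0.0, 1.0) for _ in query] for _ in range(n_tokens)]
    out = softmax_attention(query, tokens)
    return math.sqrt(sum((o - q) ** 2 for o, q in zip(out, query)))


rng = random.Random(0)
query = [0.5, -0.3, 0.2, 0.0]  # assumed small-norm query; large norms converge slowly
for n in (10, 1_000, 100_000):
    print(n, round(gap_to_linear_limit(query, n, rng), 4))
```

The printed gap should shrink as the prompt length grows. The rate observed here is only a Monte Carlo illustration and says nothing about the non-asymptotic bounds proved in the paper.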
Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously
Authors: Andrew Adiletta, Kathryn Adiletta, Kemal Derya, Berk Sunar
First: 2025-12-12T18:52:09+00:00 · Latest: 2025-12-12T18:52:09+00:00
Comments: 13 pages, 5 Figures
Abstract
The rapid deployment of Large Language Models (LLMs) has created an urgent need for enhanced security and privacy measures in Machine Learning (ML). LLMs are increasingly being used to process untrusted text inputs and even generate executable code, often while having access to sensitive system controls. To address these security concerns, several companies have introduced guard models, which are smaller, specialized models designed to protect text generation models from adversarial or malicious inputs. In this work, we advance the study of adversarial inputs by introducing Super Suffixes, suffixes capable of overriding multiple alignment objectives across various models with different tokenization schemes. We demonstrate their effectiveness, along with our joint optimization technique, by successfully bypassing the protection mechanisms of Llama Prompt Guard 2 on five different text generation models for malicious text and code generation. To the best of our knowledge, this is the first work to reveal that Llama Prompt Guard 2 can be compromised through joint optimization. Additionally, by analyzing the changing similarity of a model's internal state to specific concept directions during token sequence processing, we propose an effective and lightweight method to detect Super Suffix attacks. We show that the cosine similarity between the residual stream and certain concept directions serves as a distinctive fingerprint of model intent. Our proposed countermeasure, DeltaGuard, significantly improves the detection of malicious prompts generated through Super Suffixes. It increases the non-benign classification rate to nearly 100%, making DeltaGuard a valuable addition to the guard model stack and enhancing robustness against adversarial prompt attacks.
中文标题/摘要
标题:超级后缀:同时绕过文本生成对齐和防护模型
大型语言模型(LLMs)的快速部署迫切需要在机器学习(ML)中增强安全和隐私措施。LLMs 越来越多地被用于处理不可信的文本输入,甚至生成可执行代码,同时可能拥有访问敏感系统控制的权限。为应对这些安全问题,多家公司引入了防护模型,这是一种较小的、专门设计的模型,旨在保护文本生成模型免受敌对或恶意输入的影响。在本研究中,我们通过引入超级后缀推进了对抗输入的研究,超级后缀能够在不同分词方案的多种模型中同时覆盖多个对齐目标。我们通过成功绕过 Llama Prompt Guard 2 对五种不同文本生成模型的恶意文本和代码生成保护机制,展示了其有效性以及我们的联合优化技术。据我们所知,这是首次证明 Llama Prompt Guard 2 可以通过联合优化被攻破的工作。此外,通过分析模型在处理分词序列过程中内部状态与特定概念方向相似性的变化,我们提出了一种有效且轻量级的方法来检测超级后缀攻击。我们表明,残差流与某些概念方向之间的余弦相似度充当了模型意图的独特指纹。我们提出的对策 DeltaGuard 显著提高了对通过超级后缀生成的恶意提示的检测率,使其非良性分类率接近 100%,使 DeltaGuard 成为防护模型堆栈中的重要补充,增强了对抗敌对提示攻击的鲁棒性。
Summary / 总结
This paper addresses the security risks posed by Large Language Models (LLMs) when processing untrusted text inputs and generating executable code. It introduces Super Suffixes, which can bypass the protection mechanisms of Llama Prompt Guard 2 across different text generation models. The authors also propose DeltaGuard, a lightweight method to detect Super Suffix attacks by analyzing the model's internal state similarity to specific concept directions, significantly improving the detection rate of malicious prompts.
该论文针对大型语言模型的安全问题,引入了Super Suffixes,这是一种能够在不同文本生成模型上同时覆盖多个对齐目标的后缀。作者通过在五个模型上成功绕过Llama Prompt Guard 2来证明Super Suffixes的有效性。他们还提出了一种名为DeltaGuard的轻量级防御措施,通过计算残差流与特定概念方向之间的余弦相似度来检测Super Suffix攻击,将恶意提示的非良性分类率提高到接近100%。
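The detection signal in this entry, cosine similarity between the residual stream and a concept direction tracked across token positions, can be illustrated with plain vectors. This is a minimal sketch under stated assumptions, not DeltaGuard itself: the per-token hidden states and the concept direction are toy two-dimensional lists invented here, and tracking position-to-position changes is an assumption suggested only by the abstract's phrase "changing similarity".

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_trace(hidden_states, concept_direction):
    """One similarity value per token position."""
    return [cosine_similarity(h, concept_direction) for h in hidden_states]


def similarity_deltas(trace):
    """Change in similarity between consecutive token positions."""
    return [later - earlier for earlier, later in zip(trace, trace[1:])]


# Toy stand-ins for per-token residual-stream vectors (hypothetical values).
hidden_states = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
concept_direction = [0.0, 1.0]

trace = similarity_trace(hidden_states, concept_direction)
print(trace)
print(similarity_deltas(trace))
```

In a real setting the vectors would come from a model's intermediate activations. How the paper extracts the concept directions and turns the trace into a classification is not specified in the abstract and is left out here.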
Agile Flight Emerges from Multi-Agent Competitive Racing
Authors: Vineet Pasumarti, Lorenzo Bianchi, Antonio Loquercio
First: 2025-12-12T18:48:50+00:00 · Latest: 2025-12-12T18:48:50+00:00
Abstract
Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: https://github.com/Jirl-upenn/AgileFlight_MultiAgent
中文标题/摘要
标题:敏捷飞行源自多智能体竞速比赛
通过多智能体竞争和获胜比赛的稀疏高层目标,我们发现,敏捷飞行(例如,高速运动使平台达到物理极限)和策略(例如,超越或阻挡)均源自使用强化学习训练的智能体。我们在模拟和现实世界中提供了证据,表明这种方法在环境复杂性增加(例如,存在障碍物)时,优于单独训练智能体并用规定行为的奖励进行训练的常见范式。此外,我们发现多智能体竞争产生的策略在现实世界中的转移性比使用单智能体进度奖励训练的策略更可靠,尽管两种方法使用相同的模拟环境、随机化策略和硬件。除了改进的模拟到现实世界的转移性,多智能体策略还表现出一定程度的对未在训练中遇到的对手的泛化能力。总体而言,我们的工作,沿袭了数字领域多智能体竞争游戏的传统,表明稀疏的任务级奖励足以训练出能够在物理世界中执行高级低级控制的智能体。
Summary / 总结
The research aims to explore how multi-agent competition can lead to the emergence of agile flight and strategic behavior in agents trained with reinforcement learning. The method involves training agents in a competitive racing environment with sparse high-level rewards, focusing on winning the race. Key experimental findings show that this approach outperforms single-agent training with detailed rewards, especially in complex environments with obstacles. Additionally, multi-agent policies exhibit better sim-to-real transfer and some generalization to unseen opponents.
研究旨在通过强化学习探索多智能体竞争如何产生敏捷飞行和策略行为。方法是在竞速环境中使用稀疏的高层奖励训练智能体,专注于赢得比赛。关键实验发现表明,这种方法在复杂环境(如存在障碍物)中优于使用详细奖励的单智能体训练。此外,多智能体策略在模拟到现实的迁移方面表现更好,并且对训练中未见过的对手也具有一定的泛化能力。
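The contrast this entry draws between a sparse task-level reward and a behavior-prescribing progress reward can be written as two small functions. This is an illustrative sketch only: the signatures, the reward magnitudes, and the notion of "progress along the raceline" as a scalar are assumptions made here, not the reward definitions used by the authors.

```python
def sparse_win_reward(episode_over: bool, agent_won: bool) -> float:
    """Sparse objective: reward arrives only at the end, and only for winning."""
    return 1.0 if (episode_over and agent_won) else 0.0


def progress_reward(prev_progress: float, progress: float) -> float:
    """Dense objective: reward equals the distance gained along the raceline this step."""
    return progress - prev_progress


# Toy episode: raceline progress (in metres) after each step; the agent wins at the end.
progress_log = [0.0, 1.5, 3.5, 6.0]
last_step = len(progress_log) - 2

dense_return = sum(progress_reward(a, b) for a, b in zip(progress_log, progress_log[1:]))
sparse_return = sum(sparse_win_reward(step == last_step, agent_won=True)
                    for step in range(len(progress_log) - 1))
print(dense_return, sparse_return)
```

The dense reward tells the agent how to behave at every step, while the sparse reward only says whether the race was won, which leaves behaviors such as overtaking or blocking to emerge from competition.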
Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting
Authors: Mohammad Dehghanmanshadi, Wallapak Tavanapong
First: 2025-12-12T18:19:41+00:00 · Latest: 2025-12-12T18:19:41+00:00
Comments: Accepted at ICMLA 2025
Abstract
Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: https://github.com/MohammadDehghan/InST-Microscopy.
中文标题/摘要
标题:基于扩散的领域自适应缩小细胞计数中的领域差距
生成逼真的合成显微镜图像对于在标签稀缺环境中训练深度学习模型至关重要,例如每张图像中有许多细胞的细胞计数。然而,当合成图像缺乏真实样本的复杂纹理和视觉模式时,传统的领域自适应方法往往难以弥合领域差距。在本文中,我们将最初为艺术风格迁移设计的基于反演的风格迁移(InST)框架应用于生物医学显微镜图像。我们的方法在扩散模型中结合潜在空间自适应实例归一化与随机反演,将真实荧光显微镜图像的风格迁移到合成图像上,同时弱保留内容结构。我们通过在多种数据源(包括真实数据、硬编码的合成数据和公开的Cell200-s数据集)上预训练和微调EfficientNet-B0模型,评估基于InST的合成数据集在下游细胞计数中的有效性。使用InST合成图像训练的模型,其平均绝对误差(MAE)比使用硬编码合成数据训练的模型最多低37%,比使用Cell200-s训练的模型低52%(从53.70降至25.95 MAE)。值得注意的是,我们的方法还优于仅使用真实数据训练的模型(25.95 vs. 27.74 MAE)。将InST合成数据与轻量级领域自适应技术(如DACS与CutMix)结合,可以进一步提升性能。这些发现表明,基于InST的风格迁移最有效地缩小了合成与真实显微镜数据之间的领域差距。我们的方法为提升细胞计数性能并尽量减少人工标注工作量提供了一条可扩展的途径。源代码和资源可在以下链接获取:https://github.com/MohammadDehghan/InST-Microscopy。
Summary / 总结
This study addresses the challenge of bridging the domain gap between synthetic and real microscopy images for cell counting. The authors adapt the Inversion-Based Style Transfer (InST) framework to transfer the style from real images to synthetic ones while preserving content structure. Evaluations show that models trained with InST-synthesized images achieve up to 37% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data and a 52% reduction in MAE compared to models trained on Cell200-s. Combining InST-synthesized data with lightweight domain adaptation techniques further improves performance. The approach effectively reduces the domain gap and enhances cell counting performance with minimal manual labeling effort.
该研究旨在解决合成和真实显微镜图像之间在细胞计数中的领域差距问题。作者将Inversion-Based Style Transfer (InST)框架应用于从真实图像向合成图像转移样式,同时保留内容结构。评估结果显示,使用InST合成图像训练的模型相比使用硬编码合成数据训练的模型,Mean Absolute Error (MAE)最多可降低37%,相比使用Cell200-s训练的模型,MAE降低52%。结合InST合成数据和轻量级领域适应技术(如DACS与CutMix)进一步提高了性能。该方法有效减少了领域差距,同时在最小化手动标注努力的情况下提升了细胞计数性能。
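Adaptive Instance Normalization (AdaIN), which this entry applies in a diffusion model's latent space, re-standardizes each content channel and then gives it the style channel's mean and standard deviation. The sketch below shows only that per-channel statistic swap on plain lists. It is a simplified illustration, not the authors' implementation: the data layout (a list of channels, each a flat list of values), the `eps` placement, and the toy numbers are assumptions, and the stochastic inversion step is not modeled.

```python
import statistics


def adain(content, style, eps=1e-5):
    """Per-channel AdaIN: normalize content statistics, then apply style statistics.

    content, style: lists of channels, each channel a flat list of floats.
    """
    output = []
    for content_channel, style_channel in zip(content, style):
        c_mean = statistics.fmean(content_channel)
        s_mean = statistics.fmean(style_channel)
        c_std = statistics.pstdev(content_channel) + eps  # eps avoids division by zero
        s_std = statistics.pstdev(style_channel)
        output.append([s_std * (x - c_mean) / c_std + s_mean for x in content_channel])
    return output


# Toy single-channel example: a synthetic "content" latent and a real "style" latent.
content = [[1.0, 2.0, 3.0, 4.0]]
style = [[10.0, 20.0, 30.0, 40.0]]
stylized = adain(content, style)
print(stylized)
```

After the swap, the output channel keeps the relative ordering of the content values (its structure) while its mean and spread match the style channel, which is the sense in which style is transferred and content is preserved.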
SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support
Authors: Yuming Feng, Xinrui Jiang
First: 2025-12-12T18:05:52+00:00 · Latest: 2025-12-12T18:05:52+00:00
Comments: Code available at https://github.com/Harry20030331/SumForU
Abstract
Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.
中文标题/摘要
标题:SUMFORU:基于LLM的个性化购买决策支持评论总结框架
在线产品评论包含丰富但嘈杂的信号,令用户不堪重负并妨碍有效决策。现有的基于LLM的摘要器仍然是通用的,未能考虑个人偏好,限制了其实用价值。我们提出SUMFORU,这是一种可引导的评论摘要框架,使输出与明确的用户画像对齐,以支持个性化购买决策。我们的方法将基于Amazon 2023评论数据集构建的高质量数据管道与两阶段对齐程序相结合:(1) 通过非对称知识蒸馏进行的画像感知监督微调(SFT);(2) 基于AI反馈的强化学习(RLAIF),使用偏好估计器捕捉细粒度的、与画像相关的信号。我们使用基于规则、基于LLM和以人为中心的指标评估模型,展示了其在一致性、事实依据和偏好对齐方面的持续改进。我们的框架在所有评估设置中均取得最高性能,并能有效泛化到未见过的产品类别。我们的结果突显了可引导的多元对齐在构建下一代个性化决策支持系统方面的潜力。
Summary / 总结
The paper addresses the challenge of overwhelming and noisy online product reviews by proposing SUMFORU, a personalized review summarization framework. It uses a two-stage alignment process involving persona-aware Supervised Fine-Tuning and Reinforcement Learning with AI Feedback to tailor summaries to individual user preferences. The model outperforms existing methods in consistency, grounding, and preference alignment across various evaluation metrics and generalizes well to new product categories.
论文针对在线产品评论中信息过载和噪声问题,提出了SUMFORU,一种个性化评论摘要框架。该框架通过两阶段对齐过程,即基于人设的监督微调和基于AI反馈的强化学习,来定制摘要以适应个人用户偏好。实验结果显示,在多种评估指标上,该模型在一致性、定位和偏好对齐方面表现出持续改进,并且能够很好地泛化到新的产品类别中。
MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems
Authors: Barak Or
First: 2025-11-08T21:29:18+00:00 · Latest: 2025-12-12T17:56:26+00:00
Comments: preprint
Abstract
Ensuring cognitive stability in autonomous multi-agent systems (MAS) is a central challenge for large-scale, distributed AI. While existing observability tools monitor system outputs, they cannot quantify how rapidly agentic workflows recover once reasoning coherence has been lost. We adapt classical reliability metrics, namely Mean Time-to-Recovery (MTTR), Mean Time Between Failures (MTBF), and related ratios, into the cognitive domain, defining MTTR-A (Mean Time-to-Recovery for Agentic Systems) as a runtime measure of cognitive recovery latency. MTTR-A quantifies the time required for a MAS to detect reasoning drift and restore consistent operation, capturing the recovery of reasoning coherence rather than infrastructural repair. A benchmark simulation using the AG News corpus and the LangGraph orchestration framework was conducted, modeling recovery latencies across multiple reflex modes. Automated reflexes restored stability within approximately 6s on average, while human-approval interventions required about 12s. Across 200 runs, the median simulated MTTR-A was 6.21±2.14s, MTBF=6.7±2.14s, and NRR=0.08, demonstrating measurable runtime resilience across reflex strategies. By formalizing recovery latency as a quantifiable property of distributed reasoning, and deriving reliability bounds linking recovery time and cognitive uptime, this work establishes a foundation for runtime dependability in agentic cognition, transforming cognitive recovery from an ad-hoc process into a standardized, interpretable performance metric.
中文标题/摘要
标题:MTTR-A:多智能体系统中认知恢复延迟的度量
确保自主多智能体系统(MAS)的认知稳定性是大规模分布式人工智能中的核心挑战。现有可观测性工具监控系统输出,但无法量化智能体工作流在推理一致性丢失后恢复的速度。我们借鉴经典可靠性指标——平均恢复时间(MTTR)、平均故障间隔时间(MTBF)及相关比率——将其引入认知领域,定义MTTR-A(智能体系统平均恢复时间)作为运行时的认知恢复延迟度量。MTTR-A量化了MAS检测推理漂移并恢复一致运行所需的时间,捕捉的是推理一致性的恢复而非基础设施的修复。 使用AG News语料库和LangGraph编排框架进行了基准模拟,模拟了多种反射模式下的恢复延迟。自动反射在大约6秒内恢复了稳定性,而人工审批干预则需要约12秒。在200次运行中,模拟的中位数MTTR-A为6.21±2.14秒,MTBF=6.7±2.14秒,NRR=0.08,表明不同反射策略下的运行时弹性具有可测量性。 通过将恢复延迟形式化为分布式推理的可量化属性,并推导出恢复时间和认知运行时间之间的可靠性界限,这项工作为智能体认知的运行时可靠性奠定了基础,将认知恢复从一种随意的过程转变为一种标准化、可解释的性能指标。
Summary / 总结
This paper introduces MTTR-A, a measure for cognitive recovery latency in multi-agent systems (MAS), adapting classical reliability metrics to the cognitive domain. The study uses a benchmark simulation with the AG News corpus and LangGraph to evaluate reflex modes, showing that automated reflexes restore stability in about 6 seconds, while human interventions take around 12 seconds. The median simulated MTTR-A across 200 runs was 6.21±2.14 seconds, indicating measurable runtime resilience.
本文引入了MTTR-A,这是一种衡量多代理系统(MAS)认知恢复延迟的指标,将经典可靠性指标应用于认知领域。研究使用AG News语料库和LangGraph进行基准模拟,评估不同反射模式下的恢复延迟,结果显示自动化反射平均在6秒内恢复稳定性,而人工审批干预则需要约12秒。在200次运行中,模拟的MTTR-A中位数为6.21±2.14秒,表明反射策略在运行时具有可测量的弹性。
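The classical reliability quantities that this entry adapts can be computed from a log of incidents. The sketch below uses the textbook definitions (MTTR as mean repair time, MTBF as uptime per failure, availability as MTBF / (MTBF + MTTR)) applied to hypothetical "reasoning drift" incidents. The incident timestamps and the observation window are invented for illustration, and the paper's own MTTR-A, MTBF, and NRR definitions may differ from these formulas.

```python
from statistics import fmean


def mttr(incidents):
    """Mean time to recovery. incidents: list of (detected_at, recovered_at) in seconds."""
    return fmean(recovered - detected for detected, recovered in incidents)


def mtbf(incidents, window_start, window_end):
    """Mean time between failures: total uptime in the window divided by failure count."""
    downtime = sum(recovered - detected for detected, recovered in incidents)
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)


def availability(mttr_seconds, mtbf_seconds):
    """Steady-state fraction of time the system is operating coherently."""
    return mtbf_seconds / (mtbf_seconds + mttr_seconds)


# Invented log: three drift incidents inside a 100-second observation window.
incidents = [(10.0, 15.0), (40.0, 47.0), (70.0, 82.0)]
recovery = mttr(incidents)
between = mtbf(incidents, window_start=0.0, window_end=100.0)
print(recovery, between, availability(recovery, between))
```

With this toy log the mean recovery time is 8 s over 24 s of total downtime, so shortening recovery (for example with automated rather than human-approved reflexes) raises availability directly.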
UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI
Authors: Darvin Yi, Teng Liu, Mattie Terzolo, Lance Hasson, Ayan Sinha, Pablo Mendes, Andrew Rabinovich
First: 2025-11-15T17:39:37+00:00 · Latest: 2025-12-12T17:51:50+00:00
Abstract
As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation) ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework, where AI amplifies human capability through partnership rather than replacement.
中文标题/摘要
标题:UpBench:面向以人为本AI的动态演进的真实劳动力市场智能体基准框架
随着大型语言模型(LLM)代理越来越多地承担数字工作,需要可靠的框架来评估其在现实世界中的能力、适应性和与人类协作的能力。现有基准大多保持静态、合成或领域限制,提供的洞察有限,无法反映动态、经济上有意义的环境中的代理表现。我们介绍了UpBench,这是一种基于全球Upwork劳动力市场的实际工作的动态演变基准。每个任务对应一个经过验证的客户交易,将评估锚定在真实的劳动活动和财务结果上。UpBench采用基于评分的评估框架,其中专家自由职业者将每个任务分解为详细的、可验证的接受标准,并对AI提交内容进行逐项反馈评估。这种结构使我们能够对模型的优势、弱点和指令遵循的准确性进行精细分析,超越了二元通过/未通过的度量标准。在整个数据管道中(从任务筛选、评分标准构建到评估)整合了人类专业知识,确保符合真实的专业标准,并支持人类与AI协作的研究。通过定期更新任务以反映在线工作的演变,UpBench为评估代理系统在真实的劳动力市场环境中的表现提供了可扩展、以人为本的基础,提供了一条通往合作框架的道路,在这种框架中,AI通过伙伴关系而非替代来增强人类能力。
Summary / 总结
UpBench is a dynamically evolving benchmark for evaluating the real-world competence, adaptability, and human collaboration capabilities of large language model agents. It uses tasks from the global Upwork labor marketplace, ensuring evaluation in genuine work activity and financial outcomes. The benchmark employs a rubric-based evaluation framework where expert freelancers assess AI submissions with detailed, verifiable criteria, providing a fine-grained analysis of model strengths and weaknesses. By regularly refreshing tasks, UpBench offers a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts.
UpBench 是一个动态演化的基准框架,用于评估大型语言模型代理在现实世界中的专业能力、适应性和与人类的合作能力。它使用来自 Upwork 劳动力市场的任务,确保评估基于真实的日常工作活动。该基准采用基于评分表的评估框架,其中专家自由职业者对 AI 提交进行详细评估并提供反馈,从而提供比二元通过/未通过指标更细致的分析。这种方法在整个数据管道中整合了人类的专业知识,反映了实际的专业标准,并支持人类与 AI 的合作研究。
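Rubric-based evaluation, as described in this entry, replaces a single pass/fail verdict with one judgment per acceptance criterion. The following sketch shows one possible way to represent and aggregate such judgments. The `CriterionResult` class, the unweighted pass-rate aggregation, and the example job are all hypothetical and are not taken from UpBench.

```python
from dataclasses import dataclass


@dataclass
class CriterionResult:
    """One acceptance criterion as judged by an expert reviewer (hypothetical schema)."""
    description: str
    passed: bool
    feedback: str = ""


def rubric_score(results):
    """Fraction of criteria satisfied; an unweighted aggregate chosen for illustration."""
    if not results:
        raise ValueError("a rubric needs at least one criterion")
    return sum(r.passed for r in results) / len(results)


def failed_feedback(results):
    """Per-criterion feedback for everything the submission missed."""
    return [(r.description, r.feedback) for r in results if not r.passed]


# Invented example job: a short data-cleaning task.
results = [
    CriterionResult("Removes duplicate rows", True),
    CriterionResult("Preserves the original column order", True),
    CriterionResult("Documents every transformation", False, "No changelog was provided."),
    CriterionResult("Delivers the file as CSV", True),
]
print(rubric_score(results))
print(failed_feedback(results))
```

A binary metric would mark this submission as a failure, while the per-criterion view shows that three of four requirements were met and names the one that was not.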
REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
Authors: Annabelle Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh
Venue: NeurIPS 2025
First: 2025-06-02T07:02:46+00:00 · Latest: 2025-12-12T17:38:28+00:00
Comments: NeurIPS 2025
Abstract
While model serving has unlocked unprecedented capabilities, the high cost of serving large-scale models continues to be a significant barrier to widespread accessibility and rapid innovation. Compiler optimizations have long driven substantial performance improvements, but existing compilers struggle with neural workloads due to the exponentially large and highly interdependent space of possible transformations. Although existing stochastic search techniques can be effective, they are often sample-inefficient and fail to leverage the structural context underlying compilation decisions. We set out to investigate the research question of whether reasoning with large language models (LLMs), without any retraining, can leverage the context-aware decision space of compiler optimizations to significantly improve sample efficiency. To that end, we introduce a novel compilation framework (dubbed Reasoning Compiler) that formulates optimization as a sequential, context-aware decision process guided by a large language model and structured Monte Carlo tree search (MCTS). The LLM acts as a proposal mechanism, suggesting hardware-informed transformations that reflect the current program state and accumulated performance feedback. MCTS incorporates the LLM-generated proposals to balance exploration and exploitation, facilitating structured, context-sensitive traversal of the expansive compiler optimization space. By achieving substantial speedups with markedly fewer samples than leading neural compilers, our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.
中文标题/摘要
标题:推理编译器:面向高效模型服务的大语言模型引导优化
尽管模型服务解锁了前所未有的能力,但服务大规模模型的高成本仍然是广泛普及和快速创新的主要障碍。编译器优化长期以来推动了显著的性能提升,但现有编译器难以应对神经网络工作负载,因为可能的变换空间呈指数级庞大且高度相互依赖。尽管现有的随机搜索技术可能有效,但它们通常样本效率低下,并且未能利用编译决策背后的结构上下文。我们着手研究这样一个问题:在不进行任何重新训练的情况下,利用大语言模型(LLMs)进行推理,能否借助编译器优化的上下文感知决策空间来显著提高样本效率。为此,我们引入了一种新颖的编译框架(称为Reasoning Compiler),该框架将优化表述为由大语言模型和结构化蒙特卡洛树搜索(MCTS)引导的顺序、上下文感知决策过程。大语言模型充当提议机制,提出能反映当前程序状态和累积性能反馈的、结合硬件信息的变换。MCTS结合大语言模型生成的提议来平衡探索与利用,促进对庞大编译器优化空间进行结构化、上下文敏感的遍历。我们的方法以远少于领先神经编译器的样本数量实现了显著加速,展示了LLM引导的推理改变编译器优化格局的潜力。
Summary / 总结
The research aims to improve the efficiency of serving large-scale models by leveraging large language models (LLMs) for compiler optimizations. The Reasoning Compiler framework formulates optimization as a sequential, context-aware decision process guided by an LLM and structured Monte Carlo tree search (MCTS). This approach achieves significant speedups with fewer samples compared to existing neural compilers, demonstrating the potential of LLM-guided reasoning to enhance compiler optimization efficiency.
研究旨在通过改进编译器优化来解决大规模模型服务的高成本问题。该研究引入了一种名为Reasoning Compiler的框架,利用大型语言模型和结构化的蒙特卡洛树搜索来引导优化决策,通过上下文感知的建议进行硬件相关的转换。该方法在比现有神经编译器更少的样本下实现了显著的加速,表明LLM引导的推理在提高编译器优化效率方面的潜力。
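The exploration and exploitation balance that MCTS provides in this entry is commonly implemented with the UCB1 selection rule. The sketch below applies UCB1 to a handful of candidate transformations, which stand in for proposals an LLM might make. It is a generic illustration, not the Reasoning Compiler's search: the transformation names, reward values, and exploration constant are assumptions, and expansion, simulation, and backpropagation are omitted.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: mean reward plus an exploration bonus that shrinks with visits."""
    if visits == 0:
        return math.inf  # unvisited candidates are always tried first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select(children, parent_visits, c=math.sqrt(2)):
    """Pick the child with the highest UCB1 score.

    children: dict mapping a transformation name to (total_reward, visits).
    """
    return max(children,
               key=lambda name: ucb1(children[name][0], children[name][1],
                                     parent_visits, c))


# Hypothetical proposals with accumulated speedup-based rewards.
children = {"tile": (3.6, 4), "unroll": (0.5, 1), "vectorize": (0.0, 0)}
print(select(children, parent_visits=5))

visited = {"tile": (3.6, 4), "unroll": (0.5, 1)}
print(select(visited, parent_visits=5))         # exploration bonus included
print(select(visited, parent_visits=5, c=0.0))  # pure exploitation
```

Setting `c` to zero reduces selection to the best observed mean reward, while a positive `c` favors rarely tried proposals. This trade-off is what lets a search spend measurements on promising but under-sampled transformations.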
DeepSeek's WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting
Authors: James Luther, Donald Brown
First: 2025-12-10T15:54:18+00:00 · Latest: 2025-12-12T17:25:30+00:00
Abstract
Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede's VSM13 international surveys to understand the cultural alignment of the following models: DeepSeek-V3, V3.1, GPT-4, GPT-4.1, GPT-4o, and GPT-5. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model's alignment to reflect a specific country, to align these LLMs with the United States and China. Our results show that DeepSeek-V3, V3.1, and OpenAI's GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.
中文标题/摘要
标题:DeepSeek的WEIRD行为:大型语言模型的文化对齐以及提示语言和文化提示的影响
文化是人与人之间互动的核心组成部分,对我们的感知和互动方式起着至关重要的作用。大型语言模型(LLMs)在生成人类语言文本方面效果的提升,极大地增加了人机互动的数量。随着这一领域的增长,这些类人代理的文化对齐成为了一个重要的研究领域。我们的研究使用霍夫斯泰德的VSM13国际调查来理解以下模型的文化对齐情况:DeepSeek-V3、V3.1、GPT-4、GPT-4.1、GPT-4o和GPT-5。我们使用提示语言和文化提示的策略,通过系统提示来调整模型的对齐,使其反映特定国家的文化,以使这些LLMs与美国和中国对齐。结果显示,DeepSeek-V3、V3.1和OpenAI的GPT-5与美国的调查响应表现出紧密的对齐,即使使用文化提示或改变提示语言,也无法实现与中国较强的或温和的对齐。我们还发现,当用英语提示时,GPT-4更接近中国的对齐,但文化提示策略有效果地将这种对齐调整得更接近美国。其他低成本模型GPT-4o和GPT-4.1会根据使用的提示语言(即英语或简体中文)和文化提示策略来创建与美国和中国都可接受的对齐。
Summary / 总结
This study investigates the cultural alignment of Large Language Models (LLMs) using Hofstede's VSM13 international surveys. The research employs a combination of prompt language and cultural prompting to align models with the United States and China. Key findings indicate that DeepSeek-V3, V3.1, and GPT-5 closely align with U.S. survey responses but fail to achieve strong or soft alignment with China, even with cultural prompts. GPT-4 shows a closer alignment with China when prompted in English, but cultural prompting can shift its alignment towards the U.S. GPT-4o and GPT-4.1 respond to prompt language and cultural prompting strategies to create acceptable alignments with both the U.S. and China.
本研究使用Hofstede的VSM13国际调查来研究大型语言模型的文化对齐情况。研究采用组合提示语言和文化提示的方法,将模型与美国和中国对齐。研究发现,DeepSeek-V3、V3.1和GPT-5与美国调查响应高度一致,但在使用文化提示或改变提示语言时未能与中国实现强烈或软性对齐。GPT-4在用英语提示时与中国的对齐更接近,但文化提示可以将其对齐转向美国。GPT-4o和GPT-4.1对提示语言和文化提示策略做出响应,能够与美国和中国都达到可接受的对齐水平。
SOF: Sorted Opacity Fields for Fast Unbounded Surface Reconstruction
Authors: Lukas Radl, Felix Windisch, Thomas Deixelberger, Jozef Hladky, Michael Steiner, Dieter Schmalstieg, Markus Steinberger
Venue: SIGGRAPH Asia 2025
First: 2025-06-23T21:20:52+00:00 · Latest: 2025-12-12T17:12:11+00:00
Comments: SIGGRAPH Asia 2025; Project Page: https://r4dl.github.io/SOF/
Abstract
Recent advances in 3D Gaussian representations have significantly improved the quality and efficiency of image-based scene reconstruction. Their explicit nature facilitates real-time rendering and fast optimization, yet extracting accurate surfaces - particularly in large-scale, unbounded environments - remains a difficult task. Many existing methods rely on approximate depth estimates and global sorting heuristics, which can introduce artifacts and limit the fidelity of the reconstructed mesh. In this paper, we present Sorted Opacity Fields (SOF), a method designed to recover detailed surfaces from 3D Gaussians with both speed and precision. Our approach improves upon prior work by introducing hierarchical resorting and a robust formulation of Gaussian depth, which better aligns with the level-set. To enhance mesh quality, we incorporate a level-set regularizer operating on the opacity field and introduce losses that encourage geometrically-consistent primitive shapes. In addition, we develop a parallelized Marching Tetrahedra algorithm tailored to our opacity formulation, reducing meshing time by up to an order of magnitude. As demonstrated by our quantitative evaluation, SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three. These results mark a step forward in turning efficient Gaussian-based rendering into equally efficient geometry extraction.
中文标题/摘要
标题:SOF:排序透明度字段以实现快速无界表面重建
近年来,3D 高斯表示的进展显著提高了基于图像的场景重建的质量和效率。它们的显式性质便于实时渲染和快速优化,但提取准确的表面——特别是在大规模、无界环境中——仍然是一个难题。许多现有方法依赖于近似的深度估计和全局排序启发式,这可能会引入伪影并限制重建网格的保真度。在本文中,我们提出了排序透明度字段(SOF),这是一种旨在从3D高斯中恢复详细表面的方法,兼具速度和精度。我们的方法通过引入分层重新排序和高斯深度的稳健公式改进了先前的工作,这更好地与水平集对齐。为了提高网格质量,我们在透明度字段上引入了水平集正则化,并引入了鼓励几何一致的原始形状的损失。此外,我们开发了一种针对我们透明度公式进行并行化的Marching Tetrahedra算法,将网格生成时间减少了十倍。正如我们的定量评估所显示的,SOF在提高重建精度的同时将总处理时间减少了三倍以上。这些结果标志着将高效的高斯渲染转化为同样高效的几何提取迈出了一步。
Summary / 总结
The research aims to improve the accuracy and efficiency of surface reconstruction from 3D Gaussian representations, especially in large-scale, unbounded environments. The Sorted Opacity Fields (SOF) method introduces hierarchical resorting and a robust Gaussian depth formulation to better align with the level-set, and includes a level-set regularizer and geometrically-consistent losses to enhance mesh quality. A parallelized Marching Tetrahedra algorithm reduces meshing time by up to an order of magnitude. Experimental results show that SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three.
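SOF is a method for recovering surfaces from 3D Gaussians quickly and precisely, addressing the challenge of extracting detailed surfaces in large-scale environments. It improves accuracy through hierarchical resorting and a robust Gaussian depth formulation, and adds a level-set regularizer and dedicated losses to improve mesh quality. A parallelized Marching Tetrahedra algorithm substantially reduces meshing time. Quantitative evaluation shows that SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three.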
From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines
Authors: Titaya Mairittha, Tanakon Sawanglok, Panuwit Raden, Jirapast Buntub, Thanapat Warunee, Napat Asawachaisuvikrom, Thanaphum Saiwongin
First: 2025-12-12T17:05:11+00:00 · Latest: 2025-12-12T17:05:11+00:00
Comments: 6 pages, 1 figure
Abstract
While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.
中文标题/摘要
标题:从信号到转变:模块化语音到语音管道中的互动摩擦
尽管基于语音的AI系统在生成能力方面取得了显著进展,但它们的互动往往在对话上显得不连贯。本文探讨了模块化语音到语音检索增强生成(S2S-RAG)管道中出现的互动摩擦。通过分析一个代表性的生产系统,我们超越了简单的延迟指标,识别出三种反复出现的对话中断模式:(1)时间错位,其中系统延迟违反了用户对对话节奏的期望;(2)表达扁平化,其中失去的副语言线索导致字面且不适当的回应;(3)修复僵化,其中架构控制阻止用户在实时纠正错误。通过系统级分析,我们表明这些摩擦点不应被视为缺陷或失败,而是模块化设计结构上的后果,该设计优先考虑控制而非流畅性。我们得出结论,构建自然语音AI是一项基础设施设计挑战,需要从优化孤立组件转向精心编排它们之间的接缝。
Summary / 总结
This paper investigates the conversational breakdowns in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines, focusing on three patterns: Temporal Misalignment, Expressive Flattening, and Repair Rigidity. By analyzing a production system, the authors identify these as structural consequences of a modular design that prioritizes control over fluidity, suggesting that building natural spoken AI requires a shift in infrastructure design from optimizing isolated components to carefully choreographing the interactions between them.
本文研究了模块化Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG)管道中的对话断裂问题,识别出三种模式:时间错位、表达单调和修复僵化。研究超越了延迟指标,表明这些问题是由模块化设计优先控制而非流畅性所导致的结构性后果。研究结果表明,构建自然语音AI需要从优化单个组件转向精心协调它们之间的互动。
Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
Authors: Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Yining Yang, Ben Maurer, Wenlin Chen, David Recordon, Yilun Du, Minlan Yu, Ying Zhang
First: 2025-12-11T08:05:58+00:00 · Latest: 2025-12-12T16:59:12+00:00
Comments: Meta requires more thorough internal review process to ensure paper quality and experiments as well as compliance with the internal research publishing process
Abstract
Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.
中文标题/摘要
标题:孔夫子代码代理:工业规模的开源AI软件工程师
现实中的AI软件工程需要能够对大规模代码库进行推理、在长时间会话内外保持持久记忆,并在测试时稳健地协调复杂工具链的编码代理。现有的开源编码代理提供了透明性,但在推向工业规模的工作负载时经常表现不佳,而专有的编码代理则提供了强大的实际性能,但受限于扩展性、可解释性和可控性。我们介绍了孔夫子代码代理(CCA),这是一种可以在工业规模上运行的开源AI软件工程师。CCA基于孔夫子SDK构建,这是一个围绕代理体验(AX)、用户体验(UX)和开发体验(DX)三个互补视角设计的开源代理开发平台。SDK引入了一个统一的协调器,具有分层工作记忆,用于长上下文推理,一个持久的笔记系统,用于跨会话持续学习,以及一个模块化的扩展模块,用于稳健地使用工具。此外,一个元代理通过构建-测试-改进循环自动化编码代理配置的合成、评估和优化,从而实现快速开发新任务、环境和工具堆栈上的编码代理。通过这些机制在孔夫子SDK上实现,CCA在实际软件工程任务上表现出色。在SWE-Bench-Pro上,CCA实现了54.3%的最先进的Resolve@1性能,显著优于之前的编码代理。孔夫子SDK和CCA共同提供了一个透明、可扩展和可重复的基础架构,用于AI代理,填补了研究原型与生产级系统之间的差距,并支持工业规模的代理开发和部署。
Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation
Authors: Yan Zhang, Han Zou, Lincong Feng, Cong Xie, Ruiqi Yu, Zhenpeng Zhan
First: 2025-12-12T16:57:46+00:00 · Latest: 2025-12-12T16:57:46+00:00
Abstract
Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io
中文标题/摘要
标题:将音乐驱动的2D舞蹈姿态生成重新构想为多通道图像生成
近期的姿势到视频模型可以将2D姿态序列转化为具有保真度的、身份保持的舞蹈视频,因此关键挑战是从音乐中生成时间上连贯、节奏对齐的2D姿态,尤其是在复杂、高变异性的真实世界分布下。我们通过将音乐到舞蹈生成重新构想为音乐标记条件下的多通道图像合成问题来解决这一问题:2D姿态序列被编码为一热图像,通过预训练的图像VAE压缩,并使用DiT风格的骨干模型进行建模,使我们能够继承现代文本到图像模型的架构和训练进步,更好地捕捉2D姿态的高变异性分布。在此基础上,我们引入了(i)一种时间共享的时间索引方案,明确同步音乐标记和姿态潜变量随时间的变化,以及(ii)一种参考姿态条件策略,该策略保留了特定主体的身体比例和屏幕上的比例,同时允许长时段的片段和缝合生成。在大型真实世界2D舞蹈语料库和校准的AIST++2D基准测试上进行的实验显示,在姿态和视频空间度量以及人类偏好方面,该方法相对于代表性音乐到舞蹈方法的一致改进,并且消融实验验证了表示、时间索引和参考条件的贡献。请参见补充视频:https://hot-dance.github.io
Summary / 总结
This paper addresses the challenge of generating temporally coherent and rhythm-aligned 2D dance poses from music, using a reframed multi-channel image synthesis approach. The method encodes 2D pose sequences as one-hot images, compresses them with a pretrained image VAE, and models them with a DiT-style backbone. It introduces a time-shared temporal indexing scheme and a reference-pose conditioning strategy to better capture high-variance 2D pose distributions and preserve subject-specific body proportions. Experiments show consistent improvements over existing methods in pose- and video-space metrics and human preference, validating the contributions of the proposed techniques.
该论文通过重新定义为多通道图像合成问题,解决了从音乐生成时间连贯且节奏对齐的2D舞蹈姿态的挑战。方法将2D姿态序列编码为one-hot图像,通过预训练的图像VAE压缩,并使用DiT风格的骨干进行建模。引入了时间共享的时间索引方案和参考姿态条件策略,以更好地捕捉高变异性2D姿态分布并保持主体特定的身体比例。实验结果显示,在姿态和视频空间度量以及人类偏好方面,该方法比现有方法有持续改进,验证了所提技术的贡献。
Referring Change Detection in Remote Sensing Imagery
Authors: Yilmaz Korkmaz, Jay N. Paranjape, Celso M. de Melo, Vishal M. Patel
Venue: WACV
First: 2025-12-12T16:57:12+00:00 · Latest: 2025-12-12T16:57:12+00:00
Comments: 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Abstract
Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only the pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. Project website is here: https://yilmazkorkmaz1.github.io/RCD.
中文标题/摘要
标题:遥感图像中的变化检测
遥感图像的变化检测对于城市规划、环境监测和灾害管理等应用至关重要。传统的变化检测方法通常会在两个时间点的图像之间识别所有变化,但不区分变化类型,这可能导致结果不符合特定用户的需求。虽然语义变化检测方法试图通过将变化分类为预定义类别来解决这一问题,但这些方法依赖于固定的类别定义和模型架构,使得难以混合具有不同标签集的数据集或在不同任务中重用模型,因为输出通道与类别数量和类型紧密耦合。为克服这些限制,我们引入了引用变化检测(RCD),该方法利用自然语言提示来检测遥感图像中的特定类别变化。通过结合语言理解和视觉分析,我们的方法允许用户指定他们感兴趣的精确变化类型。然而,由于标注数据的有限可用性和现有数据集中类别不平衡严重,训练RCD模型具有挑战性。为解决这一问题,我们提出了一种两阶段框架,包括(I)RCDNet,一种用于引用变化检测的跨模态融合网络,以及(II)RCDGen,一种基于扩散的合成数据生成管道,该管道仅使用预变化图像生成指定类别的现实后变化图像和变化图,而不依赖于语义分割掩码,从而显著降低了大规模数据创建的门槛。在多个数据集上的实验表明,我们的框架能够实现可扩展和针对性的变化检测。项目网站在此:https://yilmazkorkmaz1.github.io/RCD/
Summary / 总结
The paper addresses the challenge of detecting specific types of changes in remote sensing imagery by introducing Referring Change Detection (RCD), which uses natural language prompts to identify desired changes. The method employs a two-stage framework: RCDNet, a cross-modal fusion network, and RCDGen, a synthetic data generation pipeline. Experiments demonstrate that this approach enables scalable and targeted change detection across multiple datasets.
该研究引入了基于自然语言提示的遥感图像变化检测方法(RCD),通过两阶段框架实现:RCDNet 进行跨模态融合,RCDGen 生成指定类别的合成后变化图像和变化图,无需依赖语义分割掩码,从而降低数据创建门槛。实验结果表明,该方法能够实现跨多个数据集的可扩展和针对性变化检测。
Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks
Authors: Sergey Pankratov, Dan Alistarh
First: 2025-12-12T16:54:33+00:00 · Latest: 2025-12-12T16:54:33+00:00
Abstract
Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.
Summary / 总结
This work investigates the fundamental limits of speculative generation in accelerating large language model inference. By comparing the token generation process to branching random walks, the authors derive the first tight lower bounds on the runtime of any deterministic speculative generation algorithm. The key finding is that the expected number of tokens successfully predicted per speculative iteration is bounded by a formula involving the verifier's capacity and output distribution properties. Empirical evaluations on Llama models support these theoretical predictions, demonstrating the practical relevance of the derived bounds.
本文研究了投机生成在加速大型语言模型(LLMs)方面的基本限制,通过建立第一个确定性投机生成算法的紧致下界来实现这一目标。通过将令牌生成过程与分支随机行走进行比较,作者证明了每轮投机预测的预期令牌数量受到特定公式的限制,该公式涉及验证器的能力和输出分布的熵。实验评估在Llama模型上验证了这些理论预测,证明了所推导界限的实际相关性。
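To make the shape of the bound concrete, here is a minimal numeric sketch of its leading term, (mu + mu_2) * log(P) / mu^2. It assumes a natural logarithm and drops the O(1) term, neither of which the abstract specifies; the function name and the input values are hypothetical and chosen only for illustration.

```python
import math


def bound_leading_term(mu: float, mu2: float, capacity: int) -> float:
    """Leading term (mu + mu2) * log(P) / mu**2 of the stated bound.

    Assumptions (not from the abstract): natural log, O(1) term omitted.
    mu       -- expected entropy of the verifier's output distribution
    mu2      -- expected second log-moment
    capacity -- verifier capacity P (draft tokens checked in parallel)
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    return (mu + mu2) * math.log(capacity) / mu ** 2


if __name__ == "__main__":
    # Illustrative values only; they are not measurements from the paper.
    for p in (16, 256, 65536):
        print(p, round(bound_leading_term(mu=1.0, mu2=2.0, capacity=p), 3))
```

Because the term grows with log(P), squaring the verifier capacity only doubles it, which is the sense in which extra parallelism gives diminishing returns under this bound.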
Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection
Authors: Zongxian Yang, Jiayu Qian, Zegao Peng, Haoyu Zhang, Yu-An Huang, KC Tan, Zhi-An Huang
First: 2025-06-11T14:58:38+00:00 · Latest: 2025-12-12T16:49:44+00:00
Abstract
Large reasoning models excel in domains like mathematics where intermediate reasoning is straightforward to verify, but struggle to self-correct in medical fields where evaluating intermediate reasoning is cumbersome and expensive. This verification bottleneck hinders the development of reliable AI reasoners for high-stakes applications. Here we propose Med-REFL, a novel framework that learns fine-grained reflection without human labels or model distillation. Med-REFL introduces a deterministic structural assessment of the reasoning space to automatically generate preference data for reflection. By globally evaluating all explored reasoning paths in a tree-of-thoughts, our method quantifies the value of corrective actions, enabling the automated construction of direct preference optimization pairs. This trains the model to recognize and amend its own reasoning fallacies. Extensive experiments show Med-REFL delivers robust gains across diverse model architectures and medical benchmarks, boosting a general-purpose Llama3.1-8B by +5.82% and the state-of-the-art Huatuo-o1 by +4.13% on the MedQA benchmark. Our Med-REFL-8B achieves state-of-the-art performance among 7-8B models while even competing with models twice its size. Crucially, targeted ablations prove its success generalizes to other domains such as logical reasoning and mitigates the 'fake reflection' phenomenon in LRMs. Ultimately, our framework provides a scalable solution to the verification bottleneck, paving the way for more reliable AI reasoners in high-stakes domains like medicine. Med-REFL has been made publicly available at https://github.com/TianYin123/Med-REFL.
中文标题/摘要
标题:Med-REFL:通过自我纠正的细粒度反思提升医学推理能力
大型推理模型在数学等领域表现出色,因为中间推理易于验证,但在医学领域却难以自我纠正,因为评估中间推理既繁琐又昂贵。这种验证瓶颈阻碍了可靠AI推理器在高风险应用中的发展。为此,我们提出了一种名为Med-REFL的新框架,该框架无需人工标签或模型蒸馏即可学习细粒度的反思。Med-REFL引入了一种确定性的结构评估方法,以自动生成反思偏好数据。通过全局评估思维树中探索的所有推理路径,我们的方法量化了纠正行动的价值,从而能够自动构建直接的偏好优化对。这训练模型识别并修正其自身的推理谬误。广泛实验表明,Med-REFL在多种模型架构和医学基准测试中提供了稳健的改进,将通用Llama3.1-8B的性能提升了5.82%,将最先进的Huatuo-o1在MedQA基准测试中的性能提升了4.13%。我们的Med-REFL-8B在7-8B模型中达到了最先进的性能,甚至与两倍大小的模型竞争。关键的是,有针对性的消融实验表明,其成功可以推广到其他领域,如逻辑推理,并减轻LRMs中的“假反思”现象。最终,我们的框架提供了一种可扩展的解决方案,以克服验证瓶颈,为医学等高风险领域提供更可靠的AI推理器铺平了道路。Med-REFL已在https://github.com/TianYin123/Med-REFL/公开发布。
Summary / 总结
Med-REFL is a framework designed to enhance medical reasoning in AI models by enabling self-correction through fine-grained reflection without human labels or model distillation. It evaluates all reasoning paths in a tree-of-thoughts to generate preference data, which trains the model to recognize and correct its reasoning errors. Extensive experiments show Med-REFL improves performance across various models and medical benchmarks, with a notable boost of +5.82% for a general-purpose Llama3.1-8B and +4.13% for the state-of-the-art Huatuo-o1 on the MedQA benchmark.
Med-REFL 是一种框架,旨在通过精细的反思增强 AI 模型在医学领域的推理能力,使其能够自我纠正。它使用确定性的结构评估来自动生成偏好数据,无需人工标签或模型蒸馏,使模型能够识别并修正其推理错误。广泛的实验表明,Med-REFL 在各种医学基准测试中提高了性能,通用的 Llama3.1-8B 模型提高了 5.82%,而最先进的 Huatuo-o1 模型提高了 4.13%。
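The idea of scoring explored reasoning paths and turning the scores into preference pairs can be sketched as follows. This is a simplified illustration and not the authors' procedure: it assumes the value of a partial reasoning path is the fraction of explored paths through it that end in a correct answer, and it pairs sibling steps whose values differ by a margin. All names (`prefix_values`, `preference_pairs`, `margin`) are hypothetical.

```python
from collections import defaultdict


def prefix_values(paths):
    """Score every reasoning prefix by the share of explored paths
    passing through it that end in a correct answer.

    paths: list of (steps, is_correct), where steps is a tuple of strings.
    """
    total = defaultdict(int)
    good = defaultdict(int)
    for steps, is_correct in paths:
        for i in range(1, len(steps) + 1):
            prefix = tuple(steps[:i])
            total[prefix] += 1
            good[prefix] += int(is_correct)
    return {prefix: good[prefix] / total[prefix] for prefix in total}


def preference_pairs(paths, margin=0.5):
    """Build (chosen, rejected) pairs from sibling steps that share a
    context but differ in value by at least `margin` (an assumed threshold)."""
    values = prefix_values(paths)
    siblings = defaultdict(list)
    for prefix in values:
        siblings[prefix[:-1]].append(prefix)
    pairs = []
    for context, children in siblings.items():
        for better in children:
            for worse in children:
                if values[better] - values[worse] >= margin:
                    pairs.append({
                        "context": context,
                        "chosen": better[-1],
                        "rejected": worse[-1],
                    })
    return pairs


if __name__ == "__main__":
    explored = [
        (("read labs", "suspect anemia"), True),
        (("read labs", "suspect infection"), False),
    ]
    print(preference_pairs(explored))
```

Pairs in this shape could then feed a direct preference optimization trainer; the sketch stops before that step because the abstract does not describe the training setup.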
Text2Graph: Combining Lightweight LLMs and GNNs for Efficient Text Classification in Label-Scarce Scenarios
Authors: João Lucas Luz Lima Sarcinelli, Ricardo Marcondes Marcacini
First: 2025-12-10T20:31:30+00:00 · Latest: 2025-12-12T16:45:54+00:00
Abstract
Large Language Models (LLMs) have become effective zero-shot classifiers, but their high computational requirements and environmental costs limit their practicality for large-scale annotation in high-performance computing (HPC) environments. To support more sustainable workflows, we present Text2Graph, an open-source Python package that provides a modular implementation of existing text-to-graph classification approaches. The framework enables users to combine LLM-based partial annotation with Graph Neural Network (GNN) label propagation in a flexible manner, making it straightforward to swap components such as feature extractors, edge construction methods, and sampling strategies. We benchmark Text2Graph on a zero-shot setting using five datasets spanning topic classification and sentiment analysis tasks, comparing multiple variants against other zero-shot approaches for text classification. In addition to reporting performance, we provide detailed estimates of energy consumption and carbon emissions, showing that graph-based propagation achieves competitive results at a fraction of the energy and environmental cost.
中文标题/摘要
标题:Text2Graph:结合轻量级LLM和GNN的高效文本分类方法
大型语言模型(LLMs)已成为有效的零样本分类器,但其高计算需求和环境成本限制了其在高性能计算(HPC)环境中的大规模注释实用性。为了支持更可持续的工作流程,我们提出了Text2Graph,这是一个开源的Python包,提供了现有文本到图分类方法的模块化实现。该框架允许用户以灵活的方式结合基于LLM的部分注释与图神经网络(GNN)标签传播,使得可以方便地更换特征提取器、边构建方法和采样策略等组件。我们在五个涵盖主题分类和情感分析任务的数据集上对Text2Graph进行了零样本设置下的基准测试,将多种变体与其他文本分类的零样本方法进行了比较。除了报告性能外,我们还提供了详细的能耗和碳排放估算,表明基于图的传播在极低的能耗和环境成本下实现了具有竞争力的结果。
Summary / 总结
The paper introduces Text2Graph, an open-source Python package that combines lightweight large language models (LLMs) and Graph Neural Networks (GNNs) for efficient text classification in label-scarce scenarios. It enables users to flexibly integrate LLM-based partial annotation with GNN label propagation, allowing for easy swapping of components such as feature extractors, edge construction methods, and sampling strategies. The framework was benchmarked in a zero-shot setting on five datasets for topic classification and sentiment analysis, showing competitive performance at a fraction of the energy consumption and carbon emissions of other zero-shot approaches.
该论文介绍了Text2Graph,一个开源Python包,结合了轻量级的语言模型(LLMs)和图神经网络(GNNs),以高效地处理标签稀缺场景下的文本分类任务。该框架允许用户灵活地将LLM的部分注解与GNN标签传播相结合,并且可以轻松更换组件。该框架在五个数据集上进行了基准测试,涵盖了主题分类和情感分析任务,展示了与其它零样本方法相比,具有竞争力的性能同时显著降低了能源消耗和碳排放。
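The "annotate a few nodes, then propagate" workflow can be illustrated with a small sketch. It does not use the Text2Graph package or its API, and it swaps the GNN for plain majority-vote label propagation so that it runs with the standard library alone. The graph, the seed labels (standing in for LLM annotations), and the function name are all hypothetical.

```python
def propagate_labels(edges, seed_labels, n_nodes, n_iters=10):
    """Synchronous majority-vote label propagation on an undirected graph.

    edges       -- list of (u, v) node-index pairs
    seed_labels -- {node: label} for the few nodes annotated up front
                   (standing in for LLM-produced labels); these stay fixed
    Returns {node: label} for every node reachable from a seed.
    """
    neighbors = {i: [] for i in range(n_nodes)}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    labels = dict(seed_labels)
    for _ in range(n_iters):
        updates = {}
        for node in range(n_nodes):
            if node in seed_labels:
                continue  # seed annotations are clamped
            votes = {}
            for nb in neighbors[node]:
                if nb in labels:
                    votes[labels[nb]] = votes.get(labels[nb], 0) + 1
            if votes:
                # Ties are broken alphabetically to keep the result deterministic.
                updates[node] = max(sorted(votes), key=votes.get)
        if all(labels.get(n) == lab for n, lab in updates.items()):
            break  # converged: no label changed in this round
        labels.update(updates)
    return labels


if __name__ == "__main__":
    # Two chains of documents, each with one annotated end.
    edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
    seeds = {0: "sports", 5: "politics"}
    print(propagate_labels(edges, seeds, n_nodes=6))
```

Only two of the six nodes need an expensive annotation here; the rest inherit labels through the graph, which is the source of the energy savings the abstract reports.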
Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence
Authors: Sophia Hager, David Mueller, Kevin Duh, Nicholas Andrews
First: 2025-03-18T21:29:29+00:00 · Latest: 2025-12-12T16:31:27+00:00
Abstract
As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model's confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model's confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We find that our method yields verbalized confidences that correlate well with observed error rates, even when compared to strong baselines, some of which are more than twenty times slower at inference time. Additionally, we demonstrate that our method can be applied to black-box models that allow API-based fine-tuning, resulting in estimates of uncertainty that are both more effective and more efficient than any of our baselines.
中文标题/摘要
标题:不确定性蒸馏:训练语言模型表达语义置信度
随着大型语言模型(LLMs)在事实问答中的应用越来越广泛,LLMs 具有传达其答案正确性的可能性变得越来越重要。为了使这些关于不确定性的口头表达有意义,它们应该反映在表达的置信水平下的错误率。然而,当被要求表达置信度时,当前 LLMs 的错误率与其传达的置信度不一致,这突显了需要不确定性量化方法的必要性。许多先前的方法计算词汇不确定性,估计模型对其生成的具体字符串的信心。然而,在某些情况下,估计语义不确定性,即模型对其答案的信心,而不考虑其如何口头表达,可能更有用。我们提出了一种简单的程序——不确定性蒸馏,以训练 LLM 口头表达校准的语义置信度。利用保留的数据将初始不确定性估计映射到有意义的概率,我们创建了带有口头化概率注释的示例,用于监督微调。我们发现,我们的方法产生的口头置信度与观察到的错误率相关性良好,即使与强大的基线方法相比也是如此,有些基线方法在推理时间上慢了二十多倍。此外,我们展示了我们的方法可以应用于允许基于 API 微调的黑盒模型,从而产生比任何基线方法都更有效且更高效的不确定性估计。
Summary / 总结
The research aims to improve the ability of large language models (LLMs) to express calibrated semantic confidence in their answers, which matters for factual question answering. The proposed procedure, uncertainty distillation, uses held-out data to map initial uncertainty estimates to meaningful probabilities, then builds examples annotated with verbalized probabilities for supervised fine-tuning. The resulting verbalized confidences correlate well with observed error rates, even against strong baselines, some of which are more than twenty times slower at inference. The method also applies to black-box models that allow API-based fine-tuning, where it is both more effective and more efficient than the baselines.
研究旨在提高大型语言模型(LLMs)在回答事实性问题时表达语义置信度的能力。方法是通过使用保留数据将初始不确定性估计映射到有意义的概率,训练LLMs以表达校准后的语义置信度。这使得表达的置信度与观察到的错误率高度相关,优于强大的基线模型,在效果和效率上都更胜一筹。此外,该方法还可以应用于允许基于API微调的黑盒模型,提高其在实际应用中的实用性。
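The two steps described above, mapping raw uncertainty estimates to probabilities on held-out data and then writing fine-tuning examples with a verbalized confidence, can be sketched as below. The abstract does not say how the mapping is fitted, so this sketch assumes simple histogram binning; the function names, the bin count, and the output template are hypothetical.

```python
from collections import defaultdict


def fit_bin_calibration(scores, correct, n_bins=5):
    """Map each score bin to the accuracy observed in it on held-out data.

    scores  -- initial confidence estimates in [0, 1]
    correct -- booleans, whether each held-out answer was right
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for score, ok in zip(scores, correct):
        b = min(int(score * n_bins), n_bins - 1)
        totals[b] += 1
        hits[b] += int(ok)
    return {b: hits[b] / totals[b] for b in totals}


def calibrated_probability(score, table, n_bins=5):
    """Look up the held-out accuracy for the bin containing `score`.
    Returns None when the bin was never observed."""
    b = min(int(score * n_bins), n_bins - 1)
    return table.get(b)


def make_sft_example(question, answer, score, table, n_bins=5):
    """Build one supervised fine-tuning example whose target text
    states the calibrated probability in words."""
    p = calibrated_probability(score, table, n_bins)
    if p is None:
        return None
    return {
        "prompt": question,
        "completion": f"{answer} (confidence: {round(p * 100)}%)",
    }


if __name__ == "__main__":
    table = fit_bin_calibration([0.9, 0.95, 0.85, 0.1], [True, True, False, False])
    print(make_sft_example("Capital of France?", "Paris", 0.92, table))
```

In this toy run the model's raw score of 0.92 is replaced by the two-in-three accuracy actually observed for that bin, which is the kind of correction the fine-tuning targets are meant to teach.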
Integrating Ontologies with Large Language Models for Enhanced Control Systems in Chemical Engineering
Authors: Crystal Su, Kuai Yu, Jingrui Zhang, Mingyuan Shao, Daniel Bauer
First: 2025-10-30T18:04:20+00:00 · Latest: 2025-12-12T16:14:17+00:00
Comments: This paper is withdrawn due to issues with attribution and citation accuracy
Abstract
This work presents an ontology-integrated large language model (LLM) framework for chemical engineering that unites structured domain knowledge with generative reasoning. The proposed pipeline aligns model training and inference with the COPE ontology through a sequence of data acquisition, semantic preprocessing, information extraction, and ontology mapping steps, producing templated question-answer pairs that guide fine-tuning. A control-focused decoding stage and citation gate enforce syntactic and factual grounding by constraining outputs to ontology-linked terms, while evaluation metrics quantify both linguistic quality and ontological accuracy. Feedback and future extensions, including semantic retrieval and iterative validation, further enhance the system's interpretability and reliability. This integration of symbolic structure and neural generation provides a transparent, auditable approach for applying LLMs to process control, safety analysis, and other critical engineering contexts.
中文标题/摘要
标题:将本体与大型语言模型集成以增强化工领域的控制系统
本文提出了一种将本体集成到大型语言模型(LLM)框架中的方法,用于化工领域,将结构化的领域知识与生成性推理相结合。所提出的流水线通过数据获取、语义预处理、信息提取和本体映射等一系列步骤,将模型训练和推理与COPE本体对齐,生成模板化的问答对,指导微调。控制导向的解码阶段和引文门控通过限制输出到本体链接的术语来强化语法和事实基础,而评估指标则量化了语言质量和本体准确性。反馈和未来扩展,包括语义检索和迭代验证,进一步增强了系统的可解释性和可靠性。这种符号结构与神经生成的集成提供了一种透明且可审计的方法,将LLM应用于过程控制、安全分析和其他关键工程领域。
Summary / 总结
This work introduces an ontology-integrated large language model framework for chemical engineering, which combines structured domain knowledge with generative reasoning. The pipeline involves data acquisition, semantic preprocessing, information extraction, and ontology mapping to generate question-answer pairs that guide model fine-tuning. The decoding stage and citation gate ensure syntactic and factual grounding, while evaluation metrics assess both linguistic quality and ontological accuracy. Future extensions aim to improve interpretability and reliability through semantic retrieval and iterative validation.
该研究提出了一种结合结构化领域知识和生成推理的大型语言模型框架,用于化学工程。该管道包括数据获取、语义预处理、信息提取和本体映射,以生成用于微调的问题-答案对。解码阶段和引文门控确保语法和事实的接地,而评估指标则衡量语言质量和本体准确性。未来扩展包括语义检索和迭代验证,以提高可解释性和可靠性。这种集成增强了LLM在过程控制和安全分析中的应用,提供了一种透明和可审计的方法。
MedRule-KG: A Knowledge-Graph--Steered Scaffold for Reliable Mathematical and Biomedical Reasoning
Authors: Crystal Su
First: 2025-11-17T04:42:52+00:00 · Latest: 2025-12-12T16:08:56+00:00
Comments: This paper is withdrawn due to issues with attribution and citation accuracy
Abstract
We study how to impose domain-consistent structure on large language models (LLMs) used for scientific reasoning and early-stage drug discovery. We present MedRule-KG, a compact knowledge-graph scaffold paired with a lightweight verifier that steers generation toward mathematically and biomedically valid outputs. The system injects curated symbolic facts into prompts and then enforces rule satisfaction with a deterministic checker. We formalize generation as constrained inference, introduce a soft guidance surrogate suitable for decoding, and perform a thorough statistical analysis with uncertainty quantification. Across 90 tasks spanning reaction feasibility, metabolic compatibility, and toxicity screening, MedRule-KG reduces violation counts by 83.2% relative to a strong chain-of-thought baseline while improving exact match. Results remain stable under stratification and scale with dataset size, and the verifier adds negligible latency, making the approach practical for interactive design.
中文标题/摘要
标题:MedRule-KG:一种知识图谱导向的框架,用于可靠地进行数学和生物医学推理
我们研究如何在用于科学推理和药物发现早期阶段的大规模语言模型(LLMs)中施加领域一致的结构。我们提出了MedRule-KG,这是一种紧凑的知识图谱框架,配有一个轻量级验证器,引导生成符合数学和生物医学有效输出的内容。该系统将经过筛选的符号事实注入提示,然后使用确定性检查器强制执行规则满足。我们将生成视为受限推理,引入了适合解码的软指导替代方案,并进行了彻底的统计分析,包括不确定性量化。在涉及反应可行性、代谢兼容性和毒性筛查的90个任务中,MedRule-KG相对于强大的链式思考基线减少了83.2%的违反次数,同时提高了精确匹配率。结果在分层分析中保持稳定,并随着数据集大小的增加而扩展,验证器增加了几乎可以忽略的延迟,使该方法适用于交互式设计。
Summary / 总结
The research aims to enhance the reliability of large language models (LLMs) in scientific reasoning and drug discovery by incorporating domain-specific knowledge. MedRule-KG uses a compact knowledge graph and a lightweight verifier to guide the generation of mathematically and biomedically valid outputs. The system reduces violation counts by 83.2% compared to a strong chain-of-thought baseline while improving exact match accuracy. The approach is stable and scalable, with minimal latency added by the verifier, making it practical for interactive design applications.
研究旨在通过引入领域特定知识来提高大型语言模型(LLMs)在科学推理和药物发现中的可靠性。MedRule-KG 使用紧凑的知识图谱和轻量级验证器来引导生成数学和生物医学上有效的输出。该系统将违反规则的数量减少了 83.2%,同时提高了精确匹配的准确性。该方法在分层分析中保持稳定,并且随着数据集规模的扩大而扩展,验证器的延迟几乎可以忽略不计,使其实用于交互式设计应用。
MedRule-KG: A Knowledge-Graph--Steered Scaffold for Mathematical Reasoning with a Lightweight Verifier
Authors: Crystal Su
First: 2025-10-18T02:39:13+00:00 · Latest: 2025-12-12T16:08:36+00:00
Comments: This paper is withdrawn due to issues with attribution and citation accuracy
Abstract
Large language models (LLMs) often produce fluent reasoning steps while violating simple mathematical or logical constraints. We introduce MedRule-KG, a compact typed knowledge graph coupled with a symbolic verifier, designed to enforce mathematically interpretable rules in reasoning tasks. MedRule-KG encodes entities, relations, and three domain-inspired rules, while the verifier checks predictions and applies minimal corrections to guarantee consistency. On a 90-example FDA-derived benchmark, grounding in MedRule-KG improves exact match (EM) from 0.767 to 0.900, and adding the verifier yields 1.000 EM while eliminating rule violations entirely. We demonstrate how MedRule-KG provides a general scaffold for safe mathematical reasoning, discuss ablations, and release code and data to encourage reproducibility.
中文标题/摘要
标题:MedRule-KG:一种由知识图谱引导的轻量级验证器支撑结构,用于数学推理
大型语言模型(LLMs)通常会产生流畅的推理步骤,但违反了简单的数学或逻辑约束。我们引入了MedRule-KG,这是一种紧凑的类型化知识图谱,结合了一个符号验证器,旨在在推理任务中强制执行可解释的数学规则。MedRule-KG 编码实体、关系和三个领域启发式规则,而验证器检查预测并应用最小的修正以确保一致性。在由FDA衍生的90个示例基准测试中,基于MedRule-KG 的准确匹配率(EM)从0.767提高到0.900,添加验证器后EM达到1.000,同时完全消除了规则违反。我们展示了MedRule-KG 如何提供一个通用的框架以确保数学推理的安全性,讨论了消融实验,并发布了代码和数据以促进可重复性。
Summary / 总结
MedRule-KG is a knowledge graph-based system that includes a symbolic verifier to ensure mathematical reasoning is accurate and consistent. It improves exact match scores from 0.767 to 0.900 on a benchmark and achieves 1.000 exact match with the verifier, eliminating rule violations. The system provides a general framework for safe mathematical reasoning and includes code and data for reproducibility.
MedRule-KG 是一个基于知识图谱的系统,结合了符号验证器以确保推理任务中的数学一致性。它在基准测试中的精确匹配分数从 0.767 提高到 0.900,并且通过验证器达到 1.000 的精确匹配,同时完全消除规则违规。该系统提供了一种安全数学推理的一般框架,并提供了代码和数据以促进可重复性。
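As a quick arithmetic check on the reported scores, the sketch below converts each exact-match (EM) value into a count of correct answers on the 90-example benchmark. It assumes EM is the plain fraction of examples answered exactly right, which the abstract implies but does not state; the helper name is hypothetical.

```python
def exact_match_count(em_score: float, n_examples: int) -> int:
    """Number of exactly matched examples implied by an EM score,
    assuming EM = correct / n_examples."""
    return round(em_score * n_examples)


if __name__ == "__main__":
    n = 90  # benchmark size stated in the abstract
    for em in (0.767, 0.900, 1.000):
        k = exact_match_count(em, n)
        print(f"EM {em:.3f} -> {k}/{n} correct ({k / n:.3f})")
```

Under that assumption the three scores correspond to 69, 81, and 90 correct answers, so grounding in the knowledge graph accounts for 12 additional examples and the verifier for the remaining 9.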
Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection
Authors: Qiushi Guo
First: 2025-12-12T16:02:42+00:00 · Latest: 2025-12-12T16:02:42+00:00
Abstract
Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding-window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
中文标题/摘要
标题:深度复制粘贴:多模态和深度感知合成以提高鲁棒性面部检测
数据增强对于提高面部检测系统的鲁棒性至关重要,尤其是在遮挡、光照变化和复杂环境等挑战性条件下。传统的复制粘贴增强通常会产生不现实的合成图像,因为前景提取不准确、场景几何不一致和背景语义不匹配。为了解决这些限制,我们提出了一种多模态和深度感知增强框架——深度复制粘贴,通过复制全身人体实例并将其粘贴到语义兼容的场景中,生成多样且物理上一致的面部检测训练样本。我们的方法首先使用BLIP和CLIP联合评估语义和视觉一致性,从而实现自动检索最适合给定前景人体的背景图像。为了保留面部细节并确保高质量的前景掩码,我们结合了SAM3进行精确分割,并使用Depth-Anything仅提取未被遮挡的可见人体区域,防止在增强中使用损坏的面部纹理。为了实现几何现实感,我们引入了一种基于深度的滑动窗口放置机制,在背景深度图中搜索最佳的粘贴位置,以实现最佳的深度连续性和比例对齐。结果合成图像表现出自然的深度关系和增强的视觉合理性。大量实验表明,深度复制粘贴提供了比传统复制粘贴和无深度增强方法更多样和现实的训练数据,从而在下游面部检测任务中取得了显著的性能提升。
Summary / 总结
The paper aims to enhance the robustness of face detection systems by addressing the limitations of traditional data augmentation methods. It introduces Depth Copy Paste, a multimodal and depth-aware framework that generates realistic composites by copying full-body person instances and pasting them into semantically compatible scenes. Key findings show that this method provides more diverse and realistic training data, resulting in significant performance improvements in face detection tasks compared to traditional and depth-free augmentation methods.
研究旨在通过解决传统复制粘贴数据增强的局限性,增强在挑战性条件下的人脸检测系统的鲁棒性。提出的深度复制粘贴框架使用多模态和深度感知技术,通过复制全身人体实例并粘贴到语义兼容的场景中生成逼真的合成图像。关键发现包括在人脸检测任务中性能的显著提升,与传统方法相比,生成了更多多样性和现实感更强的训练数据。
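The depth-guided sliding window placement mentioned in the abstract above can be illustrated with a small sketch. The abstract does not give the scoring rule, so the cost used here (depth spread inside the window plus the gap between the window's mean depth and the foreground's depth) is an assumption, and the names `find_paste_location`, `bg_depth`, and `fg_depth` are hypothetical.

```python
from statistics import mean, pstdev


def find_paste_location(bg_depth, fg_h, fg_w, fg_depth, stride=1):
    """Slide an fg_h x fg_w window over a background depth map.

    Returns (row, col, cost) for the lowest-cost window, or None if the
    foreground does not fit. The cost is an ASSUMED stand-in for
    "depth continuity and scale alignment", not the paper's formula.
    """
    rows, cols = len(bg_depth), len(bg_depth[0])
    best = None
    for r in range(0, rows - fg_h + 1, stride):
        for c in range(0, cols - fg_w + 1, stride):
            window = [bg_depth[i][j]
                      for i in range(r, r + fg_h)
                      for j in range(c, c + fg_w)]
            continuity = pstdev(window)             # flat depth => low cost
            alignment = abs(mean(window) - fg_depth)  # depth close to the person's
            cost = continuity + alignment
            if best is None or cost < best[2]:
                best = (r, c, cost)
    return best


# Toy 6x6 background: far wall at depth 10.0, a near region at depth 2.0.
background = [[10.0] * 6 for _ in range(6)]
for i in range(3, 6):
    for j in range(3, 6):
        background[i][j] = 2.0

location = find_paste_location(background, fg_h=3, fg_w=3, fg_depth=2.0)
```

In this toy map the only window that is both flat and at the foreground's depth is the near region, so the search settles there. A real pipeline would operate on depth maps predicted by a depth model and would also rescale the pasted person.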
MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition
Authors: Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl
First: 2025-12-12T16:01:48+00:00 · Latest: 2025-12-12T16:01:48+00:00
Comments: 7 pages, 3 figures
Abstract
Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance interacts with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.
中文标题/摘要
标题:MedAI:在NeurIPS CURE-Bench竞赛中评估TxAgent的治疗代理推理
临床医学中的治疗决策构成一个高风险领域,在此领域中,AI指导与患者特征、疾病过程和药物剂型之间的复杂相互作用进行互动。诸如药物推荐、治疗规划和不良反应预测等任务需要基于可靠的生物医学知识的稳健、多步推理。代理AI方法,如TxAgent,通过迭代检索增强生成(RAG)来应对这些挑战。TxAgent 使用微调后的 Llama-3.1-8B 模型,动态生成并执行对统一生物医学工具套件(ToolUniverse)的功能调用,整合FDA药物API、OpenTargets和Monarch资源,以确保获取当前的治疗信息。与通用RAG系统不同,医疗应用施加了严格的安全部署约束,因此推理轨迹的准确性和工具调用序列的准确性至关重要。这些考虑促使评估协议将标记级推理和工具使用行为视为明确的监督信号。本研究介绍了我们参加CURE-Bench NeurIPS 2025挑战所获得的见解,该挑战使用评估治疗推理系统的指标来评估正确性、工具使用和推理质量。我们分析了功能(工具)调用检索质量对整体模型性能的影响,并展示了通过改进工具检索策略所实现的性能提升。我们的工作获得了开放科学卓越奖。更多信息请参见https://curebench.ai/。
Summary / 总结
This study evaluates TxAgent's therapeutic reasoning in the NeurIPS 2025 CURE-Bench competition. TxAgent uses a fine-tuned Llama-3.1-8B model that generates and executes function calls to a unified biomedical tool suite (ToolUniverse), which integrates FDA Drug API, OpenTargets, and Monarch resources. The analysis shows that retrieval quality for function (tool) calls influences overall model performance, and that improved tool-retrieval strategies yield performance gains. The work received the Excellence Award in Open Science.
该研究评估了TxAgent在NeurIPS CURE-Bench竞赛中的治疗性代理推理能力。TxAgent使用一个微调后的Llama-3.1-8B模型生成和执行功能调用,整合了FDA药物API、OpenTargets和Monarch资源。研究重点在于推理和工具使用的准确性,表明改进的功能检索策略可以提升整体模型性能。该工作获得了开放科学卓越奖。
Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing
Authors: Xu Zhang, Jiabin Fang, Zhuoming Ding, Jin Yuan, Xuan Liu, Qianjun Zhang, Zhiyong Li
First: 2025-12-12T15:59:49+00:00 · Latest: 2025-12-12T15:59:49+00:00
Comments: 12 pages, 5 figures
Abstract
Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.
中文标题/摘要
标题:遥感多模态图像理解中的跨模态上下文感知学习用于视觉提示引导
图像理解的最新进展使方法能够利用大型语言模型进行遥感多模态推理。然而,现有的方法在仅提供简单的通用文本提示时,仍然难以引导模型关注用户相关区域。此外,在大规模航空图像中,许多对象具有高度相似的视觉外观,并携带丰富的对象间关系,这进一步增加了准确识别的复杂性。为了解决这些挑战,我们提出了跨模态上下文感知学习用于视觉提示引导的多模态图像理解(CLV-Net)。CLV-Net 允许用户提供一个简单的视觉提示,一个边界框,以指示感兴趣的区域,并使用该提示引导模型生成与用户意图一致的相关分割掩码和描述。我们设计的核心是上下文感知掩码解码器,它建模并整合对象间关系以增强目标表示并提高掩码质量。此外,我们引入了语义和关系对齐模块:跨模态语义一致性损失增强了视觉上相似目标之间的细粒度区分,而关系一致性损失强制文本关系与视觉交互之间的对齐。在两个基准数据集上的全面实验表明,CLV-Net 超过了现有方法并建立了新的最先进的结果。该模型有效地捕捉了用户意图并产生了精确、意图一致的多模态输出。
Summary / 总结
The paper addresses the difficulty of steering remote sensing models toward user-relevant regions when only simple, generic text prompts are available. It introduces CLV-Net, which takes a bounding box as a visual cue and generates correlated segmentation masks and captions for the indicated region. Key components are a Context-Aware Mask Decoder that models inter-object relationships and a Semantic and Relationship Alignment module with two losses: a Cross-modal Semantic Consistency Loss and a Relationship Consistency Loss. Experiments on two benchmark datasets show that CLV-Net outperforms existing methods and sets new state-of-the-art results.
研究旨在通过解决仅用简单文本提示引导模型的问题,提高遥感中的多模态图像理解。CLV-Net 提出了上下文感知掩码解码器和语义与关系对齐模块来提升模型性能。实验表明,CLV-Net 在现有方法中表现更优,达到了新的最佳水平,有效捕捉用户意图并生成精确的多模态输出。
The Emergence of Complex Behavior in Large-Scale Ecological Environments
Authors: Joseph Bejjani, Chase Van Amburg, Chengrui Wang, Chloe Huangyuan Su, Sarah M. Pratt, Yasin Mazloumi, Naeem Khoshnevis, Sham M. Kakade, Kianté Brantley, Aaron Walsman
First: 2025-10-21T02:03:25+00:00 · Latest: 2025-12-12T15:48:59+00:00
Comments: 33 pages, 23 figures, 12 tables, experiment code available at https://github.com/jbejjani2022/ecological-emergent-behavior
Abstract
We explore how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments. In our setting, agents are unsupervised and have no explicit rewards or learning objectives but instead evolve over time according to reproduction, mutation, and selection. As they act, agents also shape their environment and the population around them in an ongoing dynamic ecology. Our goal is not to optimize a single high-performance policy, but instead to examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures. We use modern hardware along with a new multi-agent simulator to scale the environment and population to sizes much larger than previously attempted, reaching populations of over 60,000 agents, each with their own evolved neural network policy. We identify various emergent behaviors such as long-range resource extraction, vision-based foraging, and predation that arise under competitive and survival pressures. We examine how sensing modalities and environmental scale affect the emergence of these behaviors and find that some of them appear only in sufficiently large environments and populations, and that larger scales increase the stability and consistency of these emergent behaviors. While there is a rich history of research in evolutionary settings, our scaling results on modern hardware provide promising new directions to explore ecology as an instrument of machine learning in an era of increasingly abundant computational resources and efficient machine frameworks. Experimental code is available at https://github.com/jbejjani2022/ecological-emergent-behavior.
中文标题/摘要
标题:大型生态环境中复杂行为的涌现
我们探讨了物理尺度和种群规模如何塑造开放生态环境中复杂行为的涌现。在我们的设定中,代理是未监督的,没有明确的奖励或学习目标,而是随着时间的推移通过繁殖、突变和选择而进化。随着代理的行动,它们也在不断动态的生态中塑造其环境和周围的人口。我们的目标不是优化单一的高性能策略,而是研究由于自然竞争和环境压力,复杂行为如何在大规模种群中涌现和进化。我们使用现代硬件和新的多代理模拟器来扩展环境和种群规模,达到超过60,000个代理,每个代理都有自己的进化神经网络策略。我们识别出各种涌现行为,如长距离资源提取、基于视觉的觅食和捕食,这些行为在竞争和生存压力下出现。我们研究了感知模态和环境规模如何影响这些行为的涌现,并发现其中一些行为仅在足够大的环境中和种群中出现,而更大的规模增加了这些涌现行为的稳定性和一致性。尽管在进化设置中已有丰富的研究历史,但现代硬件上的扩展结果为将生态学作为机器学习工具提供了新的探索方向,在计算资源日益丰富和高效机器框架的时代。实验代码可在https://github.com/jbejjani2022/ecological-emergent-behavior获取。
Summary / 总结
The study investigates how physical scale and population size influence the emergence of complex behaviors in open-ended ecological environments, where agents have no explicit rewards and evolve through reproduction, mutation, and selection. Using a new multi-agent simulator on modern hardware, the researchers scale populations to over 60,000 agents, each with its own evolved neural network policy, and observe behaviors such as long-range resource extraction, vision-based foraging, and predation. Some of these behaviors appear only in sufficiently large environments and populations, and larger scales make them more stable and consistent.
研究通过无监督的代理在时间上的进化(通过繁殖、突变和选择),探讨物理规模和种群大小如何影响生态环境中复杂行为的出现。使用现代硬件上的新型多代理模拟器,研究达到了超过60,000个代理的规模,每个代理都有自己的进化神经网络策略。关键发现包括长距离资源提取、基于视觉的觅食和捕食等行为的出现,这些行为在更大的环境和种群中更为稳定和一致。研究强调了感知模态和环境规模在这些行为发展中扮演的角色,提出了在计算资源丰富和高效机器框架时代,使用生态学作为机器学习工具的新方向。
Bridging Streaming Continual Learning via In-Context Large Tabular Models
Authors: Afonso Lourenço, João Gama, Eric P. Xing, Goreti Marreiros
Venue: AAAI
First: 2025-12-12T15:47:26+00:00 · Latest: 2025-12-12T15:47:26+00:00
Comments: Streaming Continual Learning AAAI Bridge 2026
Abstract
In streaming scenarios, models must learn continuously, adapting to concept drifts without erasing previously acquired knowledge. However, existing research communities address these challenges in isolation. Continual Learning (CL) focuses on long-term retention and mitigating catastrophic forgetting, often without strict real-time constraints. Stream Learning (SL) emphasizes rapid, efficient adaptation to high-frequency data streams, but typically neglects forgetting. Recent efforts have tried to combine these paradigms, yet no clear algorithmic overlap exists. We argue that large in-context tabular models (LTMs) provide a natural bridge for Streaming Continual Learning (SCL). In our view, unbounded streams should be summarized on-the-fly into compact sketches that can be consumed by LTMs. This recovers the classical SL motivation of compressing massive streams with fixed-size guarantees, while simultaneously aligning with the experience-replay desiderata of CL. To clarify this bridge, we show how the SL and CL communities implicitly adopt a divide-to-conquer strategy to manage the tension between plasticity (performing well on the current distribution) and stability (retaining past knowledge), while also imposing a minimal complexity constraint that motivates diversification (avoiding redundancy in what is stored) and retrieval (re-prioritizing past information when needed). Within this perspective, we propose structuring SCL with LTMs around two core principles of data selection for in-context learning: (1) distribution matching, which balances plasticity and stability, and (2) distribution compression, which controls memory size through diversification and retrieval mechanisms.
中文标题/摘要
标题:通过大型表格模型实现流式连续学习的桥梁
在流式场景中,模型必须持续学习,适应概念漂移而不抹去之前获得的知识。然而,现有的研究社区在解决这些挑战时是孤立的。连续学习(CL)侧重于长期保留并缓解灾难性遗忘,通常没有严格的实时约束。流式学习(SL)强调快速、高效地适应高频数据流,但通常忽视遗忘。最近的努力试图将这些范式结合起来,但没有明确的算法重叠。我们认为,大型上下文中的表格模型(LTMs)为流式连续学习(SCL)提供了一种自然的桥梁。在我们的观点中,无界的流应该实时总结为紧凑的草图,可以被LTMs消费。这恢复了经典SL动机中的压缩大规模流的固定大小保证,同时与CL的经验回放需求保持一致。为了阐明这种桥梁,我们展示了SL和CL社区如何隐式采用一种分而治之的策略来管理塑性(在当前分布上表现良好)和稳定性(保留过去知识)之间的张力,同时施加一个最小的复杂性约束,这激励了多样化(避免存储中的冗余)和检索(在需要时重新优先考虑过去的信息)。从这个角度来看,我们建议用LTMs围绕数据选择的两个核心原则来结构化SCL:(1)分布匹配,平衡塑性和稳定性;(2)分布压缩,通过多样化和检索机制控制内存大小。
Summary / 总结
The paper argues that large in-context tabular models (LTMs) provide a natural bridge between Continual Learning (CL) and Stream Learning (SL) for Streaming Continual Learning (SCL). The proposal is to summarize unbounded data streams on the fly into compact sketches that LTMs can consume, combining the fixed-size compression of SL with the experience-replay goals of CL. The authors propose structuring SCL around two data-selection principles: distribution matching, which balances plasticity and stability, and distribution compression, which controls memory size through diversification and retrieval mechanisms.
论文主张利用大型上下文表格模型(LTMs)弥合连续学习(CL)与流式学习(SL)之间的差距,以应对流式连续学习(SCL)的挑战。作者提出将无界数据流实时压缩为可供LTMs使用的紧凑草图,同时满足SL和CL的目标。文章围绕两个数据选择原则来组织SCL:分布匹配用于平衡可塑性与稳定性,分布压缩则通过多样化和检索机制控制内存大小。
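The idea of summarizing an unbounded stream into a fixed-size sketch can be illustrated with reservoir sampling. This is a classic stream-learning technique swapped in for illustration; the abstract does not say which sketching method the authors have in mind, and the class name `ReservoirSketch` is hypothetical.

```python
import random


class ReservoirSketch:
    """Fixed-size uniform sample of a stream (reservoir sampling, Algorithm R).

    Illustrative stand-in for the "compact sketches" in the abstract;
    not the authors' method.
    """

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.items = []   # the bounded memory that a tabular model could consume
        self.seen = 0     # number of stream rows observed so far
        self._rng = random.Random(seed)

    def update(self, row):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Still filling the reservoir: keep every row.
            self.items.append(row)
        else:
            # Keep the new row with probability capacity / seen,
            # evicting a uniformly chosen stored row.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = row


sketch = ReservoirSketch(capacity=5, seed=0)
for value in range(1000):
    sketch.update(value)
```

Memory stays at `capacity` rows however long the stream runs, which is the fixed-size guarantee the abstract refers to. Uniform sampling ignores drift, so a practical design would likely bias selection toward the distribution-matching and diversification goals the paper describes.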
From Verification Burden to Trusted Collaboration: Design Goals for LLM-Assisted Literature Reviews
Authors: Brenda Nogueira, Werner Geyer, Andrew Anderson, Toby Jia-Jun Li, Dongwhi Kim, Nuno Moniz, Nitesh V. Chawla
First: 2025-12-12T15:38:34+00:00 · Latest: 2025-12-12T15:38:34+00:00
Abstract
Large Language Models (LLMs) are increasingly embedded in academic writing practices. Although numerous studies have explored how researchers employ these tools for scientific writing, their concrete implementation, limitations, and design challenges within the literature review process remain underexplored. In this paper, we report a user study with researchers across multiple disciplines to characterize current practices, benefits, and pain points in using LLMs to investigate related work. We identified three recurring gaps: (i) lack of trust in outputs, (ii) persistent verification burden, and (iii) requiring multiple tools. This motivates our proposal of six design goals and a high-level framework that operationalizes them through improved related papers visualization, verification at every step, and human-feedback alignment with generation-guided explanations. Overall, by grounding our work in the practical, day-to-day needs of researchers, we designed a framework that addresses these limitations and models real-world LLM-assisted writing, advancing trust through verifiable actions and fostering practical collaboration between researchers and AI systems.
中文标题/摘要
标题:从验证负担到信任合作:LLM辅助文献综述的设计目标
大型语言模型(LLMs)越来越多地嵌入到学术写作实践中。尽管已有许多研究探讨了研究人员如何使用这些工具进行科学写作,但它们在文献综述过程中的具体实施、局限性和设计挑战仍较少被研究。在本文中,我们报告了一项跨学科研究人员的用户研究,以描述使用LLMs进行相关工作研究的当前实践、益处和痛点。我们确定了三个反复出现的缺口:(i) 对输出缺乏信任,(ii) 持续的验证负担,(iii) 需要多种工具。这促使我们提出六项设计目标和一个高层次框架,通过改进相关论文可视化、每一步验证和人类反馈与生成引导解释的对齐来实现它们。总体而言,通过将我们的工作基于研究人员的实际、日常需求,我们设计了一个框架来解决这些局限性,并通过可验证的行为促进信任,促进研究人员和AI系统之间的实际合作。
Summary / 总结
This paper examines how researchers use Large Language Models (LLMs) in literature reviews. A user study with researchers across multiple disciplines identified three recurring gaps: lack of trust in outputs, a persistent verification burden, and the need for multiple tools. In response, the authors propose six design goals and a high-level framework that improves visualization of related papers, verifies output at every step, and aligns human feedback with generation-guided explanations, with the aim of fostering practical collaboration between researchers and AI systems.
本文探讨了在学术文献综述中使用大型语言模型(LLMs)的挑战和益处。通过一项用户研究,作者发现了三个主要问题:对LLM输出缺乏信任、持续的验证负担以及需要多种工具。为了解决这些问题,他们提出了六个设计目标和一个框架,该框架通过增强相关论文的可视化、在每一步确保验证以及将人类反馈与生成解释对齐来提高信任度,从而促进研究人员与AI系统的实际合作。
Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation
Authors: Luca Cazzola, Ahed Alboody
First: 2025-12-12T15:32:28+00:00 · Latest: 2025-12-12T15:32:28+00:00
Abstract
The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers a +23.1% accuracy points improvement. Animated illustrations and supplementary materials are available at (https://lucazzola.github.io/publications/kinemic).
中文标题/摘要
标题:运动语境下的动能采矿:基于文本到运动蒸馏的少样本动作合成
大型注释运动数据集的获取成本仍然是基于骨架的人体活动识别(HAR)的关键瓶颈。尽管文本到运动(T2M)生成模型提供了具有吸引力且可扩展的合成数据来源,但它们的训练目标强调一般艺术运动,而数据集结构与HAR对精确的、类区分的动作的要求根本不同。这种差异造成了显著的领域差距,使得通用的T2M模型无法生成适合HAR分类器的动作。为了解决这一挑战,我们提出了KineMIC(动能采矿在上下文中的应用),一种少样本动作合成的迁移学习框架。KineMIC通过假设文本编码空间中的语义对应可以为运动学蒸馏提供软监督,将T2M扩散模型适应到HAR领域。我们通过一种动能采矿策略,利用CLIP文本嵌入来建立稀疏HAR标签与T2M源数据之间的对应关系,从而指导微调,将通用的T2M主干转化为专门的少样本动作到运动生成器。我们使用HumanML3D作为源T2M数据集,NTU RGB+D 120的部分作为目标HAR领域,随机选择每个动作类别的10个样本。我们的方法生成了更加连贯的动作,提供了一个稳健的数据增强来源,提高了23.1%的准确率。动画示例和补充材料可在(https://lucazzola.github.io/publications/kinemic)获取。
Summary / 总结
The paper addresses the challenge of generating kinematically precise, class-discriminative motions for skeletal-based Human Activity Recognition (HAR) classifiers with Text-to-Motion (T2M) models. It proposes KineMIC, a transfer learning framework that adapts a generalist T2M diffusion model to the HAR domain by using CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. With HumanML3D as the source dataset and only 10 samples per action class from a subset of NTU RGB+D 120 as the target, KineMIC generates more coherent motions and, used as data augmentation, improves accuracy by 23.1 accuracy points.
论文提出了一种名为KineMIC的迁移学习框架,用于少量样本的动作合成,以解决人体活动识别(HAR)中大规模标注动作数据集的获取难题。KineMIC通过CLIP文本嵌入来建立语义对应关系,指导模型微调,生成更连贯的动作,相比通用的T2M模型,HAR准确率提高了23.1%。
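The step of matching sparse HAR labels to T2M source data through text embeddings can be sketched as a nearest-neighbor search by cosine similarity. The abstract does not state the matching rule, so the rule here is an assumption; the three-dimensional vectors are made-up stand-ins for real CLIP text embeddings, and `mine_correspondences` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_correspondences(label_embs, caption_embs, top_k=1):
    """For each HAR label, return the top_k most similar source captions.

    ASSUMPTION: correspondences are nearest neighbors in embedding space.
    """
    matches = {}
    for label, label_vec in label_embs.items():
        ranked = sorted(caption_embs,
                        key=lambda cap: cosine(label_vec, caption_embs[cap]),
                        reverse=True)
        matches[label] = ranked[:top_k]
    return matches


# Toy embeddings (NOT real CLIP outputs), chosen only to make the example run.
labels = {
    "drink water": [1.0, 0.0, 0.2],
    "kick": [0.0, 1.0, 0.1],
}
captions = {
    "a person lifts a cup to the mouth": [0.9, 0.1, 0.3],
    "a person kicks with the right leg": [0.1, 0.95, 0.0],
    "a person walks forward": [0.3, 0.3, 0.9],
}

matches = mine_correspondences(labels, captions)
```

In a real setting the vectors would come from a CLIP text encoder, and the retrieved source motions would then guide fine-tuning as the abstract describes.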
An effective control of large systems of active particles: An application to evacuation problem
Authors: Albina Klepach, Egor E. Nuzhin, Alexey A. Tsukanov, Nikolay V. Brilliantov
First: 2025-09-24T10:27:45+00:00 · Latest: 2025-12-12T14:51:16+00:00
Abstract
Manipulation of large systems of active particles is a serious challenge across diverse domains, including crowd management, control of robotic swarms, and coordinated material transport. The development of advanced control strategies for complex scenarios is hindered, however, by the lack of scalability and robustness of the existing methods, in particular, due to the need of an individual control for each agent. One possible solution involves controlling a system through a leader or a group of leaders, which other agents tend to follow. Using such an approach we develop an effective control strategy for a leader, combining reinforcement learning (RL) with artificial forces acting on the system. To describe the guidance of active particles by a leader we introduce the generalized Vicsek model. This novel method is then applied to the problem of the effective evacuation by a robot-rescuer (leader) of large groups of people from hazardous places. We demonstrate, that while a straightforward application of RL yields suboptimal results, even for advanced architectures, our approach provides a robust and efficient evacuation strategy. The source code supporting this study is publicly available at: https://github.com/cinemere/evacuation.
中文标题/摘要
标题:大型活性粒子系统的有效控制:以疏散问题为例
对大型活性粒子系统的操控在多个领域都是一项严峻的挑战,包括人群管理、机器人群的控制以及协调物质运输等。然而,由于现有方法缺乏可扩展性和鲁棒性,特别是在需要为每个代理单独控制的情况下,开发适用于复杂场景的高级控制策略受到了阻碍。一种可能的解决方案是通过领导者或一组领导者来控制系统,其他代理倾向于跟随领导者。使用这种方法,我们开发了一种结合强化学习(RL)和作用于系统的虚拟力的有效控制策略。为了描述领导者对活性粒子的引导,我们引入了广义维谢克模型。然后,我们将这种方法应用于机器人救援者(领导者)有效疏散大量人群的问题。我们证明,即使对于先进的架构,直接应用RL也会导致次优结果,而我们的方法则提供了一种稳健且高效的疏散策略。支持本研究的源代码可在以下网址获取:https://github.com/cinemere/evacuation.
Summary / 总结
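The paper addresses the challenge of controlling large systems of active particles, such as crowds or robotic swarms, without controlling each agent individually. It develops a control strategy for a leader that combines reinforcement learning (RL) with artificial forces acting on the system, and describes how the leader guides the particles with a generalized Vicsek model. The method is applied to evacuation, where a robot-rescuer (leader) guides large groups of people out of hazardous places. While a straightforward application of RL yields suboptimal results, even with advanced architectures, the proposed approach provides a robust and efficient evacuation strategy. The source code is publicly available at https://github.com/cinemere/evacuation.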
论文旨在通过开发可扩展的控制策略来解决大型活性粒子系统(如人群或机器人集群)的控制难题。它引入了一种结合强化学习和人工力的方法,并将其应用于救援机器人(领导者)引导人们从危险区域安全撤离的问题。研究显示,虽然简单的强化学习方法效果不佳,但提出的策略能够提供一种稳健且高效的撤离方案。
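For readers unfamiliar with the Vicsek model named in the abstract above, here is a minimal sketch of one update step of the standard model: each particle adopts the average heading of its neighbors within a radius, perturbed by noise, and moves at constant speed. The paper's generalized version with a leader is not specified in the abstract and is not reproduced here; the function name and parameter defaults are hypothetical, and the usual periodic boundary is omitted for brevity.

```python
import math
import random


def vicsek_step(positions, headings, speed=0.05, radius=1.0, noise=0.0, rng=random):
    """One update of the standard 2D Vicsek model (no leader, no boundaries)."""
    new_headings = []
    for xi, yi in positions:
        sum_cos = sum_sin = 0.0
        for (xj, yj), theta in zip(positions, headings):
            # Neighbors include the particle itself (distance 0).
            if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                sum_cos += math.cos(theta)
                sum_sin += math.sin(theta)
        # Circular mean of neighbor headings, plus uniform angular noise.
        mean_heading = math.atan2(sum_sin, sum_cos)
        new_headings.append(mean_heading + rng.uniform(-noise / 2, noise / 2))
    new_positions = [
        (x + speed * math.cos(theta), y + speed * math.sin(theta))
        for (x, y), theta in zip(positions, new_headings)
    ]
    return new_positions, new_headings


# Two nearby particles heading east and north align to north-east.
pos, head = vicsek_step([(0.0, 0.0), (0.5, 0.0)], [0.0, math.pi / 2],
                        speed=0.1, radius=1.0, noise=0.0)
```

A leader could be added in such a scheme by giving one particle an externally controlled heading that its neighbors average over, but how the paper couples the leader and the artificial forces to the followers is not stated in the abstract.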