arXiv 论文速递

Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Authors: Yang Fei, George Stoica, Jingyuan Liu, Qifeng Chen, Ranjay Krishna, Xiaojuan Wang, Benlin Liu

First: 2025-12-12T18:56:35+00:00 · Latest: 2025-12-12T18:56:35+00:00

Comments: Project Website: https://sam2videox.github.io/

Abstract

Reality is a dance between rigid constraints and deformable structures. For video models, that means generating motion that preserves fidelity as well as structure. Despite progress in diffusion models, producing realistic structure-preserving motion remains challenging, especially for articulated and deformable objects such as humans and animals. Scaling training data alone, so far, has failed to resolve physically implausible transitions. Existing approaches rely on conditioning with noisy motion representations, such as optical flow or skeletons extracted using an external imperfect model. To address these challenges, we introduce an algorithm to distill structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). With our method, we train SAM2VideoX, which contains two innovations: (1) a bidirectional feature fusion module that extracts global structure-preserving motion priors from a recurrent model like SAM2; (2) a Local Gram Flow loss that aligns how local features move together. Experiments on VBench and in human studies show that SAM2VideoX delivers consistent gains (+2.60% on VBench, 21-22% lower FVD, and 71.4% human preference) over prior baselines. Specifically, on VBench, we achieve 95.51%, surpassing REPA (92.91%) by 2.60%, and reduce FVD to 360.57, a 21.20% and 22.46% improvement over REPA- and LoRA-finetuning, respectively. The project website can be found at https://sam2videox.github.io/.

中文标题/摘要

标题：从跟踪中推导结构：提炼保留结构的运动以生成视频

现实是刚性约束与可变形结构之间的舞蹈。对于视频模型来说，这意味着生成既保持保真度又保留结构的运动。尽管扩散模型取得了进展，但生成真实且保留结构的运动仍然具有挑战性，尤其是对于人类和动物等具有关节、可变形的物体。迄今为止，仅靠扩大训练数据尚未解决物理上不合理的过渡问题。现有方法依赖于以带噪声的运动表示作为条件，例如光流或由不完美的外部模型提取的骨架。为了解决这些挑战，我们提出了一种算法，将自回归视频跟踪模型（SAM2）中的结构保持运动先验蒸馏到双向视频扩散模型（CogVideoX）中。通过我们的方法，我们训练了SAM2VideoX，其中包含两项创新：（1）双向特征融合模块，从SAM2这类循环模型中提取全局结构保持运动先验；（2）局部格拉姆流（Local Gram Flow）损失，用于对齐局部特征共同运动的方式。VBench实验和人类研究表明，SAM2VideoX相较于先前基线取得了一致的提升（VBench提升2.60%，FVD降低21-22%，人类偏好率71.4%）。具体而言，在VBench上我们达到95.51%，超越REPA（92.91%）2.60%，并将FVD降低到360.57，相较于REPA微调和LoRA微调分别改善了21.20%和22.46%。项目网站可访问 https://sam2videox.github.io/ 。

Summary / 总结

This research aims to generate realistic, structure-preserving motion in videos, addressing the physically implausible transitions that current models produce for articulated and deformable objects. The method distills structure-preserving motion priors from an autoregressive video tracking model (SAM2) into a bidirectional video diffusion model (CogVideoX). Experiments show that SAM2VideoX outperforms prior baselines on VBench and in human studies, with gains in VBench score, FVD, and human preference. Specifically, SAM2VideoX achieves a VBench score of 95.51%, surpassing REPA (92.91%) by 2.60 percentage points, and reduces FVD to 360.57, which is 21.20% and 22.46% lower than REPA- and LoRA-finetuning, respectively.

研究旨在生成真实且保持结构的视频运动，解决现有扩散模型和带噪声运动表示的局限性。方法是将自回归视频跟踪模型（SAM2）中的结构保持运动先验蒸馏到双向视频扩散模型（CogVideoX）中。实验表明，SAM2VideoX 在 VBench 和人类偏好方面优于先前基线：VBench 提升 2.60%，FVD 比 REPA 微调和 LoRA 微调低 21-22%。

Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

Authors: Etienne Boursier, Claire Boyer

First: 2025-12-12T18:54:52+00:00 · Latest: 2025-12-12T18:54:52+00:00

Abstract

Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we lean on the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in large prompt regimes.

中文标题/摘要

标题：Softmax作为大提示下的线性注意力：基于测度的观点

Softmax注意力是Transformer架构中的核心组件，但其非线性结构给理论分析带来了重大挑战。我们开发了一个统一的基于测度的框架，用于研究有限和无限提示下的单层softmax注意力。对于独立同分布的高斯输入，我们利用这样一个事实：softmax算子在无限提示极限下收敛于作用在底层输入词元测度上的线性算子。基于这一洞察，我们建立了softmax注意力输出和梯度的非渐近集中界，量化了有限提示模型以多快的速度接近其无限提示对应物，并证明在具有次高斯词元的一般上下文学习设置中，这种集中性在整个训练轨迹上保持稳定。在上下文线性回归的情况下，我们利用易于处理的无限提示动力学来分析有限提示长度下的训练。我们的结果使得针对线性注意力开发的优化分析在提示足够长时可以直接迁移到softmax注意力，表明大提示下的softmax注意力继承了其线性对应物的分析结构。这反过来为研究softmax注意力层在大提示情形下的训练动力学和统计行为提供了一个有原则且广泛适用的工具箱。

Summary / 总结

This paper develops a measure-based framework to study softmax attention in transformers, particularly focusing on its behavior in the large-prompt regime. By leveraging the convergence of softmax to a linear operator in the infinite-prompt limit, the authors establish concentration bounds for the output and gradients, showing that finite-prompt models approach their infinite-prompt counterparts rapidly and stably during training. These results facilitate the application of linear attention optimization analyses to softmax attention when prompts are sufficiently long, providing insights into the training dynamics and statistical behavior of softmax layers in large prompt settings.

论文开发了一种基于测度的方法来研究Transformer中的softmax注意力在大提示长度下的行为。通过利用softmax在无限提示极限下收敛到线性算子的事实，作者建立了非渐近集中界，并证明这种集中性在整个训练过程中保持稳定。关键发现是：当提示足够长时，针对线性注意力的优化分析可以直接迁移到softmax注意力，从而提供了一个工具箱来理解大提示长度下softmax注意力层的训练动力学和统计行为。
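The large-prompt limit described in this entry can be checked numerically in a toy case. The sketch below is an illustration under stated assumptions, not the paper's construction: keys and values are both the raw tokens, the weight matrices are the identity, no 1/sqrt(d) scaling is applied, and tokens are i.i.d. standard Gaussian. Under those assumptions the infinite-prompt output for a query q is the mean of a Gaussian tilted by exp(q·x), which is q itself, so the limit is linear in the query. The function names are hypothetical.

```python
import math
import random

def softmax_attention(query, tokens):
    """Single-query softmax attention with keys = values = tokens (identity weights assumed)."""
    scores = [sum(q * x for q, x in zip(query, tok)) for tok in tokens]
    top = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(query)
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) / total for d in range(dim)]

def max_gap_to_linear_limit(query, n_tokens, rng):
    """Largest coordinate gap between the finite-prompt output and the limit, which equals `query`."""
    tokens = [[rng.gauss(0.0, 1.0) for _ in query] for _ in range(n_tokens)]
    out = softmax_attention(query, tokens)
    return max(abs(o - q) for o, q in zip(out, query))

rng = random.Random(0)
query = [0.5, -0.3]
for n in (10, 1_000, 20_000):
    # The gap is random, but it should shrink on average as the prompt grows.
    print(n, round(max_gap_to_linear_limit(query, n, rng), 4))
```

The printed gaps depend on the random seed; only the trend toward zero with growing prompt length is the point of the example.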

Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously

Authors: Andrew Adiletta, Kathryn Adiletta, Kemal Derya, Berk Sunar

First: 2025-12-12T18:52:09+00:00 · Latest: 2025-12-12T18:52:09+00:00

Comments: 13 pages, 5 Figures

Abstract

The rapid deployment of Large Language Models (LLMs) has created an urgent need for enhanced security and privacy measures in Machine Learning (ML). LLMs are increasingly being used to process untrusted text inputs and even generate executable code, often while having access to sensitive system controls. To address these security concerns, several companies have introduced guard models, which are smaller, specialized models designed to protect text generation models from adversarial or malicious inputs. In this work, we advance the study of adversarial inputs by introducing Super Suffixes, suffixes capable of overriding multiple alignment objectives across various models with different tokenization schemes. We demonstrate their effectiveness, along with our joint optimization technique, by successfully bypassing the protection mechanisms of Llama Prompt Guard 2 on five different text generation models for malicious text and code generation. To the best of our knowledge, this is the first work to reveal that Llama Prompt Guard 2 can be compromised through joint optimization. Additionally, by analyzing the changing similarity of a model's internal state to specific concept directions during token sequence processing, we propose an effective and lightweight method to detect Super Suffix attacks. We show that the cosine similarity between the residual stream and certain concept directions serves as a distinctive fingerprint of model intent. Our proposed countermeasure, DeltaGuard, significantly improves the detection of malicious prompts generated through Super Suffixes. It increases the non-benign classification rate to nearly 100%, making DeltaGuard a valuable addition to the guard model stack and enhancing robustness against adversarial prompt attacks.

中文标题/摘要

标题：超级后缀：同时绕过文本生成对齐和防护模型

大型语言模型（LLMs）的快速部署迫切需要在机器学习（ML）中增强安全和隐私措施。LLMs 越来越多地被用于处理不可信的文本输入，甚至生成可执行代码，同时拥有访问敏感系统控制的权限。为应对这些安全问题，多家公司引入了防护模型，这是一种较小的、专门设计的模型，旨在保护文本生成模型免受恶意或敌对输入的影响。在本文中，我们通过引入超级后缀推进了对抗输入的研究，超级后缀能够在不同分词方案的多种模型中同时覆盖多个对齐目标。我们通过成功绕过 Llama Prompt Guard 2 对五种不同文本生成模型的恶意文本和代码生成保护机制，展示了其有效性以及我们的联合优化技术。据我们所知，这是首次工作揭示 Llama Prompt Guard 2 可通过联合优化被攻破。此外，通过分析模型在处理标记序列过程中内部状态与特定概念方向相似性的变化，我们提出了一种有效且轻量的方法来检测超级后缀攻击。我们表明，残差流与某些概念方向之间的余弦相似度充当了模型意图的独特指纹。我们提出的对策 DeltaGuard 显著提高了对通过超级后缀生成的恶意提示的检测率，使其非良性分类率接近 100%，使 DeltaGuard 成为防护模型堆栈中的重要补充，增强了对抗敌对提示攻击的鲁棒性。

Summary / 总结

This paper addresses the security concerns of Large Language Models (LLMs) by introducing Super Suffixes, which can bypass the protection mechanisms of Llama Prompt Guard 2 across multiple text generation models. The authors demonstrate the effectiveness of Super Suffixes and their joint optimization technique by successfully bypassing Llama Prompt Guard 2 on five different models. They also propose DeltaGuard, a lightweight countermeasure that uses cosine similarity to detect Super Suffix attacks, significantly improving the detection rate of malicious prompts to nearly 100%.

该论文通过引入Super Suffixes研究了大型语言模型（LLMs）的安全问题，Super Suffixes能够绕过Llama Prompt Guard 2对多个文本生成模型的保护机制。作者通过在五个不同模型上成功绕过Llama Prompt Guard 2，展示了Super Suffixes及其联合优化技术的有效性。他们还提出了一种轻量级的防御措施DeltaGuard，通过计算残差流与特定概念方向之间的余弦相似度来检测Super Suffix攻击，将恶意提示的检测率显著提高至近100%。
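The detection signal described in this entry, the cosine similarity between the residual stream and a concept direction as tokens are processed, can be sketched as follows. This is a minimal illustration, not DeltaGuard itself: the abstract does not say how the similarities are turned into a decision, so the sketch only computes the per-token similarities and their changes. The vectors and function names are hypothetical toy values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_deltas(hidden_states, concept_direction):
    """Per-token similarity to the concept direction, plus token-to-token changes."""
    sims = [cosine_similarity(h, concept_direction) for h in hidden_states]
    deltas = [later - earlier for earlier, later in zip(sims, sims[1:])]
    return sims, deltas

# Toy stand-ins for residual-stream vectors at three token positions.
states = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
direction = [0.0, 1.0]  # hypothetical concept direction
sims, deltas = similarity_deltas(states, direction)
print(sims, deltas)
```

A detector built on this signal would still need a rule or classifier over `sims` and `deltas`; that part is left out because the source does not describe it.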

Agile Flight Emerges from Multi-Agent Competitive Racing

Authors: Vineet Pasumarti, Lorenzo Bianchi, Antonio Loquercio

First: 2025-12-12T18:48:50+00:00 · Latest: 2025-12-12T18:48:50+00:00

Abstract

Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: https://github.com/Jirl-upenn/AgileFlight_MultiAgent

中文标题/摘要

标题：敏捷飞行源自多智能体竞速比赛

通过多智能体竞争和赢得比赛的稀疏高层目标，我们发现，通过强化学习训练的智能体不仅会产生敏捷飞行（例如，高速运动使平台达到物理极限）和策略（例如，超越或阻挡），而且在模拟和真实世界中，这种方法在复杂环境增加时（例如，存在障碍物时）比孤立训练智能体并用规定行为的奖励方法更胜一筹。此外，我们发现，多智能体竞争产生的策略在真实世界中的转移性比单智能体基于进度的奖励方法更可靠，尽管两种方法使用相同的模拟环境、随机化策略和硬件。除了改进的模拟到现实世界的转移性，多智能体策略还表现出一定程度的对未在训练中遇到的对手的泛化能力。总体而言，我们的工作，沿袭了数字领域多智能体竞争游戏的传统，表明稀疏的任务级奖励足以训练出能够在物理世界中执行高级低级控制的智能体。

Summary / 总结

The research aims to explore how multi-agent competition can lead to the emergence of agile flight and strategic behavior in racing scenarios through reinforcement learning. The method involves training agents in a competitive setting with a sparse high-level objective of winning a race, rather than detailed low-level rewards. Key experimental findings show that this approach outperforms single-agent training with progress-based rewards, especially in complex environments with obstacles. Additionally, multi-agent policies exhibit better sim-to-real transfer and some generalization to unseen opponents.

研究通过强化学习探索了在赛车模拟和真实环境中的多智能体竞争如何导致敏捷飞行和策略行为的出现。在竞争性训练设置中的智能体比孤立训练和基于进度奖励的智能体表现更好，尤其是在有障碍的复杂环境中。多智能体方法还展示了比单智能体训练方法更好的仿真到现实世界的转移和对未见过的对手的泛化能力。

Reducing Domain Gap with Diffusion-Based Domain Adaptation for Cell Counting

Authors: Mohammad Dehghanmanshadi, Wallapak Tavanapong

First: 2025-12-12T18:19:41+00:00 · Latest: 2025-12-12T18:19:41+00:00

Comments: Accepted at ICMLA 2025

Abstract

Generating realistic synthetic microscopy images is critical for training deep learning models in label-scarce environments, such as cell counting with many cells per image. However, traditional domain adaptation methods often struggle to bridge the domain gap when synthetic images lack the complex textures and visual patterns of real samples. In this work, we adapt the Inversion-Based Style Transfer (InST) framework originally designed for artistic style transfer to biomedical microscopy images. Our method combines latent-space Adaptive Instance Normalization with stochastic inversion in a diffusion model to transfer the style from real fluorescence microscopy images to synthetic ones, while weakly preserving content structure. We evaluate the effectiveness of our InST-based synthetic dataset for downstream cell counting by pre-training and fine-tuning EfficientNet-B0 models on various data sources, including real data, hard-coded synthetic data, and the public Cell200-s dataset. Models trained with our InST-synthesized images achieve up to 37% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data, and a 52% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95 MAE). Notably, our approach also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Further improvements are achieved when combining InST-synthesized data with lightweight domain adaptation techniques such as DACS with CutMix. These findings demonstrate that InST-based style transfer most effectively reduces the domain gap between synthetic and real microscopy data. Our approach offers a scalable path for enhancing cell counting performance while minimizing manual labeling effort. The source code and resources are publicly available at: https://github.com/MohammadDehghan/InST-Microscopy.

中文标题/摘要

标题：利用基于扩散的领域自适应缩小细胞计数中的领域差距

生成逼真的合成显微镜图像对于在标签稀缺环境中训练深度学习模型至关重要，例如每张图像中有许多细胞的细胞计数。然而，当合成图像缺乏真实样本的复杂纹理和视觉模式时，传统的领域自适应方法往往难以弥合领域差距。在本工作中，我们将最初为艺术风格迁移设计的基于反演的风格迁移（InST）框架应用于生物医学显微镜图像。我们的方法将潜在空间自适应实例归一化与扩散模型中的随机反演相结合，把真实荧光显微镜图像的风格迁移到合成图像上，同时弱保留内容结构。我们在多种数据源（包括真实数据、硬编码的合成数据和公开的Cell200-s数据集）上预训练并微调EfficientNet-B0模型，以评估基于InST的合成数据集在下游细胞计数中的有效性。与使用硬编码合成数据训练的模型相比，使用我们InST合成图像训练的模型的平均绝对误差（MAE）最多降低37%；与使用Cell200-s训练的模型相比，MAE降低52%（从53.70降至25.95）。值得注意的是，我们的方法也优于仅使用真实数据训练的模型（MAE为25.95对27.74）。将InST合成数据与轻量级领域自适应技术（如结合CutMix的DACS）相结合，可以进一步提高性能。这些发现表明，基于InST的风格迁移最有效地缩小了合成与真实显微镜数据之间的领域差距。我们的方法为在尽量减少人工标注工作量的同时提升细胞计数性能提供了一条可扩展的途径。源代码和资源可在以下链接获取：https://github.com/MohammadDehghan/InST-Microscopy。

Summary / 总结

This study addresses the challenge of bridging the domain gap between synthetic and real microscopy images for cell counting tasks. It adapts the Inversion-Based Style Transfer (InST) framework, originally designed for artistic style transfer, to microscopy: latent-space Adaptive Instance Normalization is combined with stochastic inversion in a diffusion model to transfer the style of real images onto synthetic ones while weakly preserving content structure. The method is evaluated by pre-training and fine-tuning EfficientNet-B0 models on various datasets, showing up to 37% lower Mean Absolute Error (MAE) compared to models trained on hard-coded synthetic data and a 52% reduction in MAE compared to models trained on Cell200-s (from 53.70 to 25.95). It also outperforms models trained on real data alone (25.95 vs. 27.74 MAE). Combining InST-synthesized data with lightweight domain adaptation techniques further improves performance. The approach reduces the domain gap and enhances cell counting performance with minimal manual labeling effort.

该研究旨在解决合成和真实显微镜图像之间在细胞计数任务中的领域差距问题。作者将Inversion-Based Style Transfer (InST)框架应用于生物医学显微镜，使用潜空间Adaptive Instance Normalization和扩散模型中的随机反转来将真实图像的风格转移到合成图像中，同时保留内容结构。评估结果显示，使用InST合成图像训练的模型相比使用硬编码合成数据训练的模型可降低高达37%的Mean Absolute Error (MAE)，相比使用Cell200-s训练的模型可降低52%的MAE。该方法还优于仅使用真实数据训练的模型，并且与轻量级领域适应技术结合使用时可进一步提高性能。这项工作提供了一种减少手动标注努力的可扩展解决方案，以提升细胞计数性能。
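Adaptive Instance Normalization, which this entry applies in latent space, re-scales each content channel so that its mean and standard deviation match those of the style channel. The sketch below shows only that standard per-channel formula on plain Python lists. It is not the authors' pipeline: the diffusion model, the stochastic inversion step, and the latent encoder are omitted, and the function names and toy values are hypothetical.

```python
import math

def channel_stats(values, eps=1e-5):
    """Mean and (eps-stabilized) standard deviation of one channel."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var + eps)

def adain(content, style):
    """AdaIN per channel: style_std * (x - content_mean) / content_std + style_mean.

    `content` and `style` are lists of channels, each a flat list of floats.
    """
    out = []
    for c_chan, s_chan in zip(content, style):
        c_mean, c_std = channel_stats(c_chan)
        s_mean, s_std = channel_stats(s_chan)
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_chan])
    return out

# Toy one-channel "latents": synthetic content and real-image style.
content = [[0.0, 1.0, 2.0, 3.0]]
style = [[10.0, 12.0, 14.0, 16.0]]
styled = adain(content, style)
print([round(v, 3) for v in styled[0]])
```

The spatial layout of the content channel is untouched; only its first- and second-order statistics change, which is why the operation transfers appearance while keeping structure.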

SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support

Authors: Yuming Feng, Xinrui Jiang

First: 2025-12-12T18:05:52+00:00 · Latest: 2025-12-12T18:05:52+00:00

Comments: Code available at https://github.com/Harry20030331/SumForU

Abstract

Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.

中文标题/摘要

标题：SUMFORU：基于LLM的个性化购买决策支持评论总结框架

在线产品评论包含丰富的但杂乱的信号，使用户感到困惑并妨碍有效的决策制定。现有的基于LLM的总结器仍然具有通用性，未能考虑个人偏好，限制了其实用价值。我们提出SUMFORU，这是一种可引导的评论总结框架，能够与明确的用户人设对齐，以支持个性化的购买决策。我们的方法结合了从亚马逊2023评论数据集中构建的高质量数据管道，并采用两阶段对齐程序：(1) 通过不对称知识蒸馏进行具有人设意识的监督微调(SFT)，(2) 使用偏好估计器进行强化学习与AI反馈(RLAIF)。我们使用基于规则、基于LLM和基于人类中心的指标对模型进行了评估，展示了在一致性、定位和偏好对齐方面的持续改进。我们的框架在所有评估设置中均表现出最高的性能，并且能够有效泛化到未见过的产品类别。我们的结果突显了可引导的多元对齐在构建下一代个性化决策支持系统方面的潜力。

Summary / 总结

The paper addresses the challenge of overwhelming and noisy online product reviews by proposing SUMFORU, a personalized review summarization framework. It uses a two-stage alignment process involving persona-aware Supervised Fine-Tuning and Reinforcement Learning with AI Feedback to tailor summaries to individual user preferences. The model outperforms existing methods in consistency, grounding, and preference alignment across various evaluation metrics and generalizes well to new product categories.

论文提出SUMFORU，一种使用LLM的个性化评论摘要框架，以应对在线产品评论过多且嘈杂的问题。该框架采用两阶段对齐过程：基于人设的监督微调和基于AI反馈的强化学习，以捕捉用户偏好。实验结果显示，在各种指标上的一致改进，包括一致性、定位和偏好对齐，并且该框架能够很好地泛化到新的产品类别。

MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems

Authors: Barak Or

First: 2025-11-08T21:29:18+00:00 · Latest: 2025-12-12T17:56:26+00:00

Comments: preprint

Abstract

Ensuring cognitive stability in autonomous multi-agent systems (MAS) is a central challenge for large-scale, distributed AI. While existing observability tools monitor system outputs, they cannot quantify how rapidly agentic workflows recover once reasoning coherence has been lost. We adapt classical reliability metrics (Mean Time-to-Recovery (MTTR), Mean Time Between Failures (MTBF), and related ratios) into the cognitive domain, defining MTTR-A (Mean Time-to-Recovery for Agentic Systems) as a runtime measure of cognitive recovery latency. MTTR-A quantifies the time required for a MAS to detect reasoning drift and restore consistent operation, capturing the recovery of reasoning coherence rather than infrastructural repair. A benchmark simulation using the AG News corpus and the LangGraph orchestration framework was conducted, modeling recovery latencies across multiple reflex modes. Automated reflexes restored stability within approximately 6 s on average, while human-approval interventions required about 12 s. Across 200 runs, the median simulated MTTR-A was 6.21 ± 2.14 s, MTBF = 6.7 ± 2.14 s, and NRR = 0.08, demonstrating measurable runtime resilience across reflex strategies. By formalizing recovery latency as a quantifiable property of distributed reasoning, and deriving reliability bounds linking recovery time and cognitive uptime, this work establishes a foundation for runtime dependability in agentic cognition, transforming cognitive recovery from an ad-hoc process into a standardized, interpretable performance

中文标题/摘要

标题：MTTR-A：多智能体系统中认知恢复延迟的度量

确保自主多智能体系统（MAS）的认知稳定性是大规模分布式人工智能中的核心挑战。虽然现有的可观测性工具监控系统输出，但它们无法量化智能体工作流在推理一致性丢失后恢复的速度。我们借鉴经典的可靠性度量（平均恢复时间（MTTR）、平均故障间隔时间（MTBF）及相关比率），将其引入认知领域，定义MTTR-A（智能体系统平均恢复时间）作为运行时的认知恢复延迟度量。MTTR-A量化了MAS检测推理漂移并恢复一致运行所需的时间，捕捉的是推理一致性的恢复而非基础设施的修复。我们使用AG News语料库和LangGraph编排框架进行了基准模拟，对多种反射模式下的恢复延迟进行了建模。自动反射平均在约6秒内恢复稳定性，而人工审批干预则需要约12秒。在200次运行中，模拟的中位数MTTR-A为6.21±2.14秒，MTBF=6.7±2.14秒，NRR=0.08，展示了不同反射策略下可测量的运行时弹性。通过将恢复延迟形式化为分布式推理的可量化属性，并推导出联系恢复时间与认知正常运行时间的可靠性界限，这项工作为智能体认知的运行时可靠性奠定了基础，将认知恢复从一种临时性的过程转变为一种标准化、可解释的性能

Summary / 总结

This paper introduces MTTR-A, a metric for measuring cognitive recovery latency in multi-agent systems (MAS), adapting classical reliability metrics to the cognitive domain. The study uses a benchmark simulation with the AG News corpus and LangGraph to evaluate reflex modes, showing that automated reflexes restored stability within 6 seconds on average, while human-approval interventions took about 12 seconds. The median simulated MTTR-A was 6.21±2.14 seconds, indicating measurable runtime resilience across reflex strategies.

本文提出了MTTR-A，一种用于衡量多代理系统（MAS）认知恢复延迟的指标，将经典可靠性指标适应到认知领域。研究使用AG News语料库和LangGraph进行基准模拟，评估不同反射模式下的恢复时间，结果显示自动反射平均在6秒内恢复稳定性，而人工干预则需要约12秒。200次运行的中位数MTTR-A为6.21±2.14秒，表明反射策略在运行时具有可测量的弹性。
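The classical reliability quantities that this entry adapts can be computed from a log of drift and recovery timestamps. The sketch below uses the textbook definitions: MTTR is the mean recovery duration, MTBF is uptime divided by the number of incidents, and availability is MTBF / (MTBF + MTTR). How the paper instruments drift detection, and how it defines NRR, is not stated in the abstract, so neither appears here. The incident log and function name are hypothetical.

```python
from statistics import mean, median

def recovery_metrics(incidents, observation_window):
    """Classical reliability metrics from (drift_detected_at, recovered_at) pairs in seconds.

    Assumes incidents are non-overlapping and fall inside the observation window.
    """
    recovery_times = [end - start for start, end in incidents]
    downtime = sum(recovery_times)
    uptime = observation_window - downtime
    mttr = mean(recovery_times)
    mtbf = uptime / len(incidents)
    return {
        "mttr_mean": mttr,
        "mttr_median": median(recovery_times),
        "mtbf": mtbf,
        "availability": mtbf / (mtbf + mttr),
    }

# Hypothetical log: two automated recoveries (6 s) and one human-approved one (12 s).
incidents = [(10.0, 16.0), (40.0, 52.0), (80.0, 86.0)]
metrics = recovery_metrics(incidents, observation_window=120.0)
print(metrics)
```

Reporting both the mean and the median is useful here because a few slow human-approval recoveries can pull the mean well above the typical automated case.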

UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI

Authors: Darvin Yi, Teng Liu, Mattie Terzolo, Lance Hasson, Ayan Sinha, Pablo Mendes, Andrew Rabinovich

First: 2025-11-15T17:39:37+00:00 · Latest: 2025-12-12T17:51:50+00:00

Abstract

As large language model (LLM) agents increasingly undertake digital work, reliable frameworks are needed to evaluate their real-world competence, adaptability, and capacity for human collaboration. Existing benchmarks remain largely static, synthetic, or domain-limited, providing limited insight into how agents perform in dynamic, economically meaningful environments. We introduce UpBench, a dynamically evolving benchmark grounded in real jobs drawn from the global Upwork labor marketplace. Each task corresponds to a verified client transaction, anchoring evaluation in genuine work activity and financial outcomes. UpBench employs a rubric-based evaluation framework, in which expert freelancers decompose each job into detailed, verifiable acceptance criteria and assess AI submissions with per-criterion feedback. This structure enables fine-grained analysis of model strengths, weaknesses, and instruction-following fidelity beyond binary pass/fail metrics. Human expertise is integrated throughout the data pipeline (from job curation and rubric construction to evaluation) ensuring fidelity to real professional standards and supporting research on human-AI collaboration. By regularly refreshing tasks to reflect the evolving nature of online work, UpBench provides a scalable, human-centered foundation for evaluating agentic systems in authentic labor-market contexts, offering a path toward a collaborative framework, where AI amplifies human capability through partnership rather than replacement.

中文标题/摘要

标题：UpBench：一种基于真实劳动力市场的动态演变代理基准框架，旨在为以人为本的AI构建

随着大型语言模型（LLM）代理越来越多地承担数字工作，需要可靠的框架来评估其在现实世界中的能力、适应性和与人类协作的能力。现有基准大多保持静态、合成或领域限制，提供的洞察有限，无法反映代理在动态、经济意义上重要的环境中表现如何。我们介绍了UpBench，这是一种基于全球Upwork劳动力市场的动态演变基准，其基础是真实的工作任务。每个任务对应一个经过验证的客户交易，将评估锚定在真实的劳动活动和财务结果上。UpBench采用基于评分的评估框架，其中专家自由职业者将每个任务分解为详细的、可验证的接受标准，并对AI提交内容进行逐项反馈评估。这种结构使我们能够对模型的优势、弱点和指令遵循的准确性进行精细分析，超越了简单的通过/未通过指标。在整个数据管道中（从任务筛选、评分标准构建到评估）整合人类专业知识，确保符合真实的专业标准，并支持人类-AI协作的研究。通过定期更新任务以反映在线工作的演变，UpBench为评估代理系统在真实的劳动力市场环境中的表现提供了可扩展、以人为本的基础，提供了一条通往合作框架的道路，在这种框架中，AI通过伙伴关系而非替代来增强人类能力。

Summary / 总结

UpBench is a dynamically evolving benchmark for evaluating the real-world competence, adaptability, and human collaboration capabilities of large language model agents. It uses tasks from the global Upwork labor marketplace, ensuring that evaluations are grounded in genuine work activity and financial outcomes. The benchmark employs a rubric-based evaluation framework where expert freelancers assess AI submissions with detailed, verifiable criteria, providing a fine-grained analysis beyond simple pass/fail metrics. Key findings include the ability to evaluate model strengths, weaknesses, and instruction-following fidelity in dynamic, economically meaningful environments.

UpBench 是一个动态演化的基准框架，用于评估大型语言模型代理在现实世界中的专业能力、适应性和与人类的合作能力。它使用全球 Upwork 劳动力市场的任务，确保评估基于真实的日常工作活动和财务成果。该基准框架采用基于评分表的评估体系，其中专家自由职业者根据详细的、可验证的标准评估 AI 提交，提供超越简单通过/未通过指标的精细分析。主要发现包括能够评估模型的优势、劣势和指令遵循的准确性，在动态、经济上有意义的环境中。
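Rubric-based evaluation with per-criterion feedback, as described in this entry, maps naturally onto a small record type plus a summary function. The sketch below is one possible data layout, not UpBench's schema: the field names, the example criteria, and the aggregation into a pass fraction are all assumptions made for illustration, since the abstract does not specify how criteria are aggregated.

```python
from dataclasses import dataclass

@dataclass
class CriterionResult:
    criterion: str  # one verifiable acceptance criterion written by an expert
    passed: bool    # reviewer's judgment for this criterion
    feedback: str   # per-criterion comment

def summarize(results):
    """Aggregate per-criterion judgments while keeping the failed criteria visible."""
    passed = sum(1 for r in results if r.passed)
    return {
        "criteria_total": len(results),
        "criteria_passed": passed,
        "pass_fraction": passed / len(results),
        "all_passed": passed == len(results),
        "failed": [r.criterion for r in results if not r.passed],
    }

# Hypothetical review of one AI submission for a logo-design job.
results = [
    CriterionResult("Delivers SVG and PNG files", True, "Both formats present."),
    CriterionResult("Uses the client's brand colors", True, "Palette matches the brief."),
    CriterionResult("Includes a monochrome variant", False, "Variant missing."),
]
summary = summarize(results)
print(summary)
```

Keeping the list of failed criteria alongside the fraction is what gives the finer-grained view that a single pass/fail flag would hide.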

REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving

Authors: Annabelle Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh

Venue: NeurIPS 2025

First: 2025-06-02T07:02:46+00:00 · Latest: 2025-12-12T17:38:28+00:00

Abstract

While model serving has unlocked unprecedented capabilities, the high cost of serving large-scale models continues to be a significant barrier to widespread accessibility and rapid innovation. Compiler optimizations have long driven substantial performance improvements, but existing compilers struggle with neural workloads due to the exponentially large and highly interdependent space of possible transformations. Although existing stochastic search techniques can be effective, they are often sample-inefficient and fail to leverage the structural context underlying compilation decisions. We set out to investigate the research question of whether reasoning with large language models (LLMs), without any retraining, can leverage the context-aware decision space of compiler optimizations to significantly improve sample efficiency. To that end, we introduce a novel compilation framework (dubbed Reasoning Compiler) that formulates optimization as a sequential, context-aware decision process guided by a large language model and structured Monte Carlo tree search (MCTS). The LLM acts as a proposal mechanism, suggesting hardware-informed transformations that reflect the current program state and accumulated performance feedback. MCTS incorporates the LLM-generated proposals to balance exploration and exploitation, facilitating structured, context-sensitive traversal of the expansive compiler optimization space. By achieving substantial speedups with markedly fewer samples than leading neural compilers, our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.

中文标题/摘要

标题：推理编译器（Reasoning Compiler）：大型语言模型引导的高效模型服务优化

尽管模型服务解锁了前所未有的能力，但大规模模型高昂的服务成本仍然是广泛普及和快速创新的主要障碍。编译器优化长期以来推动了显著的性能改进，但由于可能的变换空间呈指数级庞大且高度相互依赖，现有编译器难以应对神经网络工作负载。尽管现有的随机搜索技术可能有效，但它们通常样本效率低下，并且无法利用编译决策背后的结构上下文。我们着手研究这样一个问题：在不进行任何重新训练的情况下，利用大型语言模型（LLM）进行推理，能否借助编译器优化的上下文感知决策空间来显著提高样本效率。为此，我们提出了一种新颖的编译框架（称为推理编译器，Reasoning Compiler），将优化表述为由大型语言模型和结构化蒙特卡洛树搜索（MCTS）引导的顺序、上下文感知决策过程。LLM 作为提议机制，给出反映当前程序状态和累积性能反馈的、考虑硬件特性的变换。MCTS 结合 LLM 生成的提议来平衡探索和利用，促进对庞大编译优化空间的结构化、上下文敏感遍历。通过以明显少于领先神经编译器的样本数量实现显著加速，我们的方法展示了 LLM 引导的推理改变编译优化格局的潜力。

Summary / 总结

The research aims to improve the efficiency of serving large-scale models by leveraging large language models (LLMs) for compiler optimizations. The proposed Reasoning Compiler framework uses a large language model and structured Monte Carlo tree search to guide optimization decisions, balancing exploration and exploitation. This approach achieves significant speedups with fewer samples compared to existing neural compilers, indicating the potential of LLM-guided reasoning in enhancing compiler optimization.

研究旨在通过利用大型语言模型（LLMs）进行编译优化，提高大规模模型的服务效率。Reasoning Compiler框架使用大型语言模型和结构化蒙特卡洛树搜索来引导优化决策，平衡探索与利用。该方法在较少样本的情况下实现了显著的加速，表明LLM引导的推理有可能改变编译优化的格局。
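The search loop described in this entry, where a proposal mechanism suggests transformations and Monte Carlo tree search balances exploration and exploitation, can be sketched with a standard UCB1 selection rule. This is a toy under loud assumptions, not the Reasoning Compiler: `propose` is a fixed stub standing in for the LLM, `measure_speedup` is a made-up cost model standing in for hardware feedback, and the transformation names and gains are hypothetical.

```python
import math

class Node:
    def __init__(self, schedule, parent=None):
        self.schedule = schedule  # tuple of transformations applied so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def ucb1(child, parent_visits, c=1.4):
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")  # always try unvisited children first
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def propose(schedule):
    """Stub for the LLM proposal step (hypothetical): candidate next transformations."""
    return [t for t in ("tile", "vectorize", "unroll") if t not in schedule]

def measure_speedup(schedule):
    """Stub for performance feedback (hypothetical toy cost model)."""
    gains = {"tile": 50, "vectorize": 30, "unroll": 10}
    return sum(gains[t] for t in schedule) / 100.0

def search(iterations=50, max_depth=2):
    root = Node(())
    best = (0.0, ())
    for _ in range(iterations):
        node = root
        # Selection: follow the highest UCB1 score down to a leaf.
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
        # Expansion: ask the proposer for candidates given the current schedule.
        if len(node.schedule) < max_depth:
            node.children = [Node(node.schedule + (t,), node) for t in propose(node.schedule)]
            if node.children:
                node = node.children[0]
        # Evaluation and backpropagation of the measured reward.
        reward = measure_speedup(node.schedule)
        best = max(best, (reward, node.schedule))
        while node is not None:
            node.visits += 1
            node.total_reward += reward
            node = node.parent
    return best

best_reward, best_schedule = search()
print(best_reward, best_schedule)
```

In a real system the proposer would condition on the program state and accumulated feedback, which is where the sample-efficiency gain claimed in the abstract would come from; the fixed stub here cannot show that effect.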

DeepSeek's WEIRD Behavior: The cultural alignment of Large Language Models and the effects of prompt language and cultural prompting

Authors: James Luther, Donald Brown

First: 2025-12-10T15:54:18+00:00 · Latest: 2025-12-12T17:25:30+00:00

Abstract

Culture is a core component of human-to-human interaction and plays a vital role in how we perceive and interact with others. Advancements in the effectiveness of Large Language Models (LLMs) in generating human-sounding text have greatly increased the amount of human-to-computer interaction. As this field grows, the cultural alignment of these human-like agents becomes an important field of study. Our work uses Hofstede's VSM13 international surveys to understand the cultural alignment of the following models: DeepSeek-V3, V3.1, GPT-4, GPT-4.1, GPT-4o, and GPT-5. We use a combination of prompt language and cultural prompting, a strategy that uses a system prompt to shift a model's alignment to reflect a specific country, to align these LLMs with the United States and China. Our results show that DeepSeek-V3, V3.1, and OpenAI's GPT-5 exhibit a close alignment with the survey responses of the United States and do not achieve a strong or soft alignment with China, even when using cultural prompts or changing the prompt language. We also find that GPT-4 exhibits an alignment closer to China when prompted in English, but cultural prompting is effective in shifting this alignment closer to the United States. Other low-cost models, GPT-4o and GPT-4.1, respond to the prompt language used (i.e., English or Simplified Chinese) and cultural prompting strategies to create acceptable alignments with both the United States and China.

中文标题/摘要

标题：DeepSeek的WEIRD行为：大型语言模型的文化对齐以及提示语言和文化提示的影响

文化是人与人之间互动的核心组成部分，对我们的感知和互动方式起着至关重要的作用。大型语言模型（LLMs）在生成人类语言文本方面效果的提升，极大地增加了人机互动的数量。随着这一领域的增长，这些类人代理的文化对齐成为了一个重要的研究领域。我们的研究使用霍夫斯泰德的VSM13国际调查来理解以下模型的文化对齐情况：DeepSeek-V3、V3.1、GPT-4、GPT-4.1、GPT-4o和GPT-5。我们使用提示语言和文化提示的策略，通过系统提示来调整模型的对齐，使其反映特定国家的文化，以使这些LLMs与美国和中国对齐。结果显示，DeepSeek-V3、V3.1和OpenAI的GPT-5与美国的调查响应表现出紧密的对齐，即使使用文化提示或改变提示语言，也无法实现与中国较强的或温和的对齐。我们还发现，当用英语提示时，GPT-4更接近中国的对齐，但文化提示可以有效地将这种对齐调整得更接近美国。其他低成本模型GPT-4o和GPT-4.1会根据使用的提示语言（即英语或简体中文）和文化提示策略来创建与美国和中国都可接受的对齐。

Summary / 总结

This study investigates the cultural alignment of Large Language Models (LLMs) using Hofstede's VSM13 international surveys. The researchers employed a combination of prompt language and cultural prompting to align models with the United States and China. Key findings include that DeepSeek-V3, V3.1, and GPT-5 closely align with U.S. survey responses but do not align strongly with China, even with cultural prompts. GPT-4 aligns more with China when prompted in English, but cultural prompting can shift its alignment closer to the U.S. GPT-4o and GPT-4.1 respond to prompt language and cultural prompting to achieve acceptable alignments with both the U.S. and China.

本研究使用Hofstede的VSM13国际调查来研究大型语言模型（LLMs）的文化对齐情况。研究人员采用组合提示语言和文化提示的方法，将模型与美国和中国对齐。主要发现包括：DeepSeek-V3、V3.1和GPT-5与美国调查响应高度对齐，但即使使用文化提示也无法与中国对齐。GPT-4在用英语提示时更接近中国，但文化提示可以将其对齐更接近美国。GPT-4o和GPT-4.1对提示语言和文化提示策略作出响应，可以实现与美国和中国的可接受对齐。

SOF: Sorted Opacity Fields for Fast Unbounded Surface Reconstruction

Authors: Lukas Radl, Felix Windisch, Thomas Deixelberger, Jozef Hladky, Michael Steiner, Dieter Schmalstieg, Markus Steinberger

Venue: SIGGRAPH Asia 2025

First: 2025-06-23T21:20:52+00:00 · Latest: 2025-12-12T17:12:11+00:00

Comments: SIGGRAPH Asia 2025; Project Page: https://r4dl.github.io/SOF/

Abstract

Recent advances in 3D Gaussian representations have significantly improved the quality and efficiency of image-based scene reconstruction. Their explicit nature facilitates real-time rendering and fast optimization, yet extracting accurate surfaces - particularly in large-scale, unbounded environments - remains a difficult task. Many existing methods rely on approximate depth estimates and global sorting heuristics, which can introduce artifacts and limit the fidelity of the reconstructed mesh. In this paper, we present Sorted Opacity Fields (SOF), a method designed to recover detailed surfaces from 3D Gaussians with both speed and precision. Our approach improves upon prior work by introducing hierarchical resorting and a robust formulation of Gaussian depth, which better aligns with the level-set. To enhance mesh quality, we incorporate a level-set regularizer operating on the opacity field and introduce losses that encourage geometrically-consistent primitive shapes. In addition, we develop a parallelized Marching Tetrahedra algorithm tailored to our opacity formulation, reducing meshing time by up to an order of magnitude. As demonstrated by our quantitative evaluation, SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three. These results mark a step forward in turning efficient Gaussian-based rendering into equally efficient geometry extraction.

中文标题/摘要

标题：SOF：排序透明度字段以实现快速无界表面重建

近年来，3D 高斯表示的进展显著提高了基于图像场景重建的质量和效率。它们的显式性质便于实时渲染和快速优化，但提取准确的表面——特别是在大规模、无界环境中——仍然是一个难题。许多现有方法依赖于近似深度估计和全局排序启发式，这可能会引入伪影并限制重建网格的保真度。在本文中，我们提出了排序透明度字段（SOF），这是一种旨在从3D高斯中恢复详细表面的方法，兼具速度和精度。我们的方法通过引入分层重新排序和高斯深度的稳健公式改进了先前的工作，这更好地与水平集对齐。为了提高网格质量，我们在透明度字段上引入了水平集正则化，并引入了鼓励几何一致的原始形状的损失。此外，我们开发了一种针对我们透明度公式进行并行化的Marching Tetrahedra算法，将网格生成时间减少了十倍。正如我们的定量评估所显示的，SOF在提高重建精度的同时，将总处理时间缩短了三倍以上。这些结果标志着将高效的高斯渲染转化为同样高效的几何提取迈出了一步。

Summary / 总结

This work aims to improve the accuracy and efficiency of surface reconstruction from 3D Gaussian representations, especially in large-scale, unbounded environments. The method, Sorted Opacity Fields (SOF), introduces hierarchical resorting and a robust Gaussian depth formulation that aligns better with the level set. A level-set regularizer on the opacity field and losses that encourage geometrically consistent primitive shapes improve mesh quality, while a parallelized Marching Tetrahedra algorithm reduces meshing time by up to an order of magnitude. In the quantitative evaluation, SOF achieves higher reconstruction accuracy while cutting total processing time by more than a factor of three.

该研究旨在提高从3D高斯表示中进行表面重建的准确性和效率，特别是在大规模、无边界的环境中。方法Sorted Opacity Fields (SOF) 引入了分层重新排序和稳健的高斯深度公式，使其更好地与水平集对齐。关键实验结果表明，SOF 在提高重建准确性的同时，将总处理时间减少了三倍以上，并通过针对其透明度公式优化的并行Marching Tetrahedra算法显著缩短了网格生成时间。

From Signal to Turn: Interactional Friction in Modular Speech-to-Speech Pipelines

Authors: Titaya Mairittha, Tanakon Sawanglok, Panuwit Raden, Jirapast Buntub, Thanapat Warunee, Napat Asawachaisuvikrom, Thanaphum Saiwongin

First: 2025-12-12T17:05:11+00:00 · Latest: 2025-12-12T17:05:11+00:00

Comments: 6 pages, 1 figure

Abstract

While voice-based AI systems have achieved remarkable generative capabilities, their interactions often feel conversationally broken. This paper examines the interactional friction that emerges in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines. By analyzing a representative production system, we move beyond simple latency metrics to identify three recurring patterns of conversational breakdown: (1) Temporal Misalignment, where system delays violate user expectations of conversational rhythm; (2) Expressive Flattening, where the loss of paralinguistic cues leads to literal, inappropriate responses; and (3) Repair Rigidity, where architectural gating prevents users from correcting errors in real-time. Through system-level analysis, we demonstrate that these friction points should not be understood as defects or failures, but as structural consequences of a modular design that prioritizes control over fluidity. We conclude that building natural spoken AI is an infrastructure design challenge, requiring a shift from optimizing isolated components to carefully choreographing the seams between them.

中文标题/摘要

标题：从信号到转变：模块化语音到语音管道中的互动摩擦

尽管基于语音的AI系统在生成能力上取得了显著进展，但它们的互动往往在对话上显得不连贯。本文探讨了在模块化语音到语音检索增强生成（S2S-RAG）管道中出现的互动摩擦。通过分析一个代表性的生产系统，我们超越了简单的延迟指标，识别出三种反复出现的对话中断模式：（1）时间错位，其中系统延迟违反了用户对对话节奏的期望；（2）表达扁平化，其中语外线索的丢失导致了字面且不适当的回应；（3）修复僵化，其中架构控制阻止用户在实时纠正错误。通过系统级分析，我们表明这些摩擦点不应被视为缺陷或失败，而是模块化设计结构上的后果，该设计优先考虑控制而非流畅性。我们得出结论，构建自然语音AI是一个基础设施设计挑战，需要从优化孤立组件转向精心编排它们之间的接缝。

Summary / 总结

This paper investigates the conversational breakdowns in modular Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG) pipelines, focusing on three patterns: Temporal Misalignment, Expressive Flattening, and Repair Rigidity. By analyzing a production system, the authors identify these issues as structural consequences of a modular design that prioritizes control over fluidity. The study suggests that building natural spoken AI requires a shift from optimizing isolated components to carefully choreographing the interactions between them.

该论文研究了模块化Speech-to-Speech Retrieval-Augmented Generation (S2S-RAG)管道中的交互摩擦，重点关注三个反复出现的问题：时间错位、表达单调和修复僵化。通过对生产系统的分析，作者将这些问题视为模块化设计优先控制而非流畅性所导致的结构后果，建议需要更好的基础设施设计来解决这些摩擦点。

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Authors: Zhaodong Wang, Zhenting Qi, Sherman Wong, Nathan Hu, Samuel Lin, Jun Ge, Erwin Gao, Yining Yang, Ben Maurer, Wenlin Chen, David Recordon, Yilun Du, Minlan Yu, Ying Zhang

First: 2025-12-11T08:05:58+00:00 · Latest: 2025-12-12T16:59:12+00:00

Comments: Meta requires more thorough internal review process to ensure paper quality and experiments as well as compliance with the internal research publishing process

Abstract

Real-world AI software engineering demands coding agents that can reason over massive repositories, maintain durable memory across and within long sessions, and robustly coordinate complex toolchains at test time. Existing open-source coding agents provide transparency but frequently fall short when pushed to these industrial-scale workloads, while proprietary coding agents offer strong practical performance but limited extensibility, interpretability, and controllability. We present the Confucius Code Agent (CCA), an open-sourced AI software engineer that can operate at an industrial scale. CCA is built atop the Confucius SDK, an open-sourced agent development platform designed around three complementary perspectives: Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK introduces a unified orchestrator with hierarchical working memory for long-context reasoning, a persistent note-taking system for cross-session continual learning, and a modular extension module for robust tool use. Moreover, a meta-agent automates the synthesis, evaluation, and refinement of agent configurations through a build-test-improve loop, enabling rapid agent development on new tasks, environments, and tool stacks. Instantiated on Confucius SDK with these mechanisms, CCA delivers strong performance on real-world software engineering tasks. On SWE-Bench-Pro, CCA achieves a state-of-the-art Resolve@1 performance of 54.3%, substantially improving over prior coding agents. Together, the Confucius SDK and CCA provide a transparent, extensible, and reproducible foundation for AI agents, bridge gaps between research prototypes and production-grade systems, and support agent development and deployment at industrial scale.

中文标题/摘要

标题：孔夫子代码代理：工业规模的开源AI软件工程师

现实中的AI软件工程需要能够对大规模代码库进行推理、在长时间会话内外保持持久记忆，并在测试时稳健地协调复杂工具链的编码代理。现有的开源编码代理提供了透明性，但在推向工业规模的工作负载时经常表现不佳，而专有的编码代理则提供了强大的实际性能，但受限于扩展性、可解释性和可控性。我们介绍了孔夫子代码代理（CCA），这是一种可以在工业规模上运行的开源AI软件工程师。CCA基于孔夫子SDK构建，这是一个围绕代理体验（AX）、用户体验（UX）和开发体验（DX）三个互补视角设计的开源代理开发平台。SDK引入了一个统一的协调器，具有分层工作记忆，用于长上下文推理，一个持久的笔记系统，用于跨会话持续学习，以及一个模块化的扩展模块，用于稳健地使用工具。此外，一个元代理通过构建-测试-改进循环自动化编码代理配置的合成、评估和优化，从而实现快速开发新任务、环境和工具堆栈上的编码代理。通过这些机制在孔夫子SDK上实现，CCA在实际软件工程任务上表现出色。在SWE-Bench-Pro上，CCA实现了54.3%的最先进的Resolve@1性能，显著优于之前的编码代理。孔夫子SDK和CCA共同提供了一个透明、可扩展和可重复的基础架构，用于AI代理，填补了研究原型与生产级系统之间的差距，并支持工业规模的代理开发和部署。

Reframing Music-Driven 2D Dance Pose Generation as Multi-Channel Image Generation

Authors: Yan Zhang, Han Zou, Lincong Feng, Cong Xie, Ruiqi Yu, Zhenpeng Zhan

First: 2025-12-12T16:57:46+00:00 · Latest: 2025-12-12T16:57:46+00:00

Abstract

Recent pose-to-video models can translate 2D pose sequences into photorealistic, identity-preserving dance videos, so the key challenge is to generate temporally coherent, rhythm-aligned 2D poses from music, especially under complex, high-variance in-the-wild distributions. We address this by reframing music-to-dance generation as a music-token-conditioned multi-channel image synthesis problem: 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, allowing us to inherit architectural and training advances from modern text-to-image models and better capture high-variance 2D pose distributions. On top of this formulation, we introduce (i) a time-shared temporal indexing scheme that explicitly synchronizes music tokens and pose latents over time and (ii) a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale while enabling long-horizon segment-and-stitch generation. Experiments on a large in-the-wild 2D dance corpus and the calibrated AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and human preference, and ablations validate the contributions of the representation, temporal indexing, and reference conditioning. See supplementary videos at https://hot-dance.github.io

中文标题/摘要

标题：将音乐驱动的2D舞蹈姿态生成重新构想为多通道图像生成

近期的姿势到视频模型可以将2D姿态序列转化为具有保真度的、身份保持的舞蹈视频，因此关键挑战是从音乐中生成时间上连贯、节奏对齐的2D姿态，尤其是在复杂、高变异性的真实世界分布下。我们通过将音乐到舞蹈生成重新构想为音乐标记条件下的多通道图像合成问题来解决这一问题：2D姿态序列被编码为一热图像，通过预训练的图像VAE压缩，并使用DiT风格的骨干模型进行建模，使我们能够继承现代文本到图像模型的架构和训练进步，更好地捕捉2D姿态的高变异性分布。在此基础上，我们引入了(i)一种时间共享的时间索引方案，明确同步音乐标记和姿态潜变量随时间的变化，以及(ii)一种参考姿态条件策略，保留特定主体的身体比例和屏幕尺寸，同时允许长时段的片段和缝合生成。在大型真实世界2D舞蹈语料库和校准的AIST++2D基准测试上进行的实验显示，在姿态和视频空间度量以及人类偏好方面，该方法相对于代表性音乐到舞蹈方法的一致改进，并且消融实验验证了表示、时间索引和参考条件的贡献。请参见补充视频：https://hot-dance.github.io

Summary / 总结

This paper addresses the challenge of generating temporally coherent, rhythm-aligned 2D dance poses from music by reframing the problem as music-token-conditioned multi-channel image synthesis. 2D pose sequences are encoded as one-hot images, compressed by a pretrained image VAE, and modeled with a DiT-style backbone, which lets the method handle high-variance pose distributions better. Key contributions include a time-shared temporal indexing scheme that synchronizes music tokens and pose latents, and a reference-pose conditioning strategy that preserves subject-specific body proportions and on-screen scale. Experiments on a large in-the-wild 2D dance corpus and the AIST++2D benchmark show consistent improvements over representative music-to-dance methods in pose- and video-space metrics and in human preference, and ablations validate each component.

研究旨在从音乐生成时间上连贯且节奏对齐的2D舞蹈姿态，解决野外高变异性分布的挑战。方法将问题重新定义为多通道图像合成任务，使用预训练的图像VAE和DiT风格的骨干网络。关键贡献包括时间共享的时间索引方案和参考姿态条件策略，这些策略在姿态和视频指标以及人类偏好方面均显示出改进。实验在大型数据集和基准上显示了相对于现有方法的一致改进。消融实验验证了所提组件的有效性。
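
The one-hot pose encoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes one channel per joint and keypoints normalized to [0, 1]; the function name `pose_to_one_hot`, the grid size, and the channel layout are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Keypoint = Tuple[float, float]  # (x, y), assumed normalized to [0, 1]


def pose_to_one_hot(pose: List[Keypoint], height: int, width: int) -> List[List[List[int]]]:
    """Return a [joints][height][width] grid with a single 1 per joint channel."""
    image = []
    for x, y in pose:
        channel = [[0] * width for _ in range(height)]
        # Clamp to the last valid index so a coordinate of exactly 1.0 stays in bounds
        col = min(int(x * width), width - 1)
        row = min(int(y * height), height - 1)
        channel[row][col] = 1
        image.append(channel)
    return image


# Two hypothetical joints on a 4 x 8 grid
pose = [(0.5, 0.25), (0.0, 1.0)]
image = pose_to_one_hot(pose, height=4, width=8)
print(len(image), "channels of size", len(image[0]), "x", len(image[0][0]))
```

In this sketch a pose sequence would simply stack such grids over frames; how the paper arranges frames and joints into channels before the VAE is not stated in the abstract.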

Referring Change Detection in Remote Sensing Imagery

Authors: Yilmaz Korkmaz, Jay N. Paranjape, Celso M. de Melo, Vishal M. Patel

Venue: WACV

First: 2025-12-12T16:57:12+00:00 · Latest: 2025-12-12T16:57:12+00:00

Comments: 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

Abstract

Change detection in remote sensing imagery is essential for applications such as urban planning, environmental monitoring, and disaster management. Traditional change detection methods typically identify all changes between two temporal images without distinguishing the types of transitions, which can lead to results that may not align with specific user needs. Although semantic change detection methods have attempted to address this by categorizing changes into predefined classes, these methods rely on rigid class definitions and fixed model architectures, making it difficult to mix datasets with different label sets or reuse models across tasks, as the output channels are tightly coupled with the number and type of semantic classes. To overcome these limitations, we introduce Referring Change Detection (RCD), which leverages natural language prompts to detect specific classes of changes in remote sensing images. By integrating language understanding with visual analysis, our approach allows users to specify the exact type of change they are interested in. However, training models for RCD is challenging due to the limited availability of annotated data and severe class imbalance in existing datasets. To address this, we propose a two-stage framework consisting of (I) RCDNet, a cross-modal fusion network designed for referring change detection, and (II) RCDGen, a diffusion-based synthetic data generation pipeline that produces realistic post-change images and change maps for a specified category using only the pre-change image, without relying on semantic segmentation masks and thereby significantly lowering the barrier to scalable data creation. Experiments across multiple datasets show that our framework enables scalable and targeted change detection. The project website is at https://yilmazkorkmaz1.github.io/RCD.

中文标题/摘要

标题：遥感图像中的变化检测引用

遥感图像的变化检测对于城市规划、环境监测和灾害管理等应用至关重要。传统的变化检测方法通常会在两个时间点的图像之间识别所有变化，但不区分变化类型，这可能导致结果不符合特定用户的需求。虽然语义变化检测方法试图通过将变化分类为预定义类别来解决这一问题，但这些方法依赖于固定的类别定义和模型架构，使得难以混合不同标签集的数据集或将模型跨任务重用，因为输出通道与类别数量和类型紧密耦合。为克服这些限制，我们引入了引用变化检测（RCD），该方法利用自然语言提示来检测遥感图像中的特定类别变化。通过将语言理解与视觉分析相结合，我们的方法允许用户指定他们感兴趣的精确变化类型。然而，由于标注数据的有限可用性和现有数据集中类别不平衡的严重性，训练RCD模型具有挑战性。为解决这一问题，我们提出了一种两阶段框架，包括（I）RCDNet，一种用于引用变化检测的跨模态融合网络，以及（II）RCDGen，一种基于扩散的合成数据生成管道，该管道仅使用预变化图像生成指定类别的现实后变化图像和变化图，而不依赖于语义分割掩码，从而显著降低了大规模数据创建的门槛。在多个数据集上的实验表明，我们的框架能够实现可扩展且有针对性的变化检测。项目网站在此：https://yilmazkorkmaz1.github.io/RCD/

Summary / 总结

The research aims to improve change detection in remote sensing imagery by addressing the limitations of traditional and semantic change detection methods. The proposed Referring Change Detection (RCD) framework uses natural language prompts to detect specific types of changes, integrating language understanding with visual analysis. The framework consists of RCDNet, a cross-modal fusion network, and RCDGen, a synthetic data generation pipeline, which helps in overcoming the challenges of limited annotated data and class imbalance. Experiments demonstrate that RCD enables scalable and targeted change detection across multiple datasets.

该论文提出了Referring Change Detection (RCD) 方法，通过自然语言提示检测遥感图像中的特定类型变化，解决了传统和语义变化检测方法的局限性。该方法采用两阶段框架：RCDNet，一种跨模态融合网络，和RCDGen，一种基于扩散的合成数据生成管道。实验表明，RCD 能够在多个数据集上实现可扩展和针对性的变化检测。

Speculative Decoding Speed-of-Light: Optimal Lower Bounds via Branching Random Walks

Authors: Sergey Pankratov, Dan Alistarh

First: 2025-12-12T16:54:33+00:00 · Latest: 2025-12-12T16:54:33+00:00

Abstract

Speculative generation has emerged as a promising technique to accelerate inference in large language models (LLMs) by leveraging parallelism to verify multiple draft tokens simultaneously. However, the fundamental limits on the achievable speedup remain poorly understood. In this work, we establish the first "tight" lower bounds on the runtime of any deterministic speculative generation algorithm. This is achieved by drawing a parallel between the token generation process and branching random walks, which allows us to analyze the optimal draft tree selection problem. We prove, under basic assumptions, that the expected number of tokens successfully predicted per speculative iteration is bounded as $\mathbb{E}[X] \leq (\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is the expected second log-moment. This result provides new insights into the limits of parallel token generation, and could guide the design of future speculative decoding systems. Empirical evaluations on Llama models validate our theoretical predictions, confirming the tightness of our bounds in practical settings.

Summary / 总结

This work investigates the fundamental limits of speculative generation for accelerating large language models (LLMs) by establishing the first tight lower bounds on the runtime of any deterministic speculative generation algorithm. By relating the token generation process to branching random walks, the authors prove that, under basic assumptions, the expected number of tokens successfully predicted per speculative iteration is at most $(\mu + \mu_{(2)})\log(P)/\mu^2 + O(1)$, where $P$ is the verifier's capacity, $\mu$ is the expected entropy of the verifier's output distribution, and $\mu_{(2)}$ is the expected second log-moment. Empirical evaluations on Llama models support the theoretical findings, demonstrating the practical tightness of these bounds.

本文研究了投机生成在加速大型语言模型（LLMs）方面的基本限制，通过建立首个确定性投机生成算法的紧下界来实现。作者将令牌生成过程与分支随机游走进行类比，以分析最佳草稿树选择问题。关键发现是，每次投机迭代中成功预测的令牌数量受到特定公式的限制，这为平行令牌生成的限制提供了新的见解，并可能指导未来投机解码系统的开发。对Llama模型的实证评估证实了这些理论预测的紧致性。
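
To get a feel for the bound quoted in the abstract, the sketch below evaluates its leading term $(\mu + \mu_{(2)})\log(P)/\mu^2$ and drops the $O(1)$ remainder. It assumes natural logarithms and, purely for illustration, computes $\mu$ and $\mu_{(2)}$ from a single toy distribution rather than an expectation over contexts; the helper names `log_moments` and `leading_term` are hypothetical.

```python
import math
from typing import Sequence, Tuple


def log_moments(probs: Sequence[float]) -> Tuple[float, float]:
    """Entropy E[-log p] and second log-moment E[(log p)^2] of one distribution."""
    mu = -sum(p * math.log(p) for p in probs if p > 0)
    mu2 = sum(p * math.log(p) ** 2 for p in probs if p > 0)
    return mu, mu2


def leading_term(mu: float, mu2: float, capacity: int) -> float:
    """(mu + mu2) * log(P) / mu^2, ignoring the O(1) remainder of the bound."""
    return (mu + mu2) * math.log(capacity) / mu ** 2


# Toy verifier distribution: uniform over four tokens
mu, mu2 = log_moments([0.25, 0.25, 0.25, 0.25])
for capacity in (16, 256):
    print(capacity, round(leading_term(mu, mu2, capacity), 3))
```

Because the term depends on the capacity only through $\log(P)$, squaring the verifier capacity merely doubles it, which is the logarithmic ceiling on tokens per iteration that the abstract describes.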

Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection

Authors: Zongxian Yang, Jiayu Qian, Zegao Peng, Haoyu Zhang, Yu-An Huang, KC Tan, Zhi-An Huang

First: 2025-06-11T14:58:38+00:00 · Latest: 2025-12-12T16:49:44+00:00

Abstract

Large reasoning models excel in domains like mathematics where intermediate reasoning is straightforward to verify, but struggle to self-correct in medical fields where evaluating intermediate reasoning is cumbersome and expensive. This verification bottleneck hinders the development of reliable AI reasoners for high-stakes applications. Here we propose Med-REFL, a novel framework that learns fine-grained reflection without human labels or model distillation. Med-REFL introduces a deterministic structural assessment of the reasoning space to automatically generate preference data for reflection. By globally evaluating all explored reasoning paths in a tree-of-thoughts, our method quantifies the value of corrective actions, enabling the automated construction of direct preference optimization pairs. This trains the model to recognize and amend its own reasoning fallacies. Extensive experiments show Med-REFL delivers robust gains across diverse model architectures and medical benchmarks, boosting a general-purpose Llama3.1-8B by +5.82% and the state-of-the-art Huatuo-o1 by +4.13% on the MedQA benchmark. Our Med-REFL-8B achieves state-of-the-art performance among 7-8B models while even competing with models twice its size. Crucially, targeted ablations prove its success generalizes to other domains such as logical reasoning and mitigates the "fake reflection" phenomenon in LRMs. Ultimately, our framework provides a scalable solution to the verification bottleneck, paving the way for more reliable AI reasoners in high-stakes domains like medicine. Med-REFL has been made publicly available at https://github.com/TianYin123/Med-REFL.

中文标题/摘要

标题：Med-REFL：通过自我纠正的细粒度反思提升医学推理能力

大型推理模型在数学等领域表现出色，因为中间推理易于验证，但在医学领域却难以自我纠正，因为评估中间推理既繁琐又昂贵。这种验证瓶颈阻碍了可靠AI推理器在高风险应用中的发展。为此，我们提出了一种名为Med-REFL的新框架，该框架无需人工标签或模型蒸馏即可学习细粒度的反思。Med-REFL引入了一种确定性的结构评估方法，以自动生成反思的偏好数据。通过全局评估思维树中探索的所有推理路径，我们的方法量化了纠正行动的价值，从而能够自动构建直接的偏好优化对。这使模型能够识别并修正自身的推理谬误。广泛实验表明，Med-REFL在多种模型架构和医学基准测试中均取得了稳健的提升，使通用Llama3.1-8B的性能提高了5.82%，使最先进的Huatuo-o1在MedQA基准测试中的性能提高了4.13%。我们的Med-REFL-8B在7-8B模型中达到了最先进的性能，甚至与规模是其两倍的模型竞争。关键的是，有针对性的消融实验表明，其成功可以推广到其他领域，如逻辑推理，并减轻LRMs中的“假反思”现象。最终，我们的框架提供了一种可扩展的解决方案，以克服验证瓶颈，为医学等高风险领域中的更可靠AI推理器铺平了道路。Med-REFL已在https://github.com/TianYin123/Med-REFL/公开。

Summary / 总结

Med-REFL is a framework designed to enhance medical reasoning in AI models by enabling self-correction through fine-grained reflection, without human labels or model distillation. It evaluates all explored reasoning paths in a tree-of-thoughts to generate preference data for reflection, allowing the model to recognize and correct its own reasoning errors. Experiments show gains across model architectures and medical benchmarks: on MedQA, Med-REFL improves the general-purpose Llama3.1-8B by +5.82% and the specialized Huatuo-o1 by +4.13%.

Med-REFL 是一种框架，旨在通过细粒度反思增强医疗推理模型的自我纠正能力，无需人工标签或模型蒸馏。它通过评估思维树中的所有推理路径来生成偏好数据，从而训练模型识别并纠正其推理错误。实验表明，Med-REFL 在 Llama3.1-8B 和 Huatuo-o1 等模型上分别提升了 5.82% 和 4.13%，并在 MedQA 基准测试中达到 7-8B 模型的最先进性能。
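
The idea of turning a fully explored reasoning tree into preference pairs can be sketched as follows. This is an illustration under stated assumptions, not the paper's scoring rule: a step's value is assumed to be the fraction of final answers beneath it that are correct, and a pair is emitted wherever sibling steps differ in value. The `Node` class, the `value` definition, and the toy tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str
    correct: Optional[bool] = None  # set only on leaves (final answers)
    children: List["Node"] = field(default_factory=list)


def leaf_outcomes(node: Node) -> List[bool]:
    """Correctness of every final answer reachable from this node."""
    if not node.children:
        return [bool(node.correct)]
    return [o for child in node.children for o in leaf_outcomes(child)]


def value(node: Node) -> float:
    """Assumed step value: share of reachable answers that are correct."""
    outcomes = leaf_outcomes(node)
    return sum(outcomes) / len(outcomes)


def preference_pairs(node: Node, prefix: Tuple[str, ...] = ()) -> List[Tuple[str, str, str]]:
    """Collect (context, chosen_step, rejected_step) triples across the tree."""
    path = prefix + (node.step,)
    pairs = []
    if node.children:
        ranked = sorted(node.children, key=value, reverse=True)
        best, worst = ranked[0], ranked[-1]
        if value(best) > value(worst):
            pairs.append((" ".join(path), best.step, worst.step))
        for child in node.children:
            pairs.extend(preference_pairs(child, path))
    return pairs


tree = Node("Question.", children=[
    Node("Step A.", children=[Node("Answer 1.", True), Node("Answer 2.", True)]),
    Node("Step B.", children=[Node("Answer 3.", False),
                              Node("Reflect, then Answer 1.", True)]),
])
pairs = preference_pairs(tree)
for context, chosen, rejected in pairs:
    print(context, "|", chosen, ">", rejected)
```

Each triple has the shape that direct preference optimization expects (a shared context with a preferred and a dispreferred continuation); the second pair in the toy tree is the kind of example that rewards a corrective reflection step.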

Text2Graph: Combining Lightweight LLMs and GNNs for Efficient Text Classification in Label-Scarce Scenarios

Authors: João Lucas Luz Lima Sarcinelli, Ricardo Marcondes Marcacini

First: 2025-12-10T20:31:30+00:00 · Latest: 2025-12-12T16:45:54+00:00

Abstract

Large Language Models (LLMs) have become effective zero-shot classifiers, but their high computational requirements and environmental costs limit their practicality for large-scale annotation in high-performance computing (HPC) environments. To support more sustainable workflows, we present Text2Graph, an open-source Python package that provides a modular implementation of existing text-to-graph classification approaches. The framework enables users to combine LLM-based partial annotation with Graph Neural Network (GNN) label propagation in a flexible manner, making it straightforward to swap components such as feature extractors, edge construction methods, and sampling strategies. We benchmark Text2Graph in a zero-shot setting using five datasets spanning topic classification and sentiment analysis tasks, comparing multiple variants against other zero-shot approaches for text classification. In addition to reporting performance, we provide detailed estimates of energy consumption and carbon emissions, showing that graph-based propagation achieves competitive results at a fraction of the energy and environmental cost.

中文标题/摘要

标题：Text2Graph：结合轻量级LLM和GNN的高效文本分类方法

大型语言模型（LLMs）已成为有效的零样本分类器，但其高计算需求和环境成本限制了其在高性能计算（HPC）环境中的大规模注释实用性。为了支持更可持续的工作流程，我们提出了Text2Graph，这是一个开源的Python包，提供了现有文本到图分类方法的模块化实现。该框架允许用户以灵活的方式结合基于LLM的部分注释与图神经网络（GNN）标签传播，使得可以方便地更换特征提取器、边构建方法和采样策略等组件。我们在五个涵盖主题分类和情感分析任务的数据集上对Text2Graph进行了零样本设置下的基准测试，将多种变体与其他文本分类的零样本方法进行了比较。除了报告性能外，我们还提供了详细的能耗和碳排放估算，显示基于图的传播在能耗和环境成本方面实现了具有竞争力的结果。

Summary / 总结

The research aims to address the computational and environmental challenges of using large language models (LLMs) for text classification in label-scarce scenarios. Text2Graph, an open-source Python package, combines LLMs and GNNs to enable efficient text classification. The framework allows users to flexibly integrate LLM-based partial annotation with GNN label propagation. Experiments on five datasets show that Text2Graph achieves competitive performance while significantly reducing energy consumption and carbon emissions compared to other zero-shot approaches.

研究旨在解决大规模语言模型（LLMs）在标签稀缺场景下进行文本分类时的计算和环境挑战。方法是利用Text2Graph开源Python包，将轻量级LLMs与GNNs结合，实现高效的标签传播。关键发现表明，该方法在性能上与其它零样本文本分类方法相当，但能耗和碳排放却低得多。
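
The partial-annotation-plus-propagation workflow can be sketched without any of the package's actual API, which the abstract does not describe. The example below uses classical iterative label propagation in place of a GNN, on a hand-written graph; the function `propagate`, the graph, and the labels are hypothetical.

```python
from typing import Dict, List


def propagate(edges: Dict[str, List[str]], seeds: Dict[str, str],
              labels: List[str], rounds: int = 20) -> Dict[str, str]:
    """Spread labels from a few annotated nodes to the rest of the graph."""
    # One-hot scores for seeded (LLM-annotated) nodes, zeros elsewhere
    scores = {n: {l: 1.0 if seeds.get(n) == l else 0.0 for l in labels} for n in edges}
    for _ in range(rounds):
        updated = {}
        for node, neighbors in edges.items():
            if node in seeds:
                updated[node] = scores[node]  # keep annotated nodes clamped
                continue
            updated[node] = {
                l: sum(scores[m][l] for m in neighbors) / max(len(neighbors), 1)
                for l in labels
            }
        scores = updated
    return {n: max(labels, key=lambda l: scores[n][l]) for n in edges}


# Four documents in a chain; only the two ends received an LLM label
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
seeds = {"a": "sports", "d": "politics"}
predicted = propagate(edges, seeds, labels=["sports", "politics"])
print(predicted)
```

The energy argument in the abstract follows from this structure: the expensive model runs only on the seed documents, and the remaining labels come from cheap neighborhood averaging (or, in the paper's setting, a GNN).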

Uncertainty Distillation: Teaching Language Models to Express Semantic Confidence

Authors: Sophia Hager, David Mueller, Kevin Duh, Nicholas Andrews

First: 2025-03-18T21:29:29+00:00 · Latest: 2025-12-12T16:31:27+00:00

Abstract

As large language models (LLMs) are increasingly used for factual question-answering, it becomes more important for LLMs to have the capability to communicate the likelihood that their answer is correct. For these verbalized expressions of uncertainty to be meaningful, they should reflect the error rates at the expressed level of confidence. However, when prompted to express confidence, the error rates of current LLMs are inconsistent with their communicated confidences, highlighting the need for uncertainty quantification methods. Many prior methods calculate lexical uncertainty, estimating a model's confidence in the specific string it generated. In some cases, however, it may be more useful to estimate semantic uncertainty, or the model's confidence in the answer regardless of how it is verbalized. We propose a simple procedure, uncertainty distillation, to teach an LLM to verbalize calibrated semantic confidences. Using held-out data to map initial uncertainty estimates to meaningful probabilities, we create examples annotated with verbalized probabilities for supervised fine-tuning. We find that our method yields verbalized confidences that correlate well with observed error rates, even when compared to strong baselines, some of which are more than twenty times slower at inference time. Additionally, we demonstrate that our method can be applied to black-box models that allow API-based fine-tuning, resulting in estimates of uncertainty that are both more effective and more efficient than any of our baselines.

中文标题/摘要

标题：不确定性提炼：训练语言模型表达语义置信度

随着大型语言模型（LLMs）在事实问答中的应用越来越广泛，LLMs 具有传达其答案正确性的可能性变得越来越重要。为了使这些关于不确定性的口头表达有意义，它们应该反映在表达的置信水平下的错误率。然而，当被要求表达置信度时，当前 LLMs 的错误率与其传达的置信度不一致，突显了需要不确定性量化方法的必要性。许多先前的方法计算词汇不确定性，估计模型对其生成的具体字符串的信心。然而，在某些情况下，估计语义不确定性，即模型对其答案的信心，而不考虑其如何口头表达，可能更有用。我们提出了一种简单的程序——不确定性提炼，以训练 LLM 表达校准的语义置信度。利用保留的数据将初始不确定性估计映射到有意义的概率，我们创建了带有口头化概率注释的示例，用于监督微调。我们发现，我们的方法产生的口头置信度与观察到的错误率相关性良好，即使与强大的基线方法相比也是如此，有些基线方法在推理时间上慢了二十多倍。此外，我们展示了我们的方法可以应用于允许基于 API 微调的黑盒模型，从而产生比我们所有基线方法更有效且更高效的不确定性估计。

Summary / 总结

The research aims to improve the ability of large language models to express calibrated semantic confidence in their answers, which is crucial for factual question-answering. The proposed procedure, uncertainty distillation, uses held-out data to map initial uncertainty estimates to meaningful probabilities, annotates examples with the resulting verbalized probabilities, and fine-tunes the model on them. The resulting verbalized confidences correlate well with observed error rates, even compared with strong baselines, some of which are more than twenty times slower at inference time. The method can also be applied to black-box models that allow API-based fine-tuning, where it yields uncertainty estimates that are both more effective and more efficient than the baselines.

研究旨在提高大型语言模型在其答案中表达语义置信度的能力，这对于事实问答至关重要。方法是通过使用保留数据将初始不确定性估计映射到有意义的概率，训练模型。这导致表达的置信度与观察到的错误率高度相关，优于强大的基线模型，在效果和效率上都更胜一筹。该方法还可以应用于黑盒模型，增强其实用性。
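
The mapping from raw uncertainty scores to verbalized probabilities can be sketched with histogram binning, which is one common calibration choice and is assumed here rather than taken from the paper. The raw scores, the bin count, the phrasing of the confidence statement, and the helper names `fit_bins` and `verbalize` are all hypothetical.

```python
from typing import List, Tuple


def fit_bins(held_out: List[Tuple[float, bool]], n_bins: int = 5) -> List[float]:
    """Per-bin accuracy on held-out (raw_score, was_correct) pairs."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for score, correct in held_out:
        b = min(int(score * n_bins), n_bins - 1)
        hits[b] += int(correct)
        counts[b] += 1
    # Fall back to the bin midpoint when a bin has no held-out examples
    return [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def verbalize(score: float, bin_accuracy: List[float]) -> str:
    """Turn a raw score into a calibrated, human-readable confidence statement."""
    n_bins = len(bin_accuracy)
    b = min(int(score * n_bins), n_bins - 1)
    return f"I am about {round(bin_accuracy[b] * 100)}% confident."


held_out = [(0.95, True), (0.90, True), (0.92, False), (0.85, True),
            (0.30, False), (0.35, True), (0.25, False), (0.38, False)]
bins = fit_bins(held_out)

# A supervised fine-tuning example annotated with the calibrated statement
example = {"prompt": "Question text", "completion": "Answer text. " + verbalize(0.93, bins)}
print(example["completion"])
```

The raw score of 0.93 is verbalized as 75% because only three of the four held-out answers in that score range were correct, which is the sense in which the stated confidence is meant to track the observed error rate.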

Integrating Ontologies with Large Language Models for Enhanced Control Systems in Chemical Engineering

Authors: Crystal Su, Kuai Yu, Jingrui Zhang, Mingyuan Shao, Daniel Bauer

First: 2025-10-30T18:04:20+00:00 · Latest: 2025-12-12T16:14:17+00:00

Comments: This paper is withdrawn due to issues with attribution and citation accuracy

Abstract

This work presents an ontology-integrated large language model (LLM) framework for chemical engineering that unites structured domain knowledge with generative reasoning. The proposed pipeline aligns model training and inference with the COPE ontology through a sequence of data acquisition, semantic preprocessing, information extraction, and ontology mapping steps, producing templated question-answer pairs that guide fine-tuning. A control-focused decoding stage and citation gate enforce syntactic and factual grounding by constraining outputs to ontology-linked terms, while evaluation metrics quantify both linguistic quality and ontological accuracy. Feedback and future extensions, including semantic retrieval and iterative validation, further enhance the system's interpretability and reliability. This integration of symbolic structure and neural generation provides a transparent, auditable approach for applying LLMs to process control, safety analysis, and other critical engineering contexts.

中文标题/摘要

标题：将本体与大型语言模型集成以增强化学工程中的控制系统

本文提出了一种将本体集成到大型语言模型（LLM）框架中以化学工程，将结构化领域知识与生成性推理相结合。所提出的流水线通过数据获取、语义预处理、信息提取和本体映射等一系列步骤，将模型训练和推理与COPE本体对齐，生成模板化的问答对，指导微调。专注于控制的解码阶段和引文门控通过限制输出到本体链接术语来强制语法和事实基础，而评估指标则量化语言质量和本体准确性。反馈和未来扩展，包括语义检索和迭代验证，进一步增强了系统的可解释性和可靠性。这种符号结构与神经生成的集成为过程控制、安全分析和其他关键工程环境提供了透明和可审计的方法。

Summary / 总结

This work introduces an ontology-integrated large language model framework for chemical engineering, combining structured domain knowledge with generative reasoning. The pipeline involves data acquisition, semantic preprocessing, information extraction, and ontology mapping to generate question-answer pairs for fine-tuning. The decoding stage and citation gate ensure syntactic and factual grounding, while evaluation metrics assess linguistic quality and ontological accuracy. Future extensions aim to improve interpretability and reliability through semantic retrieval and iterative validation. However, the paper is withdrawn due to issues with attribution and citation accuracy.

该研究提出了一种结合结构化领域知识和生成推理的大型语言模型框架，用于化学工程。该管道包括数据获取、语义预处理、信息提取和本体映射，生成用于微调的问题-答案对。解码阶段和引文门控确保输出与本体链接的术语相关，而评估指标衡量语言质量和本体准确性。未来扩展包括语义检索和迭代验证，以提高可解释性和可靠性。

MedRule-KG: A Knowledge-Graph--Steered Scaffold for Reliable Mathematical and Biomedical Reasoning

Authors: Crystal Su

First: 2025-11-17T04:42:52+00:00 · Latest: 2025-12-12T16:08:56+00:00

Comments: This paper is withdrawn due to issues with attribution and citation accuracy

Abstract

We study how to impose domain-consistent structure on large language models (LLMs) used for scientific reasoning and early-stage drug discovery. We present MedRule-KG, a compact knowledge-graph scaffold paired with a lightweight verifier that steers generation toward mathematically and biomedically valid outputs. The system injects curated symbolic facts into prompts and then enforces rule satisfaction with a deterministic checker. We formalize generation as constrained inference, introduce a soft guidance surrogate suitable for decoding, and perform a thorough statistical analysis with uncertainty quantification. Across 90 tasks spanning reaction feasibility, metabolic compatibility, and toxicity screening, MedRule-KG reduces violation counts by 83.2% relative to a strong chain-of-thought baseline while improving exact match. Results remain stable under stratification and scale with dataset size, and the verifier adds negligible latency, making the approach practical for interactive design.

中文标题/摘要

标题：MedRule-KG：一种知识图谱导向的框架，用于可靠地进行数学和生物医学推理

我们研究如何在用于科学推理和早期药物发现的大语言模型（LLMs）中施加领域一致的结构。我们提出了MedRule-KG，这是一种紧凑的知识图谱框架，配有一个轻量级验证器，引导生成符合数学和生物医学有效输出的内容。该系统将经过筛选的符号事实注入提示，然后使用确定性检查器强制执行规则满足。我们将生成视为受限推理，引入了适合解码的软指导替代方案，并进行了彻底的统计分析，包括不确定性量化。在涉及反应可行性、代谢兼容性和毒性筛查的90个任务中，MedRule-KG 相对于强大的链式思考基线将违反计数减少了83.2%，同时提高了精确匹配率。结果在分层后保持稳定，并随着数据集大小的增加而扩展，验证器增加了几乎可以忽略的延迟，使该方法适用于交互式设计。

Summary / 总结

The research aims to enhance the reliability of large language models in scientific reasoning and drug discovery by integrating domain-specific knowledge. MedRule-KG uses a compact knowledge graph and a lightweight verifier to guide the model towards mathematically and biomedically valid outputs. The system reduces violation counts by 83.2% compared to a strong chain-of-thought baseline while improving exact match accuracy. The approach is stable and scalable, with negligible latency added by the verifier, making it suitable for interactive design applications.

研究旨在通过引入领域特定知识来提高大型语言模型（LLMs）在科学研究和药物发现中的可靠性。MedRule-KG 使用知识图谱支架和轻量级验证器来引导生成数学和生物医学上有效的输出。该系统将违反规则的数量减少了 83.2%，同时提高了精确匹配的准确性。该方法在分层分析中保持稳定，并且随着数据集规模的扩大而扩展，验证器的延迟几乎可以忽略不计，使其适用于交互式设计。

MedRule-KG: A Knowledge-Graph--Steered Scaffold for Mathematical Reasoning with a Lightweight Verifier

Authors: Crystal Su

First: 2025-10-18T02:39:13+00:00 · Latest: 2025-12-12T16:08:36+00:00

Comments: This paper is withdrawn due to issues with attribution and citation accuracy

Abstract

Large language models (LLMs) often produce fluent reasoning steps while violating simple mathematical or logical constraints. We introduce MedRule-KG, a compact typed knowledge graph coupled with a symbolic verifier, designed to enforce mathematically interpretable rules in reasoning tasks. MedRule-KG encodes entities, relations, and three domain-inspired rules, while the verifier checks predictions and applies minimal corrections to guarantee consistency. On a 90-example FDA-derived benchmark, grounding in MedRule-KG improves exact match (EM) from 0.767 to 0.900, and adding the verifier yields 1.000 EM while eliminating rule violations entirely. We demonstrate how MedRule-KG provides a general scaffold for safe mathematical reasoning, discuss ablations, and release code and data to encourage reproducibility.

中文标题/摘要

标题：MedRule-KG：一种由知识图谱引导、配有轻量级验证器的数学推理支架

大型语言模型（LLMs）通常会产生流畅的推理步骤，却违反简单的数学或逻辑约束。我们引入了MedRule-KG，这是一种紧凑的类型化知识图谱，并配有符号验证器，旨在推理任务中强制执行数学上可解释的规则。MedRule-KG 编码实体、关系和三条受领域启发的规则，而验证器检查预测并应用最小的修正以保证一致性。在一个由90个样例组成的FDA衍生基准上，引入MedRule-KG 将精确匹配（EM）从0.767提高到0.900，添加验证器后EM达到1.000，同时完全消除了规则违反。我们展示了MedRule-KG 如何为安全的数学推理提供通用支架，讨论了消融实验，并发布了代码和数据以促进可复现性。

Summary / 总结

MedRule-KG is a knowledge graph-based system that includes a symbolic verifier to ensure mathematical consistency in reasoning tasks. It encodes entities, relations, and domain-specific rules, and the verifier corrects predictions to maintain consistency. On a benchmark of 90 examples, grounding in MedRule-KG improved exact match from 0.767 to 0.900, and adding the verifier achieved 1.000 exact match while eliminating rule violations. The system provides a general framework for safe mathematical reasoning and includes code and data for reproducibility.

MedRule-KG 是一个基于知识图谱的系统，结合了符号验证器以确保推理任务中的数学一致性。它编码实体、关系和领域特定规则，并由验证器修正预测以保持一致性。在包含 90 个样例的基准上，引入 MedRule-KG 将精确匹配率从 0.767 提高到 0.900，而添加验证器后精确匹配率达到 1.000，并完全消除了规则违规。该系统为安全的数学推理提供了通用框架，并附带代码和数据以促进可复现性。

Depth-Copy-Paste: Multimodal and Depth-Aware Compositing for Robust Face Detection

Authors: Qiushi Guo

First: 2025-12-12T16:02:42+00:00 · Latest: 2025-12-12T16:02:42+00:00

Abstract

Data augmentation is crucial for improving the robustness of face detection systems, especially under challenging conditions such as occlusion, illumination variation, and complex environments. Traditional copy-paste augmentation often produces unrealistic composites due to inaccurate foreground extraction, inconsistent scene geometry, and mismatched background semantics. To address these limitations, we propose Depth Copy Paste, a multimodal and depth-aware augmentation framework that generates diverse and physically consistent face detection training samples by copying full-body person instances and pasting them into semantically compatible scenes. Our approach first employs BLIP and CLIP to jointly assess semantic and visual coherence, enabling automatic retrieval of the most suitable background images for the given foreground person. To ensure high-quality foreground masks that preserve facial details, we integrate SAM3 for precise segmentation and Depth-Anything to extract only the non-occluded visible person regions, preventing corrupted facial textures from being used in augmentation. For geometric realism, we introduce a depth-guided sliding-window placement mechanism that searches over the background depth map to identify paste locations with optimal depth continuity and scale alignment. The resulting composites exhibit natural depth relationships and improved visual plausibility. Extensive experiments show that Depth Copy Paste provides more diverse and realistic training data, leading to significant performance improvements in downstream face detection tasks compared with traditional copy-paste and depth-free augmentation methods.
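
The depth-guided sliding-window search might look roughly like the sketch below. It assumes a background depth map given as a 2D list and a single representative depth for the foreground person; the cost function (mean absolute depth difference inside the window) and the name `best_paste_location` are assumptions made for illustration, since the abstract does not define how depth continuity or scale alignment are scored.

```python
def best_paste_location(depth_map, win_h, win_w, target_depth):
    """Slide a win_h x win_w window over depth_map and return (cost, row, col)
    for the window whose depths are closest to target_depth.

    Cost is the mean absolute difference between window depths and the
    foreground depth (an assumed stand-in for "depth continuity").
    Returns None if the window does not fit inside the map.
    """
    rows, cols = len(depth_map), len(depth_map[0])
    best = None
    for r in range(rows - win_h + 1):
        for c in range(cols - win_w + 1):
            window = [depth_map[r + i][c + j]
                      for i in range(win_h) for j in range(win_w)]
            cost = sum(abs(v - target_depth) for v in window) / len(window)
            # Strict "<" keeps the first window found when costs tie.
            if best is None or cost < best[0]:
                best = (cost, r, c)
    return best


background_depth = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [9, 9, 5, 5],
]
placement = best_paste_location(background_depth, 2, 2, target_depth=5)
```

A fuller version would also rescale the foreground according to the depth at the chosen location, which is presumably what "scale alignment" refers to; that step is left out here.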

中文标题/摘要

标题：深度复制粘贴：面向鲁棒人脸检测的多模态深度感知合成

数据增强对于提高面部检测系统的鲁棒性至关重要，尤其是在遮挡、光照变化和复杂环境等挑战性条件下。传统的复制粘贴增强往往会产生不现实的合成图像，因为前景提取不准确、场景几何不一致以及背景语义不匹配。为了解决这些限制，我们提出了一种多模态和深度感知的增强框架——深度复制粘贴，通过复制全身人体实例并将其粘贴到语义兼容的场景中，生成多样且物理上一致的面部检测训练样本。我们的方法首先使用BLIP和CLIP联合评估语义和视觉一致性，从而自动检索与给定前景人体最合适的背景图像。为了确保高质量的前景掩码以保留面部细节，我们结合了SAM3进行精确分割，并使用Depth-Anything仅提取未被遮挡的可见人体区域，防止在增强中使用损坏的面部纹理。为了实现几何现实感，我们引入了一种基于深度的滑动窗口放置机制，在背景深度图中搜索最佳的粘贴位置，以实现深度连续性和比例对齐。生成的合成图像表现出自然的深度关系和增强的视觉合理性。大量实验表明，深度复制粘贴提供了更多样且现实的训练数据，与传统的复制粘贴和无深度增强方法相比，在下游面部检测任务中取得了显著的性能提升。

Summary / 总结

The research aims to make face detection more robust by addressing the limitations of traditional copy-paste augmentation. It proposes Depth Copy Paste, a multimodal and depth-aware framework that generates realistic composites by copying full-body person instances and pasting them into semantically compatible scenes. The authors report significant improvements in downstream face detection compared with traditional copy-paste and depth-free augmentation methods.

研究旨在通过解决传统复制粘贴数据增强方法的局限性，增强在复杂条件下的面部检测系统的鲁棒性。提出的深度复制粘贴框架使用多模态和深度感知技术生成逼真且多样的训练样本。它利用BLIP和CLIP实现语义和视觉一致性，结合SAM3进行精确分割，并使用Depth-Anything提取非遮挡区域。深度引导滑动窗口机制确保几何现实感。实验表明，深度复制粘贴相比传统方法提高了面部检测性能。

MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition

Authors: Tim Cofala, Christian Kalfar, Jingge Xiao, Johanna Schrader, Michelle Tang, Wolfgang Nejdl

First: 2025-12-12T16:01:48+00:00 · Latest: 2025-12-12T16:01:48+00:00

Comments: 7 pages, 3 figures

Abstract

Therapeutic decision-making in clinical medicine constitutes a high-stakes domain in which AI guidance must contend with complex interactions among patient characteristics, disease processes, and pharmacological agents. Tasks such as drug recommendation, treatment planning, and adverse-effect prediction demand robust, multi-step reasoning grounded in reliable biomedical knowledge. Agentic AI methods, exemplified by TxAgent, address these challenges through iterative retrieval-augmented generation (RAG). TxAgent employs a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite (ToolUniverse), integrating FDA Drug API, OpenTargets, and Monarch resources to ensure access to current therapeutic information. In contrast to general-purpose RAG systems, medical applications impose stringent safety constraints, rendering the accuracy of both the reasoning trace and the sequence of tool invocations critical. These considerations motivate evaluation protocols treating token-level reasoning and tool-usage behaviors as explicit supervision signals. This work presents insights derived from our participation in the CURE-Bench NeurIPS 2025 Challenge, which benchmarks therapeutic-reasoning systems using metrics that assess correctness, tool utilization, and reasoning quality. We analyze how retrieval quality for function (tool) calls influences overall model performance and demonstrate performance gains achieved through improved tool-retrieval strategies. Our work was awarded the Excellence Award in Open Science. Complete information can be found at https://curebench.ai/.

中文标题/摘要

标题：MedAI：在NeurIPS CURE-Bench竞赛中评估TxAgent的治疗代理推理

临床医学中的治疗决策是一个高风险领域，AI指导需要面对患者特征、疾病过程和药物之间的复杂相互作用。药物推荐、治疗规划和不良反应预测等任务需要基于可靠生物医学知识的稳健多步推理。以TxAgent为代表的智能体AI方法通过迭代式检索增强生成（RAG）来应对这些挑战。TxAgent 使用微调后的Llama-3.1-8B模型，动态生成并执行对统一生物医学工具套件（ToolUniverse）的函数调用，整合FDA Drug API、OpenTargets和Monarch资源，以确保获取最新的治疗信息。与通用RAG系统不同，医疗应用施加了严格的安全约束，因此推理轨迹和工具调用序列的准确性都至关重要。这些考虑促使评估协议将token级推理和工具使用行为视为明确的监督信号。本文介绍了我们参加CURE-Bench NeurIPS 2025挑战赛所获得的见解，该挑战赛使用评估正确性、工具使用和推理质量的指标对治疗推理系统进行基准测试。我们分析了函数（工具）调用的检索质量如何影响整体模型性能，并展示了通过改进工具检索策略实现的性能提升。我们的工作获得了开放科学卓越奖。完整信息请参见 https://curebench.ai/ 。

Summary / 总结

This study evaluates TxAgent's therapeutic reasoning capabilities in the NeurIPS CURE-Bench competition, focusing on its ability to integrate biomedical knowledge and tool usage for drug recommendation and adverse-effect prediction. TxAgent uses a fine-tuned Llama-3.1-8B model that dynamically generates and executes function calls to a unified biomedical tool suite, ensuring access to current therapeutic information. The research highlights the importance of accurate reasoning and tool usage and demonstrates performance gains through improved tool-retrieval strategies. The work received the Excellence Award in Open Science.

该研究评估了TxAgent在NeurIPS CURE-Bench竞赛中的治疗推理能力。TxAgent使用一个微调后的Llama-3.1-8B模型生成并执行函数调用，整合了FDA Drug API、OpenTargets和Monarch资源。研究重点在于推理和工具使用的准确性，表明改进的工具检索策略可以提升整体性能。该工作获得了开放科学卓越奖。

Cross-modal Context-aware Learning for Visual Prompt Guided Multimodal Image Understanding in Remote Sensing

Authors: Xu Zhang, Jiabin Fang, Zhuoming Ding, Jin Yuan, Xuan Liu, Qianjun Zhang, Zhiyong Li

First: 2025-12-12T15:59:49+00:00 · Latest: 2025-12-12T15:59:49+00:00

Comments: 12 pages, 5 figures

Abstract

Recent advances in image understanding have enabled methods that leverage large language models for multimodal reasoning in remote sensing. However, existing approaches still struggle to steer models to the user-relevant regions when only simple, generic text prompts are available. Moreover, in large-scale aerial imagery many objects exhibit highly similar visual appearances and carry rich inter-object relationships, which further complicates accurate recognition. To address these challenges, we propose Cross-modal Context-aware Learning for Visual Prompt-Guided Multimodal Image Understanding (CLV-Net). CLV-Net lets users supply a simple visual cue, a bounding box, to indicate a region of interest, and uses that cue to guide the model to generate correlated segmentation masks and captions that faithfully reflect user intent. Central to our design is a Context-Aware Mask Decoder that models and integrates inter-object relationships to strengthen target representations and improve mask quality. In addition, we introduce a Semantic and Relationship Alignment module: a Cross-modal Semantic Consistency Loss enhances fine-grained discrimination among visually similar targets, while a Relationship Consistency Loss enforces alignment between textual relations and visual interactions. Comprehensive experiments on two benchmark datasets show that CLV-Net outperforms existing methods and establishes new state-of-the-art results. The model effectively captures user intent and produces precise, intention-aligned multimodal outputs.

中文标题/摘要

标题：面向遥感中视觉提示引导的多模态图像理解的跨模态上下文感知学习

图像理解的最新进展使方法能够利用大型语言模型在遥感中进行多模态推理。然而，现有的方法在仅提供简单的通用文本提示时，仍然难以引导模型关注用户相关区域。此外，在大规模航空图像中，许多对象具有高度相似的视觉外观，并携带丰富的对象间关系，这进一步增加了准确识别的复杂性。为了解决这些挑战，我们提出了跨模态上下文感知学习用于视觉提示引导的多模态图像理解（CLV-Net）。CLV-Net 允许用户提供一个简单的视觉提示，一个边界框，以指示感兴趣的区域，并使用该提示引导模型生成与用户意图一致的相关分割掩码和描述。我们设计的核心是上下文感知掩码解码器，它建模并整合对象间关系以增强目标表示并提高掩码质量。此外，我们引入了语义和关系对齐模块：跨模态语义一致性损失增强了对视觉相似目标的细粒度区分，而关系一致性损失强制文本关系与视觉交互之间的对齐。在两个基准数据集上的全面实验表明，CLV-Net 超过了现有方法并建立了新的最先进的结果。该模型有效地捕捉了用户意图并产生了精确、意图一致的多模态输出。

Summary / 总结

The research aims to improve multimodal image understanding in remote sensing, where simple, generic text prompts often fail to steer models to user-relevant regions. The proposed CLV-Net takes a bounding box as a visual prompt and combines a Context-Aware Mask Decoder with a Semantic and Relationship Alignment module to generate correlated segmentation masks and captions. Experiments on two benchmark datasets show that CLV-Net outperforms existing methods and sets new state-of-the-art results.

研究旨在通过解决仅用简单文本提示引导模型的问题，提高遥感中的多模态图像理解。CLV-Net 使用边界框等视觉提示生成准确的分割掩码和描述。关键发现表明，CLV-Net 在基准数据集上的表现优于现有方法，并建立了新的最佳水平，有效捕捉用户意图并生成精确的多模态输出。

The Emergence of Complex Behavior in Large-Scale Ecological Environments

Authors: Joseph Bejjani, Chase Van Amburg, Chengrui Wang, Chloe Huangyuan Su, Sarah M. Pratt, Yasin Mazloumi, Naeem Khoshnevis, Sham M. Kakade, Kianté Brantley, Aaron Walsman

First: 2025-10-21T02:03:25+00:00 · Latest: 2025-12-12T15:48:59+00:00

Comments: 33 pages, 23 figures, 12 tables, experiment code available at https://github.com/jbejjani2022/ecological-emergent-behavior

Abstract

We explore how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments. In our setting, agents are unsupervised and have no explicit rewards or learning objectives but instead evolve over time according to reproduction, mutation, and selection. As they act, agents also shape their environment and the population around them in an ongoing dynamic ecology. Our goal is not to optimize a single high-performance policy, but instead to examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures. We use modern hardware along with a new multi-agent simulator to scale the environment and population to sizes much larger than previously attempted, reaching populations of over 60,000 agents, each with their own evolved neural network policy. We identify various emergent behaviors such as long-range resource extraction, vision-based foraging, and predation that arise under competitive and survival pressures. We examine how sensing modalities and environmental scale affect the emergence of these behaviors and find that some of them appear only in sufficiently large environments and populations, and that larger scales increase the stability and consistency of these emergent behaviors. While there is a rich history of research in evolutionary settings, our scaling results on modern hardware provide promising new directions to explore ecology as an instrument of machine learning in an era of increasingly abundant computational resources and efficient machine frameworks. Experimental code is available at https://github.com/jbejjani2022/ecological-emergent-behavior.
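
A reward-free evolutionary loop of the kind described above can be sketched in a few lines. This toy version is not the authors' simulator: each agent carries a single scalar `gene` instead of a neural network policy, and the shared-food dynamics, the parameter names, and the `step` function are assumptions chosen only to show reproduction, mutation, and selection without an explicit objective.

```python
import random


def step(population, rng, food=10.0, cost=1.0,
         mutation_std=0.1, repro_threshold=3.0):
    """Advance the toy ecology by one tick.

    A fixed amount of food is shared in proportion to each agent's gene,
    every agent pays a living cost, agents at or above the energy threshold
    split into parent and mutated child, and agents at zero energy die.
    """
    total_gene = sum(a["gene"] for a in population) or 1.0
    next_population = []
    for agent in population:
        energy = agent["energy"] + food * agent["gene"] / total_gene - cost
        if energy <= 0:
            continue  # selection: the agent starves
        if energy >= repro_threshold:
            energy /= 2  # parent and child split the energy
            child_gene = max(1e-6, agent["gene"] + rng.gauss(0, mutation_std))
            next_population.append({"gene": child_gene, "energy": energy})
        next_population.append({"gene": agent["gene"], "energy": energy})
    return next_population


rng = random.Random(0)
founders = [{"gene": 1.0, "energy": 2.0} for _ in range(5)]
generation_1 = step(founders, rng)
```

No fitness function appears anywhere: genes that capture more of the shared food simply leave more descendants, which is the sense in which behavior emerges from competition rather than from an optimized objective.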

中文标题/摘要

标题：大型生态环境中复杂行为的涌现

我们探讨了物理尺度和种群规模如何塑造开放式生态环境中复杂行为的涌现。在我们的设置中，智能体不受监督，没有明确的奖励或学习目标，而是根据繁殖、突变和选择随时间进化。智能体在行动的同时，也在持续变化的动态生态中塑造其环境和周围的种群。我们的目标不是优化单一的高性能策略，而是研究在自然竞争和环境压力下，行为如何在大规模种群中涌现和进化。我们使用现代硬件和新的多智能体模拟器，将环境和种群扩展到远超以往尝试的规模，种群超过60,000个智能体，每个智能体都有自己进化而来的神经网络策略。我们识别出多种涌现行为，如长距离资源提取、基于视觉的觅食和捕食，这些行为在竞争和生存压力下出现。我们研究了感知模态和环境规模如何影响这些行为的涌现，发现其中一些行为仅在足够大的环境和种群中出现，而且更大的规模提高了这些涌现行为的稳定性和一致性。尽管进化设置方面已有丰富的研究历史，但在计算资源日益丰富、机器框架日益高效的时代，我们在现代硬件上的扩展结果为探索将生态学作为机器学习工具提供了有前景的新方向。实验代码可在 https://github.com/jbejjani2022/ecological-emergent-behavior 获取。

Summary / 总结

This study investigates how physical scale and population size influence the emergence of complex behaviors in ecological environments populated by unsupervised agents that evolve through reproduction, mutation, and selection. Using a new multi-agent simulator and modern hardware, the research reaches populations of over 60,000 agents and identifies emergent behaviors such as long-range resource extraction, vision-based foraging, and predation. The study finds that some of these behaviors appear only in sufficiently large environments and populations, and that larger scales increase their stability and consistency.

研究探讨了物理规模和种群大小如何影响在没有明确奖励或学习目标的情况下生态环境中复杂行为的出现。研究人员使用现代多智能体模拟器将环境和种群规模扩展到超过60,000个个体，每个个体都有其自己的进化神经网络策略。他们观察到了长距离资源提取、基于视觉的觅食和捕食等行为，这些行为在更大的种群中更为稳定和一致。研究结果表明，某些行为仅在足够大的环境中和种群中才能出现并稳定。这项研究为利用日益丰富的计算资源和高效的机器框架在机器学习中使用生态学提供了新的方向。

Bridging Streaming Continual Learning via In-Context Large Tabular Models

Authors: Afonso Lourenço, João Gama, Eric P. Xing, Goreti Marreiros

Venue: AAAI

First: 2025-12-12T15:47:26+00:00 · Latest: 2025-12-12T15:47:26+00:00

Comments: Streaming Continual Learning AAAI Bridge 2026

Abstract

In streaming scenarios, models must learn continuously, adapting to concept drifts without erasing previously acquired knowledge. However, existing research communities address these challenges in isolation. Continual Learning (CL) focuses on long-term retention and mitigating catastrophic forgetting, often without strict real-time constraints. Stream Learning (SL) emphasizes rapid, efficient adaptation to high-frequency data streams, but typically neglects forgetting. Recent efforts have tried to combine these paradigms, yet no clear algorithmic overlap exists. We argue that large in-context tabular models (LTMs) provide a natural bridge for Streaming Continual Learning (SCL). In our view, unbounded streams should be summarized on-the-fly into compact sketches that can be consumed by LTMs. This recovers the classical SL motivation of compressing massive streams with fixed-size guarantees, while simultaneously aligning with the experience-replay desiderata of CL. To clarify this bridge, we show how the SL and CL communities implicitly adopt a divide-to-conquer strategy to manage the tension between plasticity (performing well on the current distribution) and stability (retaining past knowledge), while also imposing a minimal complexity constraint that motivates diversification (avoiding redundancy in what is stored) and retrieval (re-prioritizing past information when needed). Within this perspective, we propose structuring SCL with LTMs around two core principles of data selection for in-context learning: (1) distribution matching, which balances plasticity and stability, and (2) distribution compression, which controls memory size through diversification and retrieval mechanisms.
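
One classical way to summarize an unbounded stream into a fixed-size memory is reservoir sampling, sketched below. The abstract does not name a specific sketching method, so this is a swapped-in illustration: the class name `ReservoirSketch` and its fields are hypothetical, and the stored rows stand in for the context that would be handed to an in-context tabular model.

```python
import random


class ReservoirSketch:
    """Fixed-size uniform sample of a stream (reservoir sampling, Algorithm R)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rows = []   # the compact sketch used as in-context examples
        self.seen = 0    # number of stream items observed so far
        self._rng = random.Random(seed)

    def update(self, row):
        """Observe one stream item using O(capacity) memory."""
        self.seen += 1
        if len(self.rows) < self.capacity:
            self.rows.append(row)
            return
        # Keep the new row with probability capacity / seen.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.rows[j] = row


sketch = ReservoirSketch(capacity=8)
for i in range(1000):
    sketch.update((i, i % 3))  # (feature, label) pairs from a toy stream
```

Uniform sampling treats old and new data alike, so it leans toward stability. The distribution matching and distribution compression principles in the abstract would replace this uniform rule with selection policies that also account for drift and redundancy.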

中文标题/摘要

标题：通过上下文大型表格模型桥接流式持续学习

在流式场景中，模型必须持续学习，适应概念漂移而不抹去之前获得的知识。然而，现有的研究社区各自孤立地应对这些挑战。持续学习（CL）侧重于长期保留并缓解灾难性遗忘，通常没有严格的实时约束。流式学习（SL）强调快速、高效地适应高频数据流，但通常忽视遗忘。最近的努力试图将这些范式结合起来，但尚不存在明确的算法重叠。我们认为，大型上下文表格模型（LTMs）为流式持续学习（SCL）提供了一座自然的桥梁。在我们看来，无界的流应被实时概括为可供LTMs使用的紧凑草图。这既恢复了经典SL以固定大小保证压缩海量数据流的动机，又与CL的经验回放需求保持一致。为了阐明这座桥梁，我们展示了SL和CL社区如何隐式采用分而治之的策略来管理可塑性（在当前分布上表现良好）和稳定性（保留过去知识）之间的张力，同时施加最小复杂度约束，从而促使多样化（避免存储内容冗余）和检索（在需要时重新优先考虑过去的信息）。从这个角度出发，我们建议围绕上下文学习中数据选择的两个核心原则来构建基于LTMs的SCL：（1）分布匹配，平衡可塑性和稳定性；（2）分布压缩，通过多样化和检索机制控制内存大小。

Summary / 总结

The paper addresses Streaming Continual Learning (SCL) by proposing large in-context tabular models (LTMs) as a bridge between Continual Learning (CL) and Stream Learning (SL). The idea is to summarize unbounded data streams on the fly into compact sketches that LTMs can consume, combining the efficiency of SL with the knowledge retention of CL. The authors argue that SCL with LTMs should be structured around two data-selection principles: distribution matching, which balances plasticity and stability, and distribution compression, which controls memory size through diversification and retrieval mechanisms.

论文通过提出使用大型上下文表格模型（LTMs）来解决流式持续学习（SCL）的挑战，旨在弥合持续学习（CL）和流式学习（SL）之间的差距。它认为LTMs可以将无界数据流压缩为紧凑的摘要，同时满足CL和SL的目标。论文的要点包括SL和CL社区在管理可塑性和稳定性时采用的隐式策略，以及为SCL提出的两个核心原则：分布匹配和分布压缩。

From Verification Burden to Trusted Collaboration: Design Goals for LLM-Assisted Literature Reviews

Authors: Brenda Nogueira, Werner Geyer, Andrew Anderson, Toby Jia-Jun Li, Dongwhi Kim, Nuno Moniz, Nitesh V. Chawla

First: 2025-12-12T15:38:34+00:00 · Latest: 2025-12-12T15:38:34+00:00

Abstract

Large Language Models (LLMs) are increasingly embedded in academic writing practices. Although numerous studies have explored how researchers employ these tools for scientific writing, their concrete implementation, limitations, and design challenges within the literature review process remain underexplored. In this paper, we report a user study with researchers across multiple disciplines to characterize current practices, benefits, and pain points in using LLMs to investigate related work. We identified three recurring gaps: (i) lack of trust in outputs, (ii) persistent verification burden, and (iii) requiring multiple tools. This motivates our proposal of six design goals and a high-level framework that operationalizes them through improved related papers visualization, verification at every step, and human-feedback alignment with generation-guided explanations. Overall, by grounding our work in the practical, day-to-day needs of researchers, we designed a framework that addresses these limitations and models real-world LLM-assisted writing, advancing trust through verifiable actions and fostering practical collaboration between researchers and AI systems.

中文标题/摘要

标题：从验证负担到信任合作：LLM辅助文献综述的设计目标

大型语言模型（LLMs）越来越多地嵌入到学术写作实践中。尽管已有许多研究探讨了研究人员如何使用这些工具进行科学写作，但它们在文献综述过程中的具体实现、局限性和设计挑战仍较少被研究。在本文中，我们报告了一项跨学科研究人员的用户研究，以描述使用LLMs调查相关工作时的当前实践、益处和痛点。我们确定了三个反复出现的缺口：(i) 对输出缺乏信任，(ii) 持续的验证负担，(iii) 需要多种工具。这促使我们提出六项设计目标和一个高层次框架，通过改进相关论文可视化、每一步验证和人类反馈与生成引导解释的对齐来实现它们。总体而言，通过将我们的工作扎根于研究人员的实际日常需求，我们设计了一个框架来解决这些局限性，并通过可验证的行为建立信任，促进研究人员与AI系统的实际合作。

Summary / 总结

This paper explores the challenges and design goals for using Large Language Models (LLMs) in literature reviews, based on a user study with researchers from various disciplines. The study highlights three main issues: lack of trust in outputs, persistent verification burden, and the need for multiple tools. To address these, the authors propose six design goals, including improved visualization of related papers, step-by-step verification, and alignment with human feedback through generation-guided explanations, aiming to enhance trust and practical collaboration between researchers and AI systems.

本文探讨了在学术文献综述中使用大型语言模型（LLMs）所面临的挑战，重点关注信任问题和验证负担。通过一项用户研究，作者确定了三个主要痛点：输出缺乏信任、持续的验证需求以及需要多种工具。为了解决这些问题，他们提出了六个设计目标和一个框架，该框架通过增强可视化、在每个步骤中确保验证以及将人类反馈与生成解释对齐来提高信任，并促进研究人员与AI系统的实际合作。

Kinetic Mining in Context: Few-Shot Action Synthesis via Text-to-Motion Distillation

Authors: Luca Cazzola, Ahed Alboody

First: 2025-12-12T15:32:28+00:00 · Latest: 2025-12-12T15:32:28+00:00

Abstract

The acquisition cost for large, annotated motion datasets remains a critical bottleneck for skeletal-based Human Activity Recognition (HAR). Although Text-to-Motion (T2M) generative models offer a compelling, scalable source of synthetic data, their training objectives, which emphasize general artistic motion, and dataset structures fundamentally differ from HAR's requirements for kinematically precise, class-discriminative actions. This disparity creates a significant domain gap, making generalist T2M models ill-equipped for generating motions suitable for HAR classifiers. To address this challenge, we propose KineMIC (Kinetic Mining In Context), a transfer learning framework for few-shot action synthesis. KineMIC adapts a T2M diffusion model to an HAR domain by hypothesizing that semantic correspondences in the text encoding space can provide soft supervision for kinematic distillation. We operationalize this via a kinetic mining strategy that leverages CLIP text embeddings to establish correspondences between sparse HAR labels and T2M source data. This process guides fine-tuning, transforming the generalist T2M backbone into a specialized few-shot Action-to-Motion generator. We validate KineMIC using HumanML3D as the source T2M dataset and a subset of NTU RGB+D 120 as the target HAR domain, randomly selecting just 10 samples per action class. Our approach generates significantly more coherent motions, providing a robust data augmentation source that delivers an improvement of 23.1 accuracy points. Animated illustrations and supplementary materials are available at (https://lucazzola.github.io/publications/kinemic).
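
The mining step, matching sparse HAR labels to T2M captions in text-embedding space, can be illustrated with cosine similarity over toy vectors. Everything concrete here is assumed: the three-dimensional vectors are hand-written stand-ins for CLIP text embeddings, and `cosine` and `mine_correspondences` are hypothetical helpers rather than the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_correspondences(label_embs, caption_embs, top_k=2):
    """For each HAR label, return the top_k source captions ranked by similarity."""
    matches = {}
    for label, label_vec in label_embs.items():
        ranked = sorted(caption_embs,
                        key=lambda cap: cosine(label_vec, caption_embs[cap]),
                        reverse=True)
        matches[label] = ranked[:top_k]
    return matches


# Toy stand-ins for text embeddings (a real pipeline would use a text encoder).
labels = {"drink water": [1.0, 0.0, 0.2]}
captions = {
    "a person drinks from a cup": [0.9, 0.1, 0.1],
    "a person kicks a ball": [0.0, 1.0, 0.0],
    "a person raises a hand to the mouth": [0.7, 0.2, 0.3],
}
mined = mine_correspondences(labels, captions, top_k=2)
```

The retrieved source motions would then act as soft supervision during fine-tuning, complementing the 10 real samples per class.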

中文标题/摘要

标题：上下文中的运动挖掘：基于文本到运动蒸馏的少样本动作合成

大型标注运动数据集的获取成本仍然是基于骨架的人体活动识别（HAR）的关键瓶颈。尽管文本到运动（T2M）生成模型提供了具有吸引力且可扩展的合成数据来源，但它们的训练目标强调一般的艺术性运动，其数据集结构也与HAR对运动学精确、具有类别区分性的动作的要求存在根本差异。这种差异造成了显著的领域差距，使得通用的T2M模型难以生成适合HAR分类器的动作。为了解决这一挑战，我们提出了KineMIC（Kinetic Mining In Context，上下文中的运动挖掘），一种用于少样本动作合成的迁移学习框架。KineMIC假设文本编码空间中的语义对应可以为运动学蒸馏提供软监督，从而将T2M扩散模型适配到HAR领域。我们通过一种运动挖掘策略来实现这一点，该策略利用CLIP文本嵌入在稀疏的HAR标签与T2M源数据之间建立对应关系。这一过程指导微调，将通用的T2M主干转化为专门的少样本动作到运动生成器。我们使用HumanML3D作为源T2M数据集、NTU RGB+D 120的一个子集作为目标HAR领域来验证KineMIC，每个动作类别仅随机选择10个样本。我们的方法生成了明显更连贯的动作，提供了稳健的数据增强来源，使准确率提高了23.1个百分点。动画示例和补充材料可在(https://lucazzola.github.io/publications/kinemic)获取。

Summary / 总结

The paper addresses the cost of acquiring large, annotated motion datasets for Human Activity Recognition (HAR) by proposing KineMIC, a transfer learning framework that adapts a Text-to-Motion (T2M) diffusion model to generate kinematically precise actions suitable for HAR classifiers. By leveraging semantic correspondences in text embeddings, KineMIC fine-tunes the T2M model to produce more coherent motions, achieving an improvement of 23.1 accuracy points when used for data augmentation with only 10 samples per action class.

论文旨在使用Text-to-Motion (T2M) 模型生成适合人体活动识别（HAR）的精确动作。KineMIC 是一种迁移学习框架，通过利用文本嵌入中的语义对应关系来引导微调，将通用的T2M模型转变为专门的少样本动作生成器。该方法生成的动作更为连贯，在NTU RGB+D 120的子集上，仅使用每个动作类别的10个样本，HAR准确率提高了23.1个百分点。

An effective control of large systems of active particles: An application to evacuation problem

Authors: Albina Klepach, Egor E. Nuzhin, Alexey A. Tsukanov, Nikolay V. Brilliantov

First: 2025-09-24T10:27:45+00:00 · Latest: 2025-12-12T14:51:16+00:00

Abstract

Manipulation of large systems of active particles is a serious challenge across diverse domains, including crowd management, control of robotic swarms, and coordinated material transport. The development of advanced control strategies for complex scenarios is hindered, however, by the lack of scalability and robustness of existing methods, in particular because each agent needs individual control. One possible solution involves controlling a system through a leader or a group of leaders, which other agents tend to follow. Using such an approach, we develop an effective control strategy for a leader, combining reinforcement learning (RL) with artificial forces acting on the system. To describe the guidance of active particles by a leader, we introduce the generalized Vicsek model. This novel method is then applied to the problem of the effective evacuation of large groups of people from hazardous places by a robot-rescuer (leader). We demonstrate that, while a straightforward application of RL yields suboptimal results even for advanced architectures, our approach provides a robust and efficient evacuation strategy. The source code supporting this study is publicly available at: https://github.com/cinemere/evacuation.
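
For readers unfamiliar with the Vicsek model, the sketch below shows one update step of the standard alignment rule with an added leader term. The abstract does not give the form of the generalized model, so the leader coupling (a weighted extra contribution to the local average heading) and all parameter names are assumptions made purely for illustration.

```python
import math
import random


def vicsek_step(positions, headings, leader_pos, leader_heading,
                radius=1.0, speed=0.1, noise=0.0, leader_weight=3.0, rng=None):
    """One update of a Vicsek-style model with an assumed leader term.

    Each particle adopts the average heading of all particles within
    `radius` (itself included); a leader within `radius` adds its heading
    with weight `leader_weight`. Uniform angular noise is then applied and
    every particle moves at constant `speed`.
    """
    rng = rng or random.Random(0)
    new_headings = []
    for x, y in positions:
        sum_x = sum_y = 0.0
        for (ox, oy), other_heading in zip(positions, headings):
            if math.hypot(ox - x, oy - y) <= radius:
                sum_x += math.cos(other_heading)
                sum_y += math.sin(other_heading)
        if math.hypot(leader_pos[0] - x, leader_pos[1] - y) <= radius:
            sum_x += leader_weight * math.cos(leader_heading)
            sum_y += leader_weight * math.sin(leader_heading)
        jitter = rng.uniform(-noise / 2, noise / 2)
        new_headings.append(math.atan2(sum_y, sum_x) + jitter)
    new_positions = [(x + speed * math.cos(h), y + speed * math.sin(h))
                     for (x, y), h in zip(positions, new_headings)]
    return new_positions, new_headings


particles = [(0.0, 0.0), (0.5, 0.0)]
angles = [0.0, math.pi / 2]
moved, turned = vicsek_step(particles, angles,
                            leader_pos=(0.0, 0.1), leader_heading=math.pi / 2)
```

In this toy setup the two particles alone would settle on a heading of π/4, while the weighted leader pulls both toward its own heading of π/2. In the paper's setting, an RL policy would choose the leader's motion; here it is fixed by hand.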

中文标题/摘要

标题：大型活性粒子系统的有效控制：以疏散问题为例

对大型活性粒子系统的操控在多个领域都是一项严峻的挑战，包括人群管理、机器人群的控制以及协调物质运输等。然而，由于现有方法缺乏可扩展性和鲁棒性，特别是在需要对每个代理进行单独控制的情况下，开发适用于复杂场景的高级控制策略受到了阻碍。一种可能的解决方案是通过领导者或一组领导者来控制系统，其他代理倾向于跟随领导者。使用这种方法，我们开发了一种结合强化学习（RL）和作用于系统的虚拟力的有效控制策略。为了描述领导者对活性粒子的引导，我们引入了广义维谢克模型。然后，我们将这种方法应用于机器人救援者（领导者）有效疏散大量人群的问题。我们证明，即使对于先进的架构，直接应用RL也会导致次优结果，而我们的方法则提供了稳健且高效的疏散策略。支持本研究的源代码可在以下网址获取：https://github.com/cinemere/evacuation.

Summary / 总结

The paper addresses the challenge of controlling large systems of active particles, such as crowds or robotic swarms, by developing a control strategy that uses a leader to guide the system. The method combines reinforcement learning with artificial forces and is applied to the evacuation problem. While simple RL approaches are suboptimal, the proposed approach offers a robust and efficient evacuation strategy for large groups of people.

论文旨在通过使用领导者来引导大型活性粒子系统（如人群或机器人集群）来解决控制难题。方法结合了强化学习和人工力，并应用于疏散问题。虽然简单的RL方法效果不佳，但提出的方案能够为大规模人群提供稳健且高效的疏散策略。