Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System
Authors: Hsiang-Wei Huang, Junbin Lu, Kuang-Ming Chen, Jenq-Neng Hwang
First: 2026-01-13T18:59:17+00:00 · Latest: 2026-01-13T18:59:17+00:00
Comments: In submission. The first two authors contributed equally
Abstract
In this work, we explore the dynamics of Large Language Model (LLM) agent reviewers in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers' adaptive review strategy that exploits our Elo system without improving review effort. Our code is available at https://github.com/hsiangwei0903/EloReview.
中文标题/摘要
标题:基于Elo排名评审系统的大型语言模型代理评审员动态建模
在本研究中,我们使用实际会议论文提交数据,探索基于Elo排名的评审系统中大型语言模型(LLM)代理评审员的动态。多个具有不同人设的LLM代理评审员在区域主席的主持下进行多轮评审互动。我们比较了基准设置与包含Elo评分和评审员记忆的条件。我们的模拟结果展示了几个有趣的研究发现,包括如何引入Elo评分提高区域主席决策准确性,以及评审员如何利用我们的Elo系统调整评审策略而不增加评审努力。我们的代码可在https://github.com/hsiangwei0903/EloReview获取。
Summary / 总结
This study investigates the dynamics of Large Language Model (LLM) agent reviewers in an Elo-ranked review system using real-world conference submissions. Multiple LLM reviewers with different personas engage in multi-round interactions moderated by an Area Chair. The research compares a baseline setting with conditions that include Elo ratings and reviewer memory. Key findings include improved Area Chair decision accuracy with Elo ratings and reviewers' adaptive strategies that exploit the Elo system without increasing effort. The code is available on GitHub.
本研究使用真实会议论文提交数据,探讨大型语言模型(LLM)代理评审人在采用Elo排名的评审系统中的动态。多个具有不同人设的LLM评审人参与多轮评审互动,由领域主席主持。研究将基准设置与包含Elo评分和评审人记忆的条件进行比较。主要发现包括Elo评分提高了领域主席的决策准确性,以及评审人利用Elo系统调整策略但未增加评审努力。代码可在GitHub上获得。
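To make the idea of Elo-rated reviewers concrete, here is a minimal sketch of the standard Elo update applied to two reviewers. It is an illustration only: the abstract does not say how the paper turns Area Chair judgments into rating changes, so the pairwise comparison, the K-factor of 32, the starting rating of 1500, and the names `expected_score`, `update_elo`, `reviewer_a`, and `reviewer_b` are all assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A's review was judged better, 0.0 if worse, 0.5 for a tie.
    The K-factor k is an assumed value, not one taken from the paper.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Hypothetical round: the Area Chair prefers reviewer A's review over reviewer B's.
ratings = {"reviewer_a": 1500.0, "reviewer_b": 1500.0}
ratings["reviewer_a"], ratings["reviewer_b"] = update_elo(
    ratings["reviewer_a"], ratings["reviewer_b"], score_a=1.0
)
print(ratings)
```

Because the update is zero-sum, a reviewer can gain rating only at another reviewer's expense. That is one way a rating signal can be gamed without any increase in review effort, the kind of behavior the abstract reports.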
Motion Attribution for Video Generation
Authors: Xindi Wu, Despoina Paschalidou, Jun Gao, Antonio Torralba, Laura Leal-Taixé, Olga Russakovsky, Sanja Fidler, Jonathan Lorraine
First: 2026-01-13T18:59:09+00:00 · Latest: 2026-01-13T18:59:09+00:00
Comments: See the project website at https://research.nvidia.com/labs/sil/projects/MOTIVE/
Abstract
Despite the rapid progress of video generation models, the role of data in influencing motion is poorly understood. We present Motive (MOTIon attribution for Video gEneration), a motion-centric, gradient-based data attribution framework that scales to modern, large, high-quality video datasets and models. We use this to study which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks, yielding efficient and scalable motion-specific influence computation. On text-to-video models, Motive identifies clips that strongly affect motion and guides data curation that improves temporal consistency and physical plausibility. With Motive-selected high-influence data, our method improves both motion smoothness and dynamic degree on VBench, achieving a 74.1% human preference win rate compared with the pretrained base model. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models and to use it to curate fine-tuning data.
中文标题/摘要
标题:视频生成中的运动归因
尽管视频生成模型取得了快速进展,但数据对运动的影响作用尚不明确。我们提出了Motive(运动归因于视频生成),一种以运动为中心、基于梯度的数据归因框架,可扩展到现代大型高质量视频数据集和模型。我们使用该框架研究哪些微调片段能改善或降低时间动态性。Motive通过运动加权损失掩码将时间动态性与静态外观分离,从而实现高效且可扩展的运动特定影响计算。在文本到视频模型上,Motive识别出对运动有强烈影响的片段,并指导数据整理以提高时间一致性和物理合理性。使用Motive选择的高影响数据,我们的方法在VBench上提高了运动平滑度和动态程度,与预训练基模型相比,获得了74.1%的人类偏好胜率。据我们所知,这是第一个在视频生成模型中归因运动而非视觉外观的框架,并且使用它来整理微调数据。
Summary / 总结
The research aims to understand how training data influences motion in video generation models. It introduces Motive, a motion-centric, gradient-based data attribution framework for studying which fine-tuning clips improve or degrade temporal dynamics. Motive isolates temporal dynamics from static appearance via motion-weighted loss masks. Fine-tuning on Motive-selected high-influence data improves motion smoothness and dynamic degree on VBench and achieves a 74.1% human preference win rate over the pretrained base model.
研究旨在理解数据在影响视频生成模型中运动角色的作用。提出了Motive,一种基于梯度的运动中心框架,用于研究微调片段对时间动态的影响。关键发现表明,Motive 提高了运动流畅性和动态程度,在 VBench 上的人类偏好胜出率为 74.1%,优于预训练基模型。这是第一个在视频生成模型中归因于运动而非视觉外观的框架,并用于数据整理。
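The motion-weighted loss mask mentioned in the Motive abstract can be illustrated with a toy sketch. The abstract does not say how motion is measured or how the mask enters the attribution gradients, so everything here is an assumption: absolute frame differences stand in for a motion estimate, squared error stands in for the training loss, and `motion_weights` and `motion_weighted_loss` are hypothetical names.

```python
def motion_weights(frames, eps=1e-8):
    """Per-pixel weights from absolute frame differences (a crude motion proxy).

    frames: list of T frames, each an H x W list of floats.
    Returns T-1 weight maps; map t-1 describes the change from frame t-1 to t.
    """
    weights = []
    for prev, curr in zip(frames, frames[1:]):
        diff = [[abs(c - p) for p, c in zip(row_p, row_c)]
                for row_p, row_c in zip(prev, curr)]
        peak = max(max(row) for row in diff)
        weights.append([[d / (peak + eps) for d in row] for row in diff])
    return weights


def motion_weighted_loss(pred, target, weights, eps=1e-8):
    """Squared error averaged with motion weights, so static pixels count little."""
    num = 0.0
    den = 0.0
    for t, w_map in enumerate(weights, start=1):  # weights[t-1] aligns with frame t
        for y, w_row in enumerate(w_map):
            for x, w in enumerate(w_row):
                num += w * (pred[t][y][x] - target[t][y][x]) ** 2
                den += w
    return num / (den + eps)


# Toy clip: two 1x2 frames; only the right pixel changes between frames.
target = [[[0.0, 0.0]], [[0.0, 1.0]]]
w = motion_weights(target)
loss_static_error = motion_weighted_loss([[[0.0, 0.0]], [[0.5, 1.0]]], target, w)
loss_motion_error = motion_weighted_loss([[[0.0, 0.0]], [[0.0, 0.5]]], target, w)
print(loss_static_error, loss_motion_error)
```

In this toy case an error on the static pixel is ignored while the same-sized error on the moving pixel is counted in full. A gradient-based attribution score computed from such a loss would therefore reflect motion rather than appearance.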
FilmSceneDesigner: Chaining Set Design for Procedural Film Scene Generation
Authors: Zhifeng Xie, Keyi Zhang, Yiye Yan, Yuling Guo, Fan Yang, Jiting Zhou, Mengtian Li
First: 2025-11-24T14:00:40+00:00 · Latest: 2026-01-13T18:51:21+00:00
Abstract
Film set design plays a pivotal role in cinematic storytelling and shaping the visual atmosphere. However, the traditional process depends on expert-driven manual modeling, which is labor-intensive and time-consuming. To address this issue, we introduce FilmSceneDesigner, an automated scene generation system that emulates professional film set design workflow. Given a natural language description, including scene type, historical period, and style, we design an agent-based chaining framework to generate structured parameters aligned with film set design workflow, guided by prompt strategies that ensure parameter accuracy and coherence. On the other hand, we propose a procedural generation pipeline which executes a series of dedicated functions with the structured parameters for floorplan and structure generation, material assignment, door and window placement, and object retrieval and layout, ultimately constructing a complete film scene from scratch. Moreover, to enhance cinematic realism and asset diversity, we construct SetDepot-Pro, a curated dataset of 6,862 film-specific 3D assets and 733 materials. Experimental results and human evaluations demonstrate that our system produces structurally sound scenes with strong cinematic fidelity, supporting downstream tasks such as virtual previs, construction drawing and mood board creation.
中文标题/摘要
标题:FilmSceneDesigner:程序化电影场景生成中的场景设计链式连接
电影场景设计在电影叙事和视觉氛围塑造中起着关键作用。然而,传统的流程依赖于专家驱动的手动建模,这既耗时又费力。为了解决这一问题,我们引入了FilmSceneDesigner,这是一种自动场景生成系统,模拟了专业的电影场景设计工作流程。给定自然语言描述,包括场景类型、历史时期和风格,我们设计了一个基于代理的链式框架,生成与电影场景设计工作流程相匹配的结构化参数,通过提示策略确保参数的准确性和连贯性。另一方面,我们提出了一种程序化生成流水线,该流水线执行一系列专用功能,使用结构化参数进行平面图和结构生成、材料分配、门窗布置以及对象检索和布局,最终从头构建一个完整的电影场景。此外,为了增强电影的真实感和资产多样性,我们构建了SetDepot-Pro,这是一个包含6,862个电影特定的3D资产和733种材料的精选数据集。实验结果和人类评估表明,我们的系统生成了结构合理且具有强烈电影真实感的场景,支持下游任务如虚拟预览、施工图纸和情绪板创建。
Summary / 总结
FilmSceneDesigner is an automated system that generates film scenes from natural language descriptions by chaining set design processes. It uses an agent-based framework to generate structured parameters and a procedural pipeline to create floorplans, assign materials, place doors and windows, and lay out objects. The system leverages SetDepot-Pro, a dataset of 6,862 film-specific 3D assets and 733 materials, to enhance realism. Experimental results show that the generated scenes are structurally sound and have high cinematic fidelity, supporting various downstream tasks.
FilmSceneDesigner 是一个自动化系统,旨在解决传统电影布景设计劳动密集的问题。它使用基于代理的链式框架根据自然语言描述生成结构化参数,并使用程序生成流水线来创建平面图、结构、材料和对象布局。该系统利用包含6,862个电影特定3D资产和733种材料的SetDepot-Pro数据集来增强现实感。实验结果表明,FilmSceneDesigner生成的布景结构合理且具有强烈的电影忠实度,适用于虚拟预览、施工图纸和情绪板创建等下游任务。
MemRec: Collaborative Memory-Augmented Agentic Recommender System
Authors: Weixin Chen, Yuhan Zhao, Jingyuan Huang, Zihe Ye, Clark Mingxuan Ju, Tong Zhao, Neil Shah, Li Chen, Yongfeng Zhang
First: 2026-01-13T18:51:16+00:00 · Latest: 2026-01-13T18:51:16+00:00
Abstract
The evolution of recommender systems has shifted preference storage from rating matrices and dense embeddings to semantic memory in the agentic era. Yet existing agents rely on isolated memory, overlooking crucial collaborative signals. Bridging this gap is hindered by the dual challenges of distilling vast graph contexts without overwhelming reasoning agents with cognitive load, and evolving the collaborative memory efficiently without incurring prohibitive computational costs. To address this, we propose MemRec, a framework that architecturally decouples reasoning from memory management to enable efficient collaborative augmentation. MemRec introduces a dedicated, cost-effective LM_Mem to manage a dynamic collaborative memory graph, serving synthesized, high-signal context to a downstream LLM_Rec. The framework operates via a practical pipeline featuring efficient retrieval and cost-effective asynchronous graph propagation that evolves memory in the background. Extensive experiments on four benchmarks demonstrate that MemRec achieves state-of-the-art performance. Furthermore, architectural analysis confirms its flexibility, establishing a new Pareto frontier that balances reasoning quality, cost, and privacy through support for diverse deployments, including local open-source models. Code: https://github.com/rutgerswiselab/memrec and Homepage: https://memrec.weixinchen.com
中文标题/摘要
标题:MemRec:协作记忆增强自主推荐系统
推荐系统的发展已将偏好存储从评分矩阵和密集嵌入转向自主时代中的语义记忆。然而,现有的代理依赖于孤立的记忆,忽视了关键的协作信号。弥合这一差距受到双重挑战的阻碍:一是如何在不使推理代理的认知负担过重的情况下提炼庞大的图上下文,二是如何高效地进化协作记忆而不产生高昂的计算成本。为了解决这个问题,我们提出了MemRec,这是一种架构上将推理与内存管理解耦的框架,以实现高效的协作增强。MemRec引入了一种专用且成本效益高的LM_Mem来管理动态的协作记忆图,并向下游的LLM_Rec提供合成的高信号上下文。该框架通过高效的检索和成本效益高的异步图传播操作,实现背景中的记忆进化。在四个基准上的广泛实验表明,MemRec达到了最先进的性能。此外,架构分析证实了其灵活性,通过支持多种部署,包括本地开源模型,建立了推理质量、成本和隐私的新帕累托前沿。代码:https://github.com/rutgerswiselab/memrec 和主页:https://memrec.weixinchen.com
Summary / 总结
MemRec augments agentic recommender systems with collaborative memory, addressing the limitation that existing agents rely on isolated memory. The framework decouples reasoning from memory management: a cost-effective LM_Mem manages a dynamic collaborative memory graph and serves synthesized, high-signal context to a downstream LLM_Rec. Experiments on four benchmarks show state-of-the-art performance, and architectural analysis confirms the framework's flexibility, establishing a new Pareto frontier that balances reasoning quality, cost, and privacy.
MemRec旨在通过整合协作记忆到智能推荐系统中,解决孤立记忆的局限性。它提出了一种框架,将推理与记忆管理分离,使用成本效益高的LM_Mem管理动态协作记忆图,并将其提供给下游的LLM_Rec。在四个基准上的实验表明,MemRec在性能上超越了现有方法,并且架构分析证实了其灵活性和成本效益,为推理质量、成本和隐私设定了新的帕累托前沿。
Reasoning Matters for 3D Visual Grounding
Authors: Hsiang-Wei Huang, Kuang-Ming Chen, Wenhao Chai, Cheng-Yen Yang, Jen-Hao Cheng, Jenq-Neng Hwang
Venue: CVPR
First: 2026-01-13T18:48:41+00:00 · Latest: 2026-01-13T18:48:41+00:00
Comments: 2025 CVPR Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments
Abstract
The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, a fundamental task in 3D understanding, remains challenging due to the limited reasoning ability of recent 3D visual grounding models. Most current methods combine a text encoder and a visual feature encoder to generate fused cross-modal features and predict the referred object. These models often require supervised training on extensive 3D annotation data. Recent research also focuses on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and is not proportional to the data collection cost. In this work, we propose a 3D visual grounding data pipeline that automatically synthesizes 3D visual grounding data along with the corresponding reasoning process. Additionally, we leverage the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.
中文标题/摘要
标题:3D视觉定位中的推理很重要
近年来,具有强大推理能力的大语言模型(LLMs)在数学、编程和科学发现等多个领域推动了研究进展。与此同时,作为三维理解基本任务的3D视觉定位仍然具有挑战性,因为当前的3D视觉定位模型推理能力有限。大多数现有方法结合了文本编码器和视觉特征编码器以生成跨模态融合特征并预测引用对象。这些模型通常需要在大量的3D标注数据上进行监督训练。另一方面,最近的研究也集中在通过扩展合成数据来训练更强的3D视觉定位LLMs,然而性能提升有限且不成比例于数据收集成本。在本工作中,我们提出了一种3D视觉定位数据管道,能够自动合成3D视觉定位数据及其相应的推理过程。此外,我们利用生成的数据对LLM进行微调,并引入了Reason3DVG-8B,这是一种强3D视觉定位LLM,仅使用3D-GRAND训练数据的1.6%就超越了之前的基于LLM的方法,证明了我们数据的有效性和推理在3D视觉定位中的重要性。
Summary / 总结
This work addresses the challenge of 3D visual grounding by proposing a data pipeline that automatically synthesizes 3D visual grounding data along with reasoning processes. The authors use this data to fine-tune a Large Language Model (LLM) and introduce Reason3DVG-8B, which outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, highlighting the importance of reasoning in 3D visual grounding.
本文提出了一种数据管道,能够自动生成包含推理过程的3D视觉定位数据,并用于微调大型语言模型,生成了Reason3DVG-8B,该模型仅使用之前方法3D-GRAND训练数据的1.6%就超越了它们,这表明推理在3D视觉定位中的重要性。
Multiplex Thinking: Reasoning via Token-wise Branch-and-Merge
Authors: Yao Tang, Li Dong, Yaru Hao, Qingxiu Dong, Furu Wei, Jiatao Gu
First: 2026-01-13T18:48:00+00:00 · Latest: 2026-01-13T18:48:00+00:00
Comments: 21 pages. Code available at https://github.com/GMLR-Penn/Multiplex-Thinking
Abstract
Large language models often solve complex reasoning tasks more effectively with Chain-of-Thought (CoT), but at the cost of long, low-bandwidth token sequences. Humans, by contrast, often reason softly by maintaining a distribution over plausible next steps. Motivated by this, we propose Multiplex Thinking, a stochastic soft reasoning mechanism that, at each thinking step, samples K candidate tokens and aggregates their embeddings into a single continuous multiplex token. This preserves the vocabulary embedding prior and the sampling dynamics of standard discrete generation, while inducing a tractable probability distribution over multiplex rollouts. Consequently, multiplex trajectories can be directly optimized with on-policy reinforcement learning (RL). Importantly, Multiplex Thinking is self-adaptive: when the model is confident, the multiplex token is nearly discrete and behaves like standard CoT; when it is uncertain, it compactly represents multiple plausible next steps without increasing sequence length. Across challenging math reasoning benchmarks, Multiplex Thinking consistently outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024, while producing shorter sequences. The code and checkpoints are available at https://github.com/GMLR-Penn/Multiplex-Thinking.
中文标题/摘要
标题:多路思考:基于令牌级分支与合并的推理
大型语言模型通常通过思维链(CoT)更有效地解决复杂推理任务,但代价是长而低带宽的令牌序列。相比之下,人类往往通过保持可能下一步的分布来进行软推理。受此启发,我们提出了一种多路思考机制,该机制在每次思考步骤中采样K个候选令牌,并将它们的嵌入聚合为一个连续的多路令牌。这保留了词汇嵌入先验和标准离散生成的采样动态,同时诱导了多路展开的可处理概率分布。因此,多路轨迹可以直接通过在线强化学习(RL)进行优化。重要的是,多路思考是自适应的:当模型自信时,多路令牌几乎离散,类似于标准CoT;当它不确定时,它可以紧凑地表示多个可能的下一步,而不增加序列长度。在具有挑战性的数学推理基准测试中,多路思考在从Pass@1到Pass@1024的所有指标上都优于强大的离散CoT和RL基线,同时生成更短的序列。代码和检查点可在https://github.com/GMLR-Penn/Multiplex-Thinking获取。
Summary / 总结
The research aims to improve the reasoning capabilities of large language models by proposing Multiplex Thinking, a stochastic soft reasoning mechanism that samples K candidate tokens at each thinking step and merges their embeddings into a single continuous multiplex token. The multiplex token is nearly discrete when the model is confident and compactly represents several plausible next steps when it is uncertain, without increasing sequence length. On challenging math reasoning benchmarks, the method outperforms strong discrete CoT and RL baselines from Pass@1 through Pass@1024 while producing shorter sequences.
研究旨在通过提出Multiplex Thinking机制来提高大型语言模型的推理效率和效果,该机制在每一步采样K个候选令牌并将其嵌入合并为一个单一的多路复用令牌。这种方法保留了离散生成的词汇嵌入先验和采样动态,同时允许对多路复用展开的概率分布进行可处理的表示。实验结果表明,Multiplex Thinking在各种数学推理基准测试中优于强大的离散链式思考和强化学习基线,生成更短的序列,同时在从Pass@1到Pass@1024的通过率上表现出色。
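The branch-and-merge step described above (sample K candidate tokens, then merge their embeddings into one continuous token) can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the abstract does not specify the aggregation rule, so a plain average of the sampled embeddings is assumed, and `multiplex_token`, the three-token vocabulary, and the two-dimensional embeddings are hypothetical.

```python
import random


def multiplex_token(probs, embeddings, k, rng):
    """Sample k token ids from probs and average their embeddings into one vector.

    Averaging is an assumed aggregation rule chosen for illustration.
    """
    ids = rng.choices(range(len(probs)), weights=probs, k=k)
    dim = len(embeddings[0])
    merged = [sum(embeddings[i][d] for i in ids) / k for d in range(dim)]
    return ids, merged


rng = random.Random(0)
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy 3-token vocabulary

# Confident distribution: every sample is token 0, so the merged vector
# equals that token's embedding and the step behaves like discrete decoding.
confident_ids, confident_vec = multiplex_token([1.0, 0.0, 0.0], embeddings, k=3, rng=rng)

# Uncertain distribution: the merged vector blends tokens 0 and 1 in one position.
uncertain_ids, uncertain_vec = multiplex_token([0.5, 0.5, 0.0], embeddings, k=4, rng=rng)
print(confident_vec, uncertain_ids, uncertain_vec)
```

The two calls mirror the self-adaptive behavior in the abstract: a peaked distribution collapses to an ordinary token embedding, while a flat one packs several candidates into a single sequence position.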
LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services
Authors: Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Hao Chen, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su
First: 2025-12-08T11:12:39+00:00 · Latest: 2026-01-13T18:44:27+00:00
Abstract
Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explore vertical domains with unique challenges. In this work, we focus on local life services and introduce LocalSearchBench, which encompasses diverse and complex business scenarios. Real-world queries in this domain are often ambiguous and require multi-hop reasoning across merchants and products, remaining challenging and not fully addressed. As the first comprehensive benchmark for agentic search in local life services, LocalSearchBench comprises a database of over 1.3M merchant entries across 6 service categories and 9 major cities, and 900 multi-hop QA tasks from real user queries that require multi-step reasoning. We also developed LocalPlayground, a unified environment integrating multiple tools for LRM interaction. Experiments show that even state-of-the-art LRMs struggle on LocalSearchBench: the best model (DeepSeek-V3.2) achieves only 35.60% correctness, and most models have issues with completeness (average 60.32%) and faithfulness (average 30.72%). This highlights the need for specialized benchmarks and domain-specific agent training in local life services. Code, Benchmark, and Leaderboard are available at https://localsearchbench.github.io/.
中文标题/摘要
标题:LocalSearchBench:在现实本地生活服务中评估自主搜索系统
大型推理模型LRMs的最新进展使自主搜索系统能够进行复杂的多步推理,跨越多个来源。然而,大多数研究集中在通用信息检索上,很少探索具有独特挑战的垂直领域。在本工作中,我们专注于本地生活服务,并引入了LocalSearchBench,涵盖了多种复杂的企业场景。该领域的真实查询往往具有歧义性,需要跨越商家和产品进行多跳推理,这仍然是一个挑战,尚未完全解决。作为第一个全面的本地生活服务自主搜索基准,LocalSearchBench 包含了来自 6 个服务类别和 9 个主要城市的超过 130 万商家条目数据库,以及 900 个来自真实用户查询的多跳问答任务,需要多步推理。我们还开发了LocalPlayground,这是一个集成多种工具供LRMs交互的统一环境。实验表明,即使是最先进的LRMs在LocalSearchBench上也表现不佳:最佳模型(DeepSeek-V3.2)的正确率为35.60%,大多数模型在完整性(平均60.32%)和忠实性(平均30.72%)方面存在问题。这突显了在本地生活服务中需要专门的基准和领域特定代理训练的需求。代码、基准和排行榜可在 https://localsearchbench.github.io/ 获取。
Summary / 总结
LocalSearchBench benchmarks agentic search systems in local life services by introducing a comprehensive database and multi-hop QA tasks. The study highlights the challenges of multi-step reasoning across merchants and products, showing that state-of-the-art large reasoning models achieve only 35.60% correctness and struggle with completeness and faithfulness. This underscores the need for specialized benchmarks and domain-specific training in local life services.
LocalSearchBench 通过引入包含超过 130 万商户条目和 900 个多跳 QA 任务的综合数据集,对本地生活服务中的智能搜索系统进行了基准测试。研究显示,最先进的大型推理模型表现不佳,最佳模型的正确率为仅 35.60%,且在完整性与真实性方面存在问题。这强调了在本地生活服务中需要专门的基准测试和领域特定的智能体训练。
APEX-SWE
Authors: Abhi Kottamasu, Akul Datta, Aakash Barthwal, Chirag Mahapatra, Ajay Arun, Adarsh Hiremath, Brendan Foody, Bertie Vidgen
First: 2026-01-13T18:44:08+00:00 · Latest: 2026-01-13T18:44:08+00:00
Abstract
We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering work: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25%. Our analysis shows that strong performance is primarily driven by epistemic reasoning, defined as the ability to distinguish between assumptions and verified facts, combined with agency to resolve uncertainty prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
中文标题/摘要
标题:APEX-SWE
我们介绍了软件工程人工智能生产力指数(APEX-SWE),这是一个基准,用于评估前沿AI模型是否能够执行具有经济价值的软件工程工作。与现有的专注于狭窄、明确任务的评估不同,APEX-SWE 评估了两种新型任务类型,这些任务类型反映了实际的软件工程工作:(1)集成任务(n=100),需要构建跨异构云原语、业务应用和基础设施即代码服务的端到端系统;(2)可观测性任务(n=100),需要使用日志、仪表板等遥测信号进行生产故障调试,以及无结构上下文。我们在APEX-SWE 上评估了八种前沿模型。Gemini 3 Pro(思考 = 高)表现最佳,得分为25%。我们的分析表明,强大的表现主要是由先验推理驱动的,即区分假设和验证事实的能力,结合在行动前解决不确定性的能力。我们开源了APEX-SWE 评估框架和一个开发集(n=50)。
Summary / 总结
APEX-SWE is a benchmark to evaluate AI models in executing economically valuable software engineering tasks, including 100 integration tasks and 100 observability tasks. Eight frontier models were evaluated, with Gemini 3 Pro (Thinking = High) achieving the highest Pass@1 score of 25%. The results indicate that strong performance is mainly due to epistemic reasoning and the ability to resolve uncertainty before acting.
APEX-SWE 是一个基准,用于评估 AI 模型执行经济上有价值的软件工程任务的能力,包括集成和可观测性任务。八种领先的 AI 模型进行了评估,Gemini 3 Pro 表现最佳,Pass@1 得分为 25%。主要发现是,强大的性能主要归功于知识性的推理和在行动前解决不确定性的能力。
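For readers unfamiliar with the Pass@1 metric quoted above, the sketch below shows the widely used unbiased Pass@k estimator: given n sampled attempts at a task of which c pass, it estimates the probability that at least one of k attempts passes. Whether APEX-SWE computes its score with this estimator, and how many attempts it samples per task, is not stated in the abstract, so treat this purely as background; `pass_at_k` is an illustrative name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n attempts of which c passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n attempts contains no passing attempt.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must include a passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimate reduces to the plain pass rate c / n.
print(pass_at_k(n=4, c=1, k=1))  # 0.25
print(pass_at_k(n=4, c=1, k=2))  # 0.5
```

A benchmark-level score is then the mean of the per-task estimates.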
Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning
Authors: Shao-Ting Chiu, Siu Wun Cheung, Ulisses Braga-Neto, Chak Shing Lee, Rui Peng Li
First: 2026-01-12T17:45:31+00:00 · Latest: 2026-01-13T18:39:13+00:00
Abstract
Kolmogorov-Arnold Networks (KANs) have shown strong potential for efficiently approximating complex nonlinear functions. However, the original KAN formulation relies on B-spline basis functions, which incur substantial computational overhead due to De Boor's algorithm. To address this limitation, recent work has explored alternative basis functions such as radial basis functions (RBFs) that can improve computational efficiency and flexibility. Yet, standard RBF-KANs often sacrifice accuracy relative to the original KAN design. In this work, we propose Free-RBF-KAN, an RBF-based KAN architecture that incorporates adaptive learning grids and trainable smoothness to close this performance gap. Our method employs freely learnable RBF shapes that dynamically align grid representations with activation patterns, enabling expressive and adaptive function approximation. Additionally, we treat smoothness as a kernel parameter optimized jointly with network weights, without increasing computational complexity. We provide a general universality proof for RBF-KANs, which encompasses our Free-RBF-KAN formulation. Through a broad set of experiments, including multiscale function approximation, physics-informed machine learning, and PDE solution operator learning, Free-RBF-KAN achieves accuracy comparable to the original B-spline-based KAN while delivering faster training and inference. These results highlight Free-RBF-KAN as a compelling balance between computational efficiency and adaptive resolution, particularly for high-dimensional structured modeling tasks.
中文标题/摘要
标题:Free-RBF-KAN:具有自适应径向基函数的柯尔莫哥洛夫-阿诺尔德网络,用于高效函数学习
柯尔莫哥洛夫-阿诺尔德网络(KANs)在高效逼近复杂非线性函数方面显示出强大的潜力。然而,原始的KAN公式依赖于B样条基函数,由于德布尔算法,这会导致大量的计算开销。为了解决这一限制,最近的工作探索了替代基函数,如径向基函数(RBFs),以提高计算效率和灵活性。然而,标准的RBF-KAN通常在准确度上不如原始的KAN设计。在本文中,我们提出了一种基于RBF的KAN架构——Free-RBF-KAN,该架构结合了自适应学习网格和可训练的平滑度,以弥补这一性能差距。我们的方法使用可自由学习的RBF形状,动态地使网格表示与激活模式对齐,从而实现表达性和自适应的函数逼近。此外,我们将平滑度视为与网络权重联合优化的核参数,而不增加计算复杂度。我们为RBF-KAN提供了一般性通用性证明,涵盖了我们的Free-RBF-KAN公式。通过一系列广泛的实验,包括多尺度函数逼近、基于物理的机器学习和PDE解算器学习,Free-RBF-KAN在训练和推理速度上都比基于B样条的KAN更快,同时保持了相当的准确性。这些结果突显了Free-RBF-KAN在计算效率和自适应分辨率之间的平衡,特别是在高维结构化建模任务中具有吸引力。
Summary / 总结
Free-RBF-KAN is a Kolmogorov-Arnold Network (KAN) architecture that uses adaptive radial basis functions (RBFs) to improve computational efficiency and accuracy. By incorporating learnable RBF shapes and optimizing smoothness as a kernel parameter, Free-RBF-KAN achieves comparable accuracy to the original B-spline-based KAN while offering faster training and inference. It excels in tasks such as multiscale function approximation, physics-informed machine learning, and partial differential equation (PDE) solution operator learning.
Free-RBF-KAN 是一种新型的基于 RBF 的 Kolmogorov-Arnold 网络 (KAN),通过自适应学习网格和可训练的平滑性来提高计算效率和准确性。它动态调整 RBF 形状以与激活模式对齐,相比传统 KAN 提供更快的训练和推理速度。实验结果表明,Free-RBF-KAN 在高维结构化建模任务中与基于 B-spline 的 KAN 相比具有相当的准确性,同时更为高效。
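As a rough picture of an RBF-based KAN edge, the sketch below evaluates one univariate edge function as a weighted sum of Gaussian bumps and sums the edge outputs per output node. It is a forward-pass illustration under assumptions: the Gaussian kernel, the per-basis width standing in for the trainable smoothness, and the names `rbf_edge` and `kan_layer` are not taken from the paper. In a trained model the centers, widths, and coefficients would all be updated by gradient descent.

```python
import math


def rbf_edge(x, centers, widths, coeffs):
    """One univariate edge function: a weighted sum of Gaussian bumps.

    centers play the role of a free (non-uniform) grid, widths control
    smoothness, and coeffs are the usual linear weights.
    """
    return sum(a * math.exp(-((x - c) / h) ** 2)
               for c, h, a in zip(centers, widths, coeffs))


def kan_layer(inputs, edges):
    """edges[j][i] holds (centers, widths, coeffs) for the edge input i -> output j.

    Each output is the sum of its incoming edge functions, as in a KAN layer.
    """
    return [sum(rbf_edge(x, *edges[j][i]) for i, x in enumerate(inputs))
            for j in range(len(edges))]


# Toy layer: two inputs, one output, one Gaussian per edge.
edges = [[([0.0], [1.0], [2.0]), ([1.0], [0.5], [3.0])]]
output = kan_layer([0.0, 1.0], edges)
print(output)  # [5.0]
```

Evaluating a Gaussian is a single exponential per basis function, whereas B-splines need a recursive evaluation. That difference is the source of the speed advantage the abstract describes.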
Near-perfect photo-ID of the Hula painted frog with zero-shot deep local-feature matching
Authors: Maayan Yesharim, R. G. Bina Perl, Uri Roll, Sarig Gafny, Eli Geffen, Yoav Ram
First: 2026-01-13T18:32:43+00:00 · Latest: 2026-01-13T18:32:43+00:00
Comments: 18 pages, 4 figures
Abstract
Accurate individual identification is essential for monitoring rare amphibians, yet invasive marking is often unsuitable for critically endangered species. We evaluate state-of-the-art computer-vision methods for photographic re-identification of the Hula painted frog (Latonia nigriventer) using 1,233 ventral images from 191 individuals collected during 2013-2020 capture-recapture surveys. We compare deep local-feature matching in a zero-shot setting with deep global-feature embedding models. The local-feature pipeline achieves 98% top-1 closed-set identification accuracy, outperforming all global-feature models; fine-tuning improves the best global-feature model to 60% top-1 (91% top-10) but remains below local matching. To combine scalability with accuracy, we implement a two-stage workflow in which a fine-tuned global-feature model retrieves a short candidate list that is re-ranked by local-feature matching, reducing end-to-end runtime from 6.5-7.8 hours to ~38 minutes while maintaining ~96% top-1 closed-set accuracy on the labeled dataset. Separation of match scores between same- and different-individual pairs supports thresholding for open-set identification, enabling practical handling of novel individuals. We deploy this pipeline as a web application for routine field use, providing rapid, standardized, non-invasive identification to support conservation monitoring and capture-recapture analyses. Overall, in this species, zero-shot deep local-feature matching outperformed global-feature embedding and provides a strong default for photo-identification.
中文标题/摘要
标题:近乎完美的照片识别胡拉涂蛙,零样本深度局部特征匹配
准确的个体识别对于监测稀有两栖动物至关重要,但对于极度濒危物种而言,侵入性标记往往不合适。我们评估了最先进的计算机视觉方法,使用2013-2020年间捕获重捕调查中收集的191个个体的1,233张腹面图像,对胡拉涂蛙(Latonia nigriventer)进行照片再识别。我们将深度局部特征匹配在零样本设置中与深度全局特征嵌入模型进行了比较。局部特征管道实现了98%的闭集识别准确率,优于所有全局特征模型;微调将最佳全局特征模型的准确率提高到60%(91%的前10名),但仍低于局部匹配。为了兼顾可扩展性和准确性,我们实现了一种两阶段工作流,在该工作流中,微调后的全局特征模型检索一个短候选列表,然后通过局部特征匹配重新排名,将端到端运行时间从6.5-7.8小时缩短到约38分钟,同时在标记数据集上保持约96%的闭集识别准确率。相同个体和不同个体配对之间的匹配分数分离支持开放集识别的阈值设定,使处理新个体成为可能。我们将此管道部署为网络应用程序,用于常规野外使用,提供快速、标准化、非侵入性的识别,以支持保护监测和捕获重捕分析。总体而言,在该物种中,零样本深度局部特征匹配优于全局特征嵌入,并为照片识别提供了强有力的标准。
Summary / 总结
The study aims to develop a non-invasive method for identifying individual Hula painted frogs (Latonia nigriventer) using photographic re-identification. The researchers compared deep local-feature matching and deep global-feature embedding models. The local-feature pipeline achieved 98% top-1 closed-set identification accuracy, surpassing all global-feature models. A two-stage workflow combining fine-tuned global-feature retrieval with local-feature re-ranking reduced the end-to-end runtime while maintaining high accuracy. This method supports practical handling of novel individuals and is deployed as a web application for field use in conservation monitoring.
研究旨在通过照片识别方法开发一种非侵入性方法来识别稀有两栖动物胡拉涂鸦蛙(Latonia nigriventer)的个体。研究人员比较了深度局部特征匹配和深度全局特征嵌入模型,发现局部特征管道实现了98%的闭集top-1识别准确率,优于所有全局特征模型。结合微调的全局特征检索与局部特征重新排名的两阶段工作流减少了端到端的运行时间,同时保持了高准确率,支持开放集识别和保护监测。
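The two-stage workflow described above (cheap global-feature retrieval followed by expensive local-feature re-ranking, with a score threshold for open-set decisions) can be sketched as follows. This is a schematic under assumptions: cosine similarity is assumed for the global stage, `local_score` is a hypothetical callable standing in for the local-feature matcher, and the frog IDs, vectors, scores, and thresholds are made-up toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def two_stage_identify(query_vec, gallery, local_score, shortlist_size=10, threshold=None):
    """Return (best_id_or_None, ranked_shortlist).

    gallery: dict mapping individual id -> global embedding.
    local_score: callable(id) -> match score against the query image
                 (hypothetical stand-in for local-feature matching).
    threshold: if set, a best score below it is treated as a novel individual.
    """
    # Stage 1: cheap global retrieval keeps only the most similar candidates.
    shortlist = sorted(gallery, key=lambda gid: cosine(query_vec, gallery[gid]),
                       reverse=True)[:shortlist_size]
    # Stage 2: expensive local matching runs on the shortlist only.
    ranked = sorted(shortlist, key=local_score, reverse=True)
    best = ranked[0] if ranked else None
    if best is not None and threshold is not None and local_score(best) < threshold:
        return None, ranked
    return best, ranked


gallery = {"frog_a": [1.0, 0.0], "frog_b": [0.9, 0.1], "frog_c": [0.0, 1.0]}
toy_scores = {"frog_a": 12, "frog_b": 87, "frog_c": 99}  # made-up match scores
best, ranked = two_stage_identify([1.0, 0.05], gallery, toy_scores.get,
                                  shortlist_size=2, threshold=50)
print(best, ranked)
```

Note that `frog_c` has the highest toy local score but is never examined, because stage 1 drops it from the shortlist. This is the trade-off behind the reported change from 98% to about 96% top-1 accuracy in exchange for a much shorter runtime.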
A Vision for Multisensory Intelligence: Sensing, Science, and Synergy
Authors: Paul Pu Liang
First: 2026-01-08T03:46:20+00:00 · Latest: 2026-01-13T18:24:14+00:00
Abstract
Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals, from physiological and tactile cues on the body to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, the field should develop a principled science for quantifying multimodal heterogeneity and interactions, build unified modeling architectures and representations, and understand cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of the latest advances from the Multisensory Intelligence group at the MIT Media Lab; see https://mit-mi.github.io/.
中文标题/摘要
标题:多感官智能的愿景:感知、科学与协同
我们对世界的体验是多感官的,涵盖了语言、视觉、听觉、触觉、味觉和嗅觉的综合。然而,人工智能主要在文本、视觉和音频等数字模态上取得了进展。本文概述了未来十年多感官人工智能的研究愿景。这一新技术可以改变人类与人工智能的体验和互动方式,通过将人工智能与人类感官以及从身体的生理和触觉提示到家庭、城市和环境中的物理和社会信号的丰富信号连接起来。我们概述了该领域必须通过感知、科学和协同作用这三个相互关联的主题来推进。首先,感知研究应扩展人工智能如何以更丰富的方式捕捉世界,超越数字媒介。其次,发展一种原则性的科学来量化多模态异质性和相互作用,开发统一的建模架构和表示,并理解跨模态转移。最后,我们提出了新的技术挑战,以学习模态之间的协同作用以及人类与人工智能之间的协同作用,涵盖多感官整合、对齐、推理、生成、泛化和体验。与这篇愿景论文相伴的是来自麻省理工媒体实验室多感官智能小组的多个项目、资源和最新进展的演示,详见https://mit-mi.github.io/。
The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning
Authors: Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang
First: 2026-01-09T18:39:01+00:00 · Latest: 2026-01-13T18:21:01+00:00
Comments: Preprint
Abstract
Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning by imitating humans or non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.
中文标题/摘要
标题:思维的分子结构:长链推理拓扑映射
大型语言模型(LLMs)往往难以从人类或非长链推理(Long CoT)LLMs模仿中学习有效的长链推理。为了理解这一现象,我们提出,有效的可学习的长链推理轨迹具有在统一视图中形成的稳定分子状结构,这些结构由三种交互类型组成:深度推理(共价型)、自我反思(氢键型)和自我探索(范德华力型)。对精简轨迹的分析表明,这些结构源自长链推理微调,而非关键词模仿。我们引入了有效语义异构体,并表明仅促进快速熵收敛的键支持稳定的长链推理学习,而结构竞争会损害训练。基于这些发现,我们提出了Mole-Syn方法,这是一种分布转移图方法,用于引导有效长链推理结构的合成,从而在基准测试中提升性能和强化学习稳定性。
Summary / 总结
The research aims to understand why large language models struggle with learning effective long chain-of-thought reasoning. It proposes that stable molecular-like structures, formed by three types of interactions (Deep-Reasoning, Self-Reflection, and Self-Exploration), are key to effective Long CoT learning. The study finds that these structures emerge from Long CoT fine-tuning rather than keyword imitation. Mole-Syn, a method that guides the synthesis of these effective structures, is introduced, showing improved performance and reinforcement learning stability across benchmarks.
研究旨在理解大型语言模型为何难以学习有效的长链推理。研究提出,长链推理轨迹中的稳定分子结构,由深度推理、自我反思和自我探索三种交互形成,对于学习至关重要。分析表明,这些结构来源于长链推理微调,而非关键词模仿。研究引入了Mole-Syn方法,该方法合成有效的长链推理结构,提升了跨基准的性能和强化学习稳定性。
SafePro: Evaluating the Safety of Professional-Level AI Agents
Authors: Kaiwen Zhou, Shreedhar Jangam, Ashwin Nagarajan, Tejas Polu, Suhas Oruganti, Chengzhi Liu, Ching-Chen Kuo, Yuting Zheng, Sravana Narayanaraju, Xin Eric Wang
First: 2026-01-10T19:53:09+00:00 · Latest: 2026-01-13T18:20:33+00:00
Abstract
Large language model-based agents are rapidly evolving from simple conversational assistants into autonomous systems capable of performing complex, professional-level tasks in various domains. While these advancements promise significant productivity gains, they also introduce critical safety risks that remain under-explored. Existing safety evaluations primarily focus on simple, daily assistance tasks, failing to capture the intricate decision-making processes and potential consequences of misaligned behaviors in professional settings. To address this gap, we introduce SafePro, a comprehensive benchmark designed to evaluate the safety alignment of AI agents performing professional activities. SafePro features a dataset of high-complexity tasks across diverse professional domains with safety risks, developed through a rigorous iterative creation and review process. Our evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and uncovers new unsafe behaviors in professional contexts. We further show that these models exhibit both insufficient safety judgment and weak safety alignment when executing complex professional tasks. In addition, we investigate safety mitigation strategies for improving agent safety in these scenarios and observe encouraging improvements. Together, our findings highlight the urgent need for robust safety mechanisms tailored to the next generation of professional AI agents.
中文标题/摘要
标题:SafePro:评估专业级AI代理的安全性
基于大型语言模型的代理正在迅速从简单的对话助手演变为能够在各种领域执行复杂专业级任务的自主系统。尽管这些进步有望带来显著的生产率提升,但也引入了关键的安全风险,这些风险目前尚未得到充分探索。现有的安全性评估主要集中在简单的日常辅助任务上,未能捕捉到专业环境中复杂决策过程和潜在的不一致行为后果。为了解决这一差距,我们引入了**SafePro**,这是一个全面的基准测试,旨在评估执行专业活动的AI代理的安全对齐情况。SafePro 包含了一个跨多种专业领域的高复杂度任务数据集,这些任务具有安全风险,并通过严格的迭代创建和审查过程开发。我们对最先进的AI模型的评估揭示了显著的安全漏洞,并在专业环境中发现了新的不安全行为。我们进一步表明,这些模型在执行复杂专业任务时表现出不足的安全判断和弱的安全对齐。此外,我们还研究了提高这些场景中代理安全性的安全缓解策略,并观察到令人鼓舞的改进。总之,我们的研究结果突显了为下一代专业级AI代理设计稳健的安全机制的迫切需求。
Summary / 总结
SafePro is a benchmark designed to evaluate the safety alignment of AI agents performing professional tasks. It addresses the gap in existing safety evaluations by focusing on complex, professional-level tasks. The evaluation of state-of-the-art AI models reveals significant safety vulnerabilities and unsafe behaviors in professional contexts. The study also investigates safety mitigation strategies, showing encouraging improvements in agent safety. This work underscores the need for robust safety mechanisms for professional AI agents.
SafePro 是一个全面的基准,旨在评估 AI 代理在执行专业活动时的安全对齐情况。它通过关注各种领域的高复杂度任务来解决与先进 AI 系统相关的重大安全风险。对最先进的 AI 模型的评估揭示了显著的安全漏洞和新的专业情境中的不安全行为,表明安全判断不足和安全对齐薄弱。研究还探讨了安全缓解策略,显示出有希望的改进。这项工作强调了为专业 AI 代理配备稳健安全机制的迫切需要。
FastFLUX: Pruning FLUX with Block-wise Replacement and Sandwich Training
Authors: Fuhan Cai, Yong Guo, Jie Li, Wenbo Li, Jian Chen, Xiangzhong Fang
First: 2025-06-10T20:48:30+00:00 · Latest: 2026-01-13T18:20:18+00:00
Comments: 14 pages
Abstract
Recent advancements in text-to-image (T2I) generation have led to the emergence of highly expressive models such as diffusion transformers (DiTs), exemplified by FLUX. However, their massive parameter sizes lead to slow inference, high memory usage, and poor deployability. Existing acceleration methods (e.g., single-step distillation and attention pruning) often suffer from significant performance degradation and incur substantial training costs. To address these limitations, we propose FastFLUX, an architecture-level pruning framework designed to enhance the inference efficiency of FLUX. At its core is the Block-wise Replacement with Linear Layers (BRLL) method, which replaces structurally complex residual branches in ResBlocks with lightweight linear layers while preserving the original shortcut connections for stability. Furthermore, we introduce Sandwich Training (ST), a localized fine-tuning strategy that leverages LoRA to supervise neighboring blocks, mitigating performance drops caused by structural replacement. Experiments show that our FastFLUX maintains high image quality under both qualitative and quantitative evaluations, while significantly improving inference speed, even with 20% of the hierarchy pruned. Our code will be available soon.
中文标题/摘要
标题:FastFLUX:使用块级替换和三明治训练精简FLUX
最近在文本到图像(T2I)生成方面的进展催生了诸如扩散变换器(DiTs)等高度表达性的模型,以FLUX为代表。然而,这些模型庞大的参数量导致了推理速度慢、高内存使用和部署困难的问题。现有的加速方法(例如单步蒸馏和注意力剪枝)往往会导致性能显著下降,并且需要大量的训练成本。为了解决这些问题,我们提出了FastFLUX,这是一种架构级的剪枝框架,旨在提高FLUX的推理效率。其核心是块级替换加线性层(BRLL)方法,该方法用轻量级的线性层替换残差块中的结构复杂分支,同时保留原始的捷径连接以保持稳定性。此外,我们还引入了三明治训练(ST),这是一种局部微调策略,利用LoRA监督相邻块,以减轻结构替换导致的性能下降。实验表明,我们的FastFLUX在定性和定量评估中均能保持高质量的图像生成,同时显著提高推理速度,即使有20%的层级被剪枝。我们的代码将很快开源。
Summary / 总结
FastFLUX is an architecture-level pruning framework for enhancing the inference efficiency of the diffusion transformer model FLUX. It uses Block-wise Replacement with Linear Layers (BRLL) to replace complex residual branches with lightweight linear layers while maintaining stability through shortcut connections. Additionally, Sandwich Training (ST) is introduced to fine-tune neighboring blocks using LoRA, reducing performance degradation. Experiments demonstrate that FastFLUX maintains high image quality and significantly improves inference speed, even when 20% of the model hierarchy is pruned.
FastFLUX 是一种用于提升 FLUX(一种高度表达性的扩散变换器模型)推理效率的架构级剪枝框架。它使用 Block-wise Replacement with Linear Layers (BRLL) 方法将复杂的残差分支替换为轻量级的线性层,同时通过保留快捷连接来保持稳定性。此外,引入了 Sandwich Training (ST) 方法,利用 LoRA 对相邻块进行局部微调,减少结构替换引起的性能下降。实验表明,FastFLUX 在保持高质量图像的同时,显著提高了推理速度,即使有 20% 的层级被剪枝。
Uncovering Political Bias in Large Language Models using Parliamentary Voting Records
Authors: Jieying Chen, Karen de Jong, Andreas Poole, Jan Burakowski, Elena Elderson Nosti, Joep Windt, Chendi Wang
First: 2026-01-13T18:18:25+00:00 · Latest: 2026-01-13T18:18:25+00:00
Abstract
As large language models (LLMs) become deeply embedded in digital platforms and decision-making systems, concerns about their political biases have grown. While substantial work has examined social biases such as gender and race, systematic studies of political bias remain limited, despite their direct societal impact. This paper introduces a general methodology for constructing political bias benchmarks by aligning model-generated voting predictions with verified parliamentary voting records. We instantiate this methodology in three national case studies: PoliBiasNL (2,701 Dutch parliamentary motions and votes from 15 political parties), PoliBiasNO (10,584 motions and votes from 9 Norwegian parties), and PoliBiasES (2,480 motions and votes from 10 Spanish parties). Across these benchmarks, we assess ideological tendencies and political entity bias in LLM behavior. As part of our evaluation framework, we also propose a method to visualize the ideology of LLMs and political parties in a shared two-dimensional CHES (Chapel Hill Expert Survey) space by linking their voting-based positions to the CHES dimensions, enabling direct and interpretable comparisons between models and real-world political actors. Our experiments reveal fine-grained ideological distinctions: state-of-the-art LLMs consistently display left-leaning or centrist tendencies, alongside clear negative biases toward right-conservative parties. These findings highlight the value of transparent, cross-national evaluation grounded in real parliamentary behavior for understanding and auditing political bias in modern LLMs.
中文标题/摘要
标题:利用议会投票记录揭示大型语言模型中的政治偏见
随着大型语言模型(LLMs)在数字平台和决策系统中的深入应用,对其政治偏见的担忧日益增加。尽管已经进行了大量关于社会偏见(如性别和种族)的研究,但对政治偏见的系统研究仍然有限,尽管它们对社会有直接影响。本文介绍了一种通过将模型生成的投票预测与验证的议会投票记录对齐来构建政治偏见基准的一般方法。我们在三个国家案例研究中实例化了这种方法:PoliBiasNL(15个政党2,701份荷兰议会动议和投票记录)、PoliBiasNO(9个政党10,584份挪威动议和投票记录)和PoliBiasES(10个政党2,480份西班牙动议和投票记录)。在这些基准中,我们评估了LLM行为中的意识形态倾向和政治实体偏见。作为评估框架的一部分,我们还提出了一种方法,通过将基于投票的位置与CHES(查佩尔希尔专家调查)维度链接起来,在共享的二维CHES空间中可视化LLM和政治党的意识形态,从而实现模型与现实世界政治行为之间的直接和可解释的比较。我们的实验揭示了细微的意识形态差异:最先进的LLMs始终表现出左倾或中间派倾向,并且对右翼保守政党有明显的负面偏见。这些发现突显了基于实际议会行为的透明、跨国评估对于理解现代LLM中的政治偏见的价值和审计的重要性。
Summary / 总结
This paper addresses the growing concern about political biases in large language models (LLMs) by developing a methodology to align model-generated voting predictions with parliamentary voting records. The study evaluates ideological tendencies and political entity bias in LLM behavior across three national case studies: PoliBiasNL, PoliBiasNO, and PoliBiasES. The experiments show that state-of-the-art LLMs exhibit left-leaning or centrist tendencies and have clear negative biases towards right-conservative parties, emphasizing the need for transparent, cross-national evaluation to understand and audit political bias in LLMs.
该论文通过开发一种将模型生成的投票预测与议会投票记录对齐的方法,来应对大型语言模型(LLM)中日益增长的政治偏见问题。研究在三个国家案例研究中评估了LLM的行为中的意识形态倾向和政治实体偏见:PoliBiasNL、PoliBiasNO和PoliBiasES。实验结果显示,最先进的LLM表现出左倾或中间派的倾向,并对右保守政党有明显的负面偏见,强调了进行透明的跨国评估以理解LLM中的政治偏见的重要性。
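The core comparison in this entry (model-generated votes checked against recorded party votes) can be sketched in a few lines of Python. This is only an illustration: the `agreement_by_party` helper, the party names, and the toy votes are hypothetical and are not taken from the PoliBias datasets or the authors' code.

```python
def agreement_by_party(model_votes, party_votes):
    """Fraction of motions on which the model's vote matches each party's recorded vote.

    model_votes: dict mapping motion id -> "for" / "against" (hypothetical format)
    party_votes: dict mapping party -> dict of motion id -> "for" / "against"
    """
    rates = {}
    for party, record in party_votes.items():
        # Only compare motions for which both a model vote and a party vote exist.
        shared = [m for m in record if m in model_votes]
        if not shared:
            continue
        matches = sum(1 for m in shared if record[m] == model_votes[m])
        rates[party] = matches / len(shared)
    return rates


# Made-up toy data, for illustration only.
model_votes = {"m1": "for", "m2": "against", "m3": "for", "m4": "for"}
party_votes = {
    "PartyA": {"m1": "for", "m2": "against", "m3": "for", "m4": "against"},
    "PartyB": {"m1": "against", "m2": "for", "m3": "for", "m4": "against"},
}
rates = agreement_by_party(model_votes, party_votes)
print(rates)
```

A higher agreement rate with one party than another is the kind of raw signal such a benchmark could aggregate; how the paper maps these positions into CHES space is not covered by this sketch.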
Incentivizing Multi-Tenant Split Federated Learning for Foundation Models at the Network Edge
Authors: Songyuan Li, Jia Hu, Geyong Min, Haojun Huang
First: 2025-03-06T21:06:27+00:00 · Latest: 2026-01-13T18:16:48+00:00
Comments: Accepted for publication in IEEE/ACM Transactions on Networking. Index Terms: Foundation models, Edge computing, Split federated learning, Multi-tenant system, Incentive mechanism
Abstract
Foundation models (FMs) such as GPT-4 exhibit exceptional generative capabilities across diverse downstream tasks through fine-tuning. Split Federated Learning (SFL) facilitates privacy-preserving FM fine-tuning on resource-constrained local devices by offloading partial FM computations to edge servers, enabling device-edge synergistic fine-tuning. Practical edge networks often host multiple SFL tenants to support diversified downstream tasks. However, existing research primarily focuses on single-tenant SFL scenarios, and lacks tailored incentive mechanisms for multi-tenant settings, which are essential to effectively coordinate self-interested local devices for participation in various downstream tasks, ensuring that each SFL tenant's distinct FM fine-tuning requirements (e.g., FM types, performance targets, and fine-tuning deadlines) are met. To address this gap, we propose a novel Price-Incentive Mechanism (PRINCE) that guides multiple SFL tenants to offer strategic price incentives, which solicit high-quality device participation for efficient FM fine-tuning. Specifically, we first develop a bias-resilient global SFL model aggregation scheme to eliminate model biases caused by independent device participation. We then derive a rigorous SFL convergence bound to evaluate the contributions of heterogeneous devices to FM performance improvements, guiding the incentive strategies of SFL tenants. Furthermore, we model inter-tenant device competition as a congestion game for Stackelberg equilibrium (SE) analysis, deriving each SFL tenant's optimal incentive strategy. Extensive simulations involving four representative SFL tenant types (ViT, BERT, Whisper, and LLaMA) across diverse data modalities (text, images, and audio) demonstrate that PRINCE accelerates FM fine-tuning by up to 3.07x compared to state-of-the-art approaches, while consistently meeting fine-tuning performance targets.
中文标题/摘要
标题:在网络边缘激励多租户分割联邦学习以促进基础模型
基础模型(FMs)如GPT-4通过微调在多种下游任务中表现出卓越的生成能力。分割联邦学习(SFL)通过将部分FM计算卸载到边缘服务器,使资源受限的本地设备能够进行协同微调,从而实现隐私保护。实际的边缘网络通常支持多个SFL租户以支持多样化的下游任务。然而,现有研究主要集中在单租户SFL场景上,缺乏针对多租户设置的定制激励机制,这对于协调自利的本地设备参与各种下游任务至关重要,确保每个SFL租户独特的FM微调需求(如FM类型、性能目标和微调截止日期)得到满足。为解决这一问题,我们提出了一种新的价格激励机制(PRINCE),引导多个SFL租户提供战略价格激励,以促进高质量设备的参与,从而高效地进行FM微调。具体而言,我们首先开发了一种抗偏差的全局SFL模型聚合方案,以消除独立设备参与导致的模型偏差。然后,我们推导出严格的SFL收敛界,以评估异构设备对FM性能改进的贡献,指导SFL租户的激励策略。此外,我们将租户间设备竞争建模为拥堵博弈,进行Stackelberg均衡(SE)分析,推导出每个SFL租户的最优激励策略。针对四种代表性SFL租户类型(ViT、BERT、Whisper和LLaMA)在不同数据模态(文本、图像和音频)下的广泛仿真表明,与最先进的方法相比,PRINCE可将FM微调加速3.07倍,同时始终满足微调性能目标。
Summary / 总结
The research aims to address the lack of incentive mechanisms for multi-tenant split federated learning (SFL) in edge networks, which is crucial for coordinating self-interested local devices. The proposed Price-Incentive Mechanism (PRINCE) guides SFL tenants to offer strategic price incentives, ensuring efficient fine-tuning of foundation models (FMs) while meeting diverse fine-tuning requirements. PRINCE includes a bias-resilient global SFL model aggregation scheme and a congestion game analysis for Stackelberg equilibrium, leading to up to 3.07x faster FM fine-tuning compared to existing methods.
论文针对网络边缘环境下多租户分拆联邦学习(SFL)中激励基础模型(FM)微调的挑战,提出了一种价格激励机制(PRINCE),使租户能够提供战略性的价格激励以吸引高质量设备的参与。PRINCE 包括一种抗偏差的全局模型聚合方案和一个交通拥堵博弈分析以确定最优激励策略。仿真结果显示,PRINCE 可以将 FM 微调加速至最高 3.07 倍,同时满足各种 FM 类型和数据模态下的性能目标。
Pervasive Annotation Errors Break Text-to-SQL Benchmarks and Leaderboards
Authors: Tengjun Jin, Yoojin Choi, Yuxuan Zhu, Daniel Kang
First: 2026-01-13T18:09:06+00:00 · Latest: 2026-01-13T18:09:06+00:00
Comments: 18 pages, 14 figures, 9 tables
Abstract
Researchers have proposed numerous text-to-SQL techniques to streamline data analytics and accelerate the development of database-driven applications. To compare these techniques and select the best one for deployment, the community depends on public benchmarks and their leaderboards. Since these benchmarks heavily rely on human annotations during question construction and answer evaluation, the validity of the annotations is crucial.
In this paper, we conduct an empirical study that (i) benchmarks annotation error rates for two widely used text-to-SQL benchmarks, BIRD and Spider 2.0-Snow, and (ii) corrects a subset of the BIRD development (Dev) set to measure the impact of annotation errors on text-to-SQL agent performance and leaderboard rankings. Through expert analysis, we show that BIRD Mini-Dev and Spider 2.0-Snow have error rates of 52.8% and 62.8%, respectively. We re-evaluate all 16 open-source agents from the BIRD leaderboard on both the original and the corrected BIRD Dev subsets. We show that performance changes range from -7% to 31% (in relative terms) and rank changes range from $-9$ to $+9$ positions. We further assess whether these impacts generalize to the full BIRD Dev set. We find that the rankings of agents on the uncorrected subset correlate strongly with those on the full Dev set (Spearman's $r_s$=0.85, $p$=3.26e-5), whereas they correlate weakly with those on the corrected subset (Spearman's $r_s$=0.32, $p$=0.23). These findings show that annotation errors can significantly distort reported performance and rankings, potentially misguiding research directions or deployment choices. Our code and data are available at https://github.com/uiuc-kang-lab/text_to_sql_benchmarks.
中文标题/摘要
标题:普遍的注释错误破坏了文本到SQL基准和排行榜
研究人员提出了许多文本到SQL技术以简化数据分析并加速数据库驱动应用的开发。为了比较这些技术并选择最适合部署的最佳技术,社区依赖于公开的基准和排行榜。由于这些基准在问题构建和答案评估过程中高度依赖于人工注释,注释的有效性至关重要。
在本文中,我们进行了一项实证研究,(i) 对两个广泛使用的文本到SQL基准BIRD和Spider 2.0-Snow的注释错误率进行了基准测试,(ii) 修正了BIRD开发集的一部分,以测量注释错误对文本到SQL代理性能和排行榜排名的影响。通过专家分析,我们展示了BIRD Mini-Dev和Spider 2.0-Snow的错误率分别为52.8%和62.8%。我们重新评估了BIRD排行榜上的所有16个开源代理在原始和修正后的BIRD开发集子集上的性能。我们展示了性能变化范围从-7%到31%(相对而言),排名变化范围从-9到+9位。我们进一步评估了这些影响是否适用于完整的BIRD开发集。我们发现,未修正子集上的代理排名与完整开发集上的排名高度相关(Spearman's $r_s$=0.85,$p$=3.26e-5),而与修正后的子集上的排名相关性较弱(Spearman's $r_s$=0.32,$p$=0.23)。这些发现表明,注释错误可以显著扭曲报告的性能和排名,可能误导研究方向或部署选择。我们的代码和数据可在https://github.com/uiuc-kang-lab/text_to_sql_benchmarks/获取。
Summary / 总结
This paper investigates the impact of annotation errors on text-to-SQL benchmarks by analyzing the error rates in BIRD and Spider 2.0-Snow, and correcting a subset of the BIRD development set. The study reveals error rates of 52.8% and 62.8% for BIRD Mini-Dev and Spider 2.0-Snow, respectively. Evaluating 16 open-source agents on both the original and corrected BIRD Dev subsets, the research shows performance changes ranging from -7% to 31% and rank changes from -9 to +9 positions. The findings indicate that annotation errors can significantly distort benchmark results, potentially misleading research and deployment decisions.
本文通过分析两个广泛使用的基准BIRD和Spider 2.0-Snow中的注释错误率,研究了注释错误对文本到SQL基准的影响。研究纠正了BIRD开发集的一部分,并重新评估了16个开源代理,结果显示性能和排名因这些错误而显著变化。研究发现,报告的性能和排名可能具有误导性,可能会影响研究和部署决策。
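The ranking comparison reported in this entry relies on Spearman's rank correlation. A minimal Python sketch of that statistic follows, assuming rankings without ties; the function name and the two toy rankings are illustrative assumptions, not the BIRD leaderboard data.

```python
def spearman_no_ties(rank_a, rank_b):
    """Spearman's r_s for two rankings of the same items, assuming no tied ranks."""
    if len(rank_a) != len(rank_b):
        raise ValueError("rankings must have the same length")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Sum of squared rank differences, then the classic closed form.
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical ranks of six agents on an original and a corrected subset.
original_ranks = [1, 2, 3, 4, 5, 6]
corrected_ranks = [3, 1, 2, 6, 4, 5]
r_s = spearman_no_ties(original_ranks, corrected_ranks)
print(round(r_s, 3))
```

Values near 1 mean the two evaluations order the agents almost identically; values near 0 mean the orderings are largely unrelated. With tied ranks, a library routine that handles ties would be the safer choice.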
Asymptotic Universal Alignment: A New Alignment Framework via Test-Time Scaling
Authors: Yang Cai, Weiqiang Zheng
First: 2026-01-13T18:08:06+00:00 · Latest: 2026-01-13T18:08:06+00:00
Abstract
Aligning large language models (LLMs) to serve users with heterogeneous and potentially conflicting preferences is a central challenge for personalized and trustworthy AI. We formalize an ideal notion of universal alignment through test-time scaling: for each prompt, the model produces $k\ge 1$ candidate responses and a user selects their preferred one. We introduce $(k,f(k))$-robust alignment, which requires the $k$-output model to have win rate $f(k)$ against any other single-output model, and asymptotic universal alignment (U-alignment), which requires $f(k)\to 1$ as $k\to\infty$. Our main result characterizes the optimal convergence rate: there exists a family of single-output policies whose $k$-sample product policies achieve U-alignment at rate $f(k)=\frac{k}{k+1}$, and no method can achieve a faster rate in general.
We show that popular post-training methods, including Nash learning from human feedback (NLHF), can fundamentally underutilize the benefits of test-time scaling. Even though NLHF is optimal for $k=1$, sampling from the resulting (often deterministic) policy cannot guarantee win rates above $\tfrac{1}{2}$ except for an arbitrarily small slack. This stems from a lack of output diversity: existing alignment methods can collapse to a single majority-preferred response, making additional samples redundant. In contrast, our approach preserves output diversity and achieves the optimal test-time scaling rate. In particular, we propose a family of symmetric multi-player alignment games and prove that any symmetric Nash equilibrium policy of the $(k+1)$-player alignment game achieves the optimal $(k,\frac{k}{k+1})$-robust alignment. Finally, we provide theoretical convergence guarantees for self-play learning dynamics in these games and extend the framework to opponents that also generate multiple responses.
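One way to build intuition for the rate $\frac{k}{k+1}$ is a toy simulation, given here as a sketch rather than the paper's argument. It assumes the opponent draws its single response from the same distribution as our $k$ responses and that the user's utilities are continuous, so ties have probability zero; under those assumptions each of the $k+1$ responses is equally likely to be the favorite. The function name and the uniform utility model are illustrative choices.

```python
import random


def estimate_win_rate(k, trials=100_000, seed=0):
    """Monte Carlo estimate of how often the best of k i.i.d. responses
    beats one independent response from the same distribution."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Utility of the user's favorite among our k candidate responses.
        ours = max(rng.random() for _ in range(k))
        # Utility of the opponent's single response.
        theirs = rng.random()
        if ours > theirs:
            wins += 1
    return wins / trials


for k in (1, 2, 4):
    print(k, estimate_win_rate(k), k / (k + 1))
```

The estimates should sit close to 1/2, 2/3, and 4/5. The abstract's stronger claim, that this rate holds against any single-output opponent and cannot be beaten in general, is not something this simulation demonstrates.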
Reliable Graph-RAG for Codebases: AST-Derived Graphs vs LLM-Extracted Knowledge Graphs
Authors: Manideep Reddy Chinthareddy
First: 2026-01-13T18:03:41+00:00 · Latest: 2026-01-13T18:03:41+00:00
Comments: 46 pages, 2 figures
Abstract
Retrieval-Augmented Generation for software engineering often relies on vector similarity search, which captures topical similarity but can fail on multi-hop architectural reasoning such as controller to service to repository chains, interface-driven wiring, and inheritance. This paper benchmarks three retrieval pipelines on Java codebases (Shopizer, with additional runs on ThingsBoard and OpenMRS Core): (A) vector-only No-Graph RAG, (B) an LLM-generated knowledge graph RAG (LLM-KB), and (C) a deterministic AST-derived knowledge graph RAG (DKB) built with Tree-sitter and bidirectional traversal.
Using 15 architecture and code-tracing queries per repository, we measure indexing time, query latency, corpus coverage, cost, and answer correctness. DKB builds its graph in seconds, while LLM-KB requires much longer graph generation. LLM-KB also shows indexing incompleteness: on Shopizer, 377 files are skipped or missed, reducing embedded chunk coverage and graph size compared to DKB. End-to-end cost is modest for DKB relative to the vector-only baseline but much higher for LLM-KB, especially as repository scale increases. Query latency is similar for No-Graph and DKB, while LLM-KB is slower and more variable. On the Shopizer question suite, DKB achieves the highest correctness, LLM-KB is close behind, and the vector-only baseline performs worst on upstream architectural queries and has the highest hallucination risk. Overall, deterministic AST-derived graphs provide more reliable coverage and multi-hop grounding than LLM-extracted graphs at substantially lower indexing cost.
中文标题/摘要
标题:代码库中可靠的图-RAG:AST派生图与LLM提取的知识图
软件工程中的检索增强生成通常依赖于向量相似性搜索,这可以捕捉主题相似性,但在多跳架构推理(如控制器到服务到存储库链、接口驱动的连接和继承)方面可能会失败。本文在Java代码库(Shopizer,额外运行于ThingsBoard和OpenMRS Core)上对三种检索管道进行了基准测试:(A)仅向量的无图RAG,(B)由LLM生成的知识图RAG(LLM-KB),以及(C)使用Tree-sitter和双向遍历构建的确定性AST派生知识图RAG(DKB)。
使用每个仓库15个架构和代码追踪查询,我们测量了索引时间、查询延迟、语料库覆盖率、成本和答案准确性。DKB在数秒内即可构建图,而LLM-KB的图生成耗时长得多。LLM-KB还显示了索引不完整性:在Shopizer上,有377个文件被跳过或遗漏,导致嵌入片段覆盖率和图大小低于DKB。相对于仅向量基线,DKB的端到端成本增幅不大,而LLM-KB的成本则高得多,尤其是随着仓库规模的增加。查询延迟对于无图和DKB相似,而LLM-KB则更慢且更不稳定。在Shopizer问题集中,DKB的正确性最高,LLM-KB紧随其后,仅向量基线在上游架构查询中表现最差且具有最高的幻觉风险。总体而言,与LLM提取的图相比,确定性AST派生图以显著更低的索引成本提供了更可靠的覆盖和多跳定位。
Summary / 总结
This paper evaluates retrieval-augmented generation methods for software engineering using vector similarity search, LLM-generated knowledge graphs, and deterministic AST-derived knowledge graphs on Java codebases. It finds that AST-derived knowledge graphs (DKB) are faster to build, more complete, and cheaper to index than LLM-generated knowledge graphs (LLM-KB). On the Shopizer question suite, DKB achieves the highest correctness with LLM-KB close behind, while the vector-only baseline performs worst on upstream architectural queries. Query latency is similar for the vector-only baseline and DKB, whereas LLM-KB is slower and more variable.
本文评估了使用向量相似性搜索、LLM生成的知识图和基于AST的确定性知识图对Java代码库的检索增强生成方法。研究发现,基于AST的确定性知识图(DKB)构建速度快、更完整且成本更低,优于LLM生成的知识图(LLM-KB)。DKB在正确性方面也更高,尤其是在上游架构查询方面,而LLM-KB则在延迟方面更慢且更不稳定。
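The idea of a deterministic, AST-derived graph with bidirectional traversal can be illustrated with a small Python sketch. Note the substitution: the paper uses Tree-sitter on Java code, whereas this example uses Python's standard `ast` module on a toy Python snippet so that it runs without extra dependencies. The function names and the sample source are hypothetical.

```python
import ast
from collections import defaultdict, deque

# Toy source mimicking a controller -> service -> repository chain.
SOURCE = '''
def controller(req):
    return service(req)

def service(req):
    return repository(req)

def repository(req):
    return {"id": req}
'''


def build_call_graph(source):
    """Map each top-level function to the names it calls, plus the reverse map."""
    tree = ast.parse(source)
    calls, called_by = defaultdict(set), defaultdict(set)
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            for sub in ast.walk(node):
                # Record only direct calls to plain names, e.g. service(req).
                if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                    calls[node.name].add(sub.func.id)
                    called_by[sub.func.id].add(node.name)
    return calls, called_by


def reachable(graph, start):
    """Breadth-first traversal returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


calls, called_by = build_call_graph(SOURCE)
downstream = reachable(calls, "controller")     # what controller depends on
upstream = reachable(called_by, "repository")   # what depends on repository
print(sorted(downstream), sorted(upstream))
```

Keeping both edge directions is what lets a retriever answer "what does this call?" and "who calls this?" from the same graph; a real pipeline would also need to resolve methods, interfaces, and inheritance, which this sketch ignores.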
STELP: Secure Transpilation and Execution of LLM-Generated Programs
Authors: Swapnil Shinde, Sahil Wadhwa, Andy Luo, Akshay Gupta, Mohammad Shahed Sorower
First: 2026-01-09T01:49:41+00:00 · Latest: 2026-01-13T17:55:11+00:00
Abstract
Rapid evolution of Large Language Models (LLMs) has achieved major advances in reasoning, planning, and function-calling capabilities. Multi-agentic collaborative frameworks using such LLMs place them at the center of solving software development-related tasks such as code generation. However, direct use of LLM generated code in production software development systems is problematic. The code could be unstable or erroneous and contain vulnerabilities such as data poisoning, malicious attacks, and hallucinations that could lead to widespread system malfunctions. This prohibits the adoption of LLM generated code in production AI systems where human code reviews and traditional secure testing tools are impractical or untrustworthy. In this paper, we discuss safety and reliability problems with the execution of LLM generated code and propose a Secure Transpiler and Executor of LLM-Generated Program (STELP), capable of executing LLM-generated code in a controlled and safe manner. STELP secures autonomous production AI systems involving code generation, filling the critical void left by the impracticality or limitations of traditional secure testing methodologies and human oversight. This includes applications such as headless code generation-execution and LLMs that produce executable code snippets as an action plan to be executed in real time. We contribute a human-validated dataset of insecure code snippets and benchmark our approach on publicly available datasets for correctness, safety, and latency. Our results demonstrate that our approach outperforms an existing method by a significant margin, particularly in its ability to safely execute risky code snippets. Warning: This paper contains malicious code snippets that should be run with caution.
中文标题/摘要
标题:STELP:安全转换和执行LLM生成的程序
大型语言模型(LLMs)的快速进化在推理、规划和函数调用能力方面取得了重大进展。使用此类LLMs的多智能体协作框架将它们置于解决软件开发相关任务(如代码生成)的核心位置。然而,直接在生产软件开发系统中使用LLM生成的代码存在问题。这些代码可能不稳定或错误,并且可能包含数据中毒、恶意攻击和幻觉等漏洞,可能导致系统广泛故障。这阻碍了在生产AI系统中采用LLM生成的代码,其中人工代码审查和传统安全测试工具是不切实际或不可信的。在本文中,我们讨论了执行LLM生成代码的安全性和可靠性问题,并提出了一种安全转换和执行LLM生成程序(STELP)的方法,能够以受控和安全的方式执行LLM生成的代码。STELP 为涉及代码生成的自主生产AI系统提供了安全保障,填补了传统安全测试方法和人工监督不切实际或有限的空白。这包括无头代码生成-执行和LLM生成可执行代码片段作为实时执行的操作计划的应用。我们贡献了一个经过人工验证的不安全代码片段数据集,并在公开可用的数据集上对我们的方法进行了正确性、安全性和延迟的基准测试。我们的结果表明,与现有方法相比,我们的方法在安全执行风险代码片段方面表现出显著优势。警告:本文包含恶意代码片段,应谨慎运行。
Summary / 总结
This paper addresses the challenges of using code generated by Large Language Models (LLMs) in production systems, where it can be unstable, erroneous, and contain vulnerabilities. To mitigate these risks, the authors propose STELP, a secure transpiler and executor that runs LLM-generated code in a controlled and safe manner, targeting autonomous AI systems where human code review and traditional secure testing are impractical. They contribute a human-validated dataset of insecure code snippets and benchmark STELP on public datasets for correctness, safety, and latency. STELP outperforms an existing method by a significant margin, particularly in safely executing risky code snippets.
本文针对大型语言模型(LLM)生成的代码在生产系统中使用时存在的风险,提出了STELP,这是一种安全的转译和执行器。STELP通过缓解数据污染和幻觉等漏洞,确保LLM生成代码的安全执行。该方法通过使用经过人工验证的数据集和公开基准进行验证,显示出在正确性、安全性和延迟方面比现有方法具有显著优势,尤其是在处理风险代码片段方面表现更佳。
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Authors: Zhiyuan Hu, Yucheng Wang, Yufei He, Jiaying Wu, Yilun Zhao, See-Kiong Ng, Cynthia Breazeal, Anh Tuan Luu, Hae Won Park, Bryan Hooi
First: 2026-01-13T17:48:43+00:00 · Latest: 2026-01-13T17:48:43+00:00
Comments: Work in Progress
Abstract
Reinforcement learning (RL) has become a central paradigm for post-training large language models (LLMs), particularly for complex reasoning tasks, yet it often suffers from exploration collapse: policies prematurely concentrate on a small set of dominant reasoning patterns, improving pass@1 while limiting rollout-level diversity and gains in pass@k. We argue that this failure stems from regularizing local token behavior rather than diversity over sets of solutions. To address this, we propose Uniqueness-Aware Reinforcement Learning, a rollout-level objective that explicitly rewards correct solutions that exhibit rare high-level strategies. Our method uses an LLM-based judge to cluster rollouts for the same problem according to their high-level solution strategies, ignoring superficial variations, and reweights policy advantages inversely with cluster size. As a result, correct but novel strategies receive higher rewards than redundant ones. Across mathematics, physics, and medical reasoning benchmarks, our approach consistently improves pass@$k$ across large sampling budgets and increases the area under the pass@$k$ curve (AUC@$K$) without sacrificing pass@1, while sustaining exploration and uncovering more diverse solution strategies at scale.
中文标题/摘要
标题:奖励独特性:面向创意问题解决的LLMs独特性感知RL
强化学习(RL)已成为大型语言模型(LLMs)后训练的核心范式,特别是在复杂推理任务中,但它经常遭受探索崩溃的问题:策略过早地集中在少数主导推理模式上,提高了pass@1,但限制了策略级的多样性和pass@k的收益。我们认为这种失败源于对局部token行为的正则化,而不是对解决方案集的多样性。为了解决这个问题,我们提出了独特性感知强化学习,这是一种策略级目标,明确奖励表现出罕见高层策略的正确解决方案。该方法使用基于LLM的裁判将相同问题的策略根据其高层解决方案策略聚类,忽略表面差异,并根据聚类大小反向重置策略优势。因此,正确但新颖的策略比冗余策略获得更高的奖励。在数学、物理和医学推理基准测试中,我们的方法在大规模采样预算下始终如一地提高了pass@$k$,并增加了pass@$k$曲线下的面积(AUC@$K$),同时不牺牲pass@1,而保持了探索并揭示了更多多样化的解决方案策略。
Summary / 总结
The paper addresses the issue of exploration collapse in reinforcement learning for large language models, where policies tend to focus on a few dominant reasoning patterns. It introduces Uniqueness-Aware Reinforcement Learning, which rewards solutions that exhibit rare high-level strategies, using an LLM-based judge to cluster rollouts and reweight policy advantages. Experiments show consistent improvements in pass@$k$ across various reasoning benchmarks, increasing the AUC@$K$ without compromising pass@1, and promoting more diverse solution strategies.
论文针对强化学习在大型语言模型中出现的探索枯竭问题,即政策倾向于聚焦于少数主导推理模式。提出了基于独特性的强化学习方法,通过奖励罕见的高层面策略来促进多样性。该方法使用基于LLM的裁判来聚类相似的高层面解决方案,并按集群大小的倒数重新加权策略优势,从而在各种基准测试中提高了pass@$k$,同时不牺牲pass@1,并增加了pass@$k$曲线下的面积。
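The reweighting step described in this entry (policy advantages scaled inversely with strategy-cluster size) might look roughly like the Python sketch below. The plain `1 / cluster_size` weight, the function name, and the hard-coded cluster labels are assumptions for illustration; in the paper the clusters come from an LLM-based judge, and the exact weighting and normalization may differ.

```python
from collections import Counter


def uniqueness_weighted_advantages(advantages, cluster_ids):
    """Scale each rollout's advantage by 1 / (size of its strategy cluster)."""
    if len(advantages) != len(cluster_ids):
        raise ValueError("one cluster id is required per rollout")
    sizes = Counter(cluster_ids)
    return [adv / sizes[cid] for adv, cid in zip(advantages, cluster_ids)]


# Four correct rollouts for one problem: three share a strategy, one is unique.
advantages = [1.0, 1.0, 1.0, 1.0]
cluster_ids = ["algebra", "algebra", "algebra", "geometry"]  # hypothetical judge output
weighted = uniqueness_weighted_advantages(advantages, cluster_ids)
print(weighted)
```

In this toy case the lone "geometry" rollout keeps its full advantage while each "algebra" rollout is scaled down to one third, which matches the stated goal that correct but novel strategies receive more credit than redundant ones.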
MDReID: Modality-Decoupled Learning for Any-to-Any Multi-Modal Object Re-Identification
Authors: Yingying Feng, Jie Li, Jie Hu, Yukang Zhang, Lei Tan, Jiayi Ji
Venue: NeurIPS 2025
First: 2025-10-27T13:08:46+00:00 · Latest: 2026-01-13T17:44:49+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Real-world object re-identification (ReID) systems often face modality inconsistencies, where query and gallery images come from different sensors (e.g., RGB, NIR, TIR). However, most existing methods assume modality-matched conditions, which limits their robustness and scalability in practical applications. To address this challenge, we propose MDReID, a flexible any-to-any image-level ReID framework designed to operate under both modality-matched and modality-mismatched scenarios. MDReID builds on the insight that modality information can be decomposed into two components: modality-shared features that are predictable and transferable, and modality-specific features that capture unique, modality-dependent characteristics. To effectively leverage this, MDReID introduces two key components: the Modality Decoupling Learning (MDL) and Modality-aware Metric Learning (MML). Specifically, MDL explicitly decomposes modality features into modality-shared and modality-specific representations, enabling effective retrieval in both modality-aligned and mismatched scenarios. MML, a tailored metric learning strategy, further enforces orthogonality and complementarity between the two components to enhance discriminative power across modalities. Extensive experiments conducted on three challenging multi-modality ReID benchmarks (RGBNT201, RGBNT100, MSVR310) consistently demonstrate the superiority of MDReID. Notably, MDReID achieves significant mAP improvements of 9.8%, 3.0%, and 11.5% in general modality-matched scenarios, and average gains of 3.4%, 11.8%, and 10.9% in modality-mismatched scenarios, respectively. The code is available at: https://github.com/stone96123/MDReID.
中文标题/摘要
标题:MDReID:解耦模态学习的任意到任意多模态物体重识别
现实世界中的物体重识别(ReID)系统经常面临模态不一致的问题,其中查询和画廊图像来自不同的传感器(例如,RGB、NIR、TIR)。然而,大多数现有方法假设模态匹配的条件,这限制了它们在实际应用中的鲁棒性和可扩展性。为了解决这一挑战,我们提出了一种灵活的任意到任意图像级ReID框架MDReID,该框架设计用于在模态匹配和模态不匹配的场景下运行。MDReID基于这样一个洞察:模态信息可以分解为两个部分:模态共享特征,这些特征是可预测和可转移的,以及模态特定特征,这些特征捕捉独特的、模态依赖的特性。为了有效利用这一点,MDReID引入了两个关键组件:模态解耦学习(MDL)和模态感知度量学习(MML)。具体来说,MDL明确地将模态特征分解为模态共享和模态特定表示,使在模态对齐和不匹配的场景下都能有效检索。MML是一种定制的度量学习策略,进一步确保了两个组件之间的正交性和互补性,以增强跨模态的判别力。在三个具有挑战性的多模态ReID基准(RGBNT201、RGBNT100、MSVR310)上进行的大量实验一致地证明了MDReID的优势。值得注意的是,MDReID在一般模态匹配场景中实现了9.8%、3.0%和11.5%的显著mAP改进,在模态不匹配场景中分别实现了3.4%、11.8%和10.9%的平均收益。代码可在:https://github.com/stone96123/MDReID 获取。
Summary / 总结
MDReID is a flexible framework for object re-identification that addresses modality inconsistencies in real-world scenarios. It decomposes modality features into shared and specific components and uses Modality Decoupling Learning (MDL) and Modality-aware Metric Learning (MML) to enhance retrieval performance. Experiments on three benchmarks show that MDReID outperforms existing methods, with significant improvements in both modality-matched and mismatched scenarios, achieving mAP gains of up to 11.5% and 11.8% respectively.
MDReID 是一个灵活的对象重识别框架,旨在解决查询和画廊图像之间模态不一致的问题。它将模态特征分解为共享和特定组件,并使用模态解耦学习(MDL)和模态感知度量学习(MML)来增强检索性能。在三个基准上的实验表明,MDReID 在模态匹配场景中的 mAP 提高了 9.8%、3.0% 和 11.5%,在模态不匹配场景中的 mAP 分别提高了 3.4%、11.8% 和 10.9%。
M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding
Authors: Juntao Jiang, Jiangning Zhang, Yali Bi, Jinsheng Bai, Weixuan Liu, Weiwei Jin, Zhucun Xue, Yong Liu, Xiaobin Hu, Shuicheng Yan
First: 2026-01-13T17:42:27+00:00 · Latest: 2026-01-13T17:42:27+00:00
Comments: 40 pages, 8 pages
Abstract
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at https://juntaojianggavin.github.io/projects/M3CoTBench/.
中文标题/摘要
标题:M3CoTBench:医学影像理解中MLLMs的链式思维基准
链式思维(CoT)推理已被证明能有效提升大型语言模型,通过鼓励逐步的中间推理,而最近的进展将这一范式扩展到了多模态大型语言模型(MLLMs)。在医学领域,诊断决策依赖于细微的视觉线索和顺序推理,CoT 与临床思维过程自然契合。然而,当前用于医学影像理解的基准一般侧重于最终答案,而忽视了推理路径。不透明的过程缺乏可靠的判断基础,难以帮助医生进行诊断。为解决这一差距,我们引入了一个新的M3CoTBench基准,专门用于评估医学影像理解中CoT推理的正确性、效率、影响和一致性。M3CoTBench包括1)涵盖24种检查类型的多样化、多层次难度数据集,2)13种不同难度的任务,3)一套针对临床推理的CoT特定评估指标(正确性、效率、影响和一致性),以及4)多种MLLMs的性能分析。M3CoTBench系统地评估了不同医学成像任务中的CoT推理,揭示了MLLMs在生成可靠且临床可解释的推理方面的当前局限性,并旨在促进透明、可信且诊断准确的AI系统的开发,以服务于医疗保健。项目页面:https://juntaojianggavin.github.io/projects/M3CoTBench/
Summary / 总结
M3CoTBench is a new benchmark designed to evaluate the CoT reasoning of MLLMs in medical image understanding. It includes a diverse dataset with 24 examination types and 13 varying-difficulty tasks, along with specific evaluation metrics for correctness, efficiency, impact, and consistency. The benchmark reveals that current MLLMs struggle to generate reliable and clinically interpretable reasoning, highlighting the need for more transparent and accurate AI systems in healthcare.
M3CoTBench 是一个新基准,旨在评估 MLLMs 在医学影像理解中的 CoT 推理能力。它包含一个多样化的数据集,涵盖 24 种检查类型和 13 个不同难度的任务,并配有针对临床推理的正确性、效率、影响和一致性等特定评估指标。该基准揭示了 MLLMs 在生成可靠且临床可解释的推理方面的当前局限性,旨在促进更透明和可信赖的 AI 系统的发展,以用于医疗保健。
Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis
Authors: Sara Giordano, Kornikar Sen, Miguel A. Martin-Delgado
First: 2025-07-22T14:39:20+00:00 · Latest: 2026-01-13T17:34:14+00:00
Comments: 35 pages, 7 figures, color figures
Abstract
A reinforcement learning (RL) framework is introduced for the efficient synthesis of quantum circuits that generate specified target quantum states from a fixed initial state, addressing a central challenge in both the Noisy Intermediate-Scale Quantum (NISQ) era and future fault-tolerant quantum computing. The approach utilizes tabular Q-learning, based on action sequences, within a discretized quantum state space, to effectively manage the exponential growth of the space dimension. The framework introduces a hybrid reward mechanism, combining a static, domain-informed reward that guides the agent toward the target state with customizable dynamic penalties that discourage inefficient circuit structures such as gate congestion and redundant state revisits. This is a circuit-aware reward, in contrast to the current trend of works on this topic, which are primarily fidelity-based. By leveraging sparse matrix representations and state-space discretization, the method enables practical navigation of high-dimensional environments while minimizing computational overhead. Benchmarking on graph-state preparation tasks for up to seven qubits, we demonstrate that the algorithm consistently discovers minimal-depth circuits with optimized gate counts. Moreover, extending the framework to a universal gate set still yields low-depth circuits, highlighting the algorithm's robustness and adaptability. The results confirm that this RL-driven approach, with our completely circuit-aware method, efficiently explores the complex quantum state space and synthesizes near-optimal quantum circuits, providing a resource-efficient foundation for quantum circuit optimization.
中文标题/摘要
标题:混合奖励驱动的强化学习在高效量子电路合成中的应用
提出了一种强化学习(RL)框架,用于从固定初始状态高效合成生成指定目标量子态的量子电路,解决了含噪声中等规模量子(NISQ)时代和未来容错量子计算中的核心挑战。该方法利用基于动作序列的表格Q学习,在离散化的量子态空间中有效管理空间维度的指数增长。该框架引入了一种混合奖励机制,结合了一个静态、领域导向的奖励,引导代理向目标状态移动,以及可定制的动态惩罚,以避免诸如门拥堵和重复状态访问等低效电路结构。这是一种电路感知的奖励,与当前该领域工作的主要基于保真度的方法不同。通过利用稀疏矩阵表示和状态空间离散化,该方法能够在最小化计算开销的同时,实现高维环境的有效导航。在最多七量子比特的图态准备任务上进行基准测试,我们证明该算法能够一致地发现具有优化门计数的最小深度电路。此外,将该框架扩展到通用门集仍然能够生成低深度电路,突显了该算法的鲁棒性和适应性。结果表明,这种基于RL的方法,结合我们完全电路感知的方法,能够高效地探索复杂的量子态空间,并合成接近最优的量子电路,为量子电路优化提供了一种资源高效的基石。
Summary / 总结
This paper introduces a reinforcement learning framework for synthesizing quantum circuits that generate specific target quantum states from a fixed initial state. The approach uses tabular Q-learning in a discretized quantum state space to manage the exponential growth of the state space. A hybrid reward mechanism, combining a static, domain-informed reward and customizable dynamic penalties, guides the agent towards the target state while discouraging inefficient circuit structures. The method demonstrates the ability to discover minimal-depth circuits with optimized gate counts for graph-state preparation tasks up to seven qubits, and it remains robust when extended to a universal gate set, confirming its efficiency and adaptability in exploring the complex quantum state space.
研究提出了一种基于强化学习的量子电路合成框架,用于从固定初始状态生成指定的目标量子态。该方法使用在离散量子态空间中的表Q学习,并采用结合静态目标导向奖励和动态惩罚的混合奖励机制。该方法在最多七量子位的图态准备任务中展示了能够找到具有优化门数的最小深度电路的能力,并且在扩展到通用门集时仍能生成低深度电路,证实了该方法在探索量子态空间方面的高效性和适应性。
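The hybrid reward described in this entry can be illustrated with a minimal sketch. It assumes states and actions are plain hashable labels, and every name and constant below (`hybrid_reward`, `q_update`, `goal_reward`, `congestion_limit`, and so on) is hypothetical rather than taken from the paper. It only shows how a static goal reward could be combined with dynamic penalties for state revisits and gate congestion inside a standard tabular Q-learning update.

```python
from collections import defaultdict


def hybrid_reward(state, target, visited, gate_counts, qubit,
                  goal_reward=10.0, revisit_penalty=1.0,
                  congestion_penalty=0.5, congestion_limit=2):
    """Static goal reward plus dynamic penalties (all constants are illustrative)."""
    # Static part: reward reaching the target state.
    reward = goal_reward if state == target else 0.0
    # Dynamic part 1: discourage returning to an already visited state.
    if state in visited:
        reward -= revisit_penalty
    # Dynamic part 2: discourage stacking too many gates on one qubit.
    if gate_counts.get(qubit, 0) > congestion_limit:
        reward -= congestion_penalty
    return reward


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    """Standard tabular Q-learning update on a dict-backed Q-table."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


# Usage: one update after a step that reaches the (hypothetical) target "s1".
q_table = defaultdict(float)
r = hybrid_reward("s1", "s1", visited={"s0"}, gate_counts={0: 1}, qubit=0)
new_value = q_update(q_table, "s0", "H0", r, "s1", actions=["H0", "CZ01"])
```

Keeping the reward separate from the update rule makes it straightforward to swap in different penalty terms without touching the learning step.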
Grid-Aware Charging and Operational Optimization for Mixed-Fleet Public Transit
Authors: Rishav Sen, Amutheezan Sivagnanam, Aron Laszka, Ayan Mukhopadhyay, Abhishek Dubey
Venue: 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC), 2024
First: 2026-01-13T17:30:25+00:00 · Latest: 2026-01-13T17:30:25+00:00
Comments: 7 pages, 7 figures, 4 algorithms. Published in the Proceedings of the 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)
Abstract
The rapid growth of urban populations and the increasing need for sustainable transportation solutions have prompted a shift towards electric buses in public transit systems. However, the effective management of mixed fleets consisting of both electric and diesel buses poses significant operational challenges. One major challenge is coping with dynamic electricity pricing, where charging costs vary throughout the day. Transit agencies must optimize charging assignments in response to such dynamism while accounting for secondary considerations such as seating constraints. This paper presents a comprehensive mixed-integer linear programming (MILP) model to address these challenges by jointly optimizing charging schedules and trip assignments for mixed (electric and diesel bus) fleets while considering factors such as dynamic electricity pricing, vehicle capacity, and route constraints. We address the potential computational intractability of the MILP formulation, which can arise even with relatively small fleets, by employing a hierarchical approach tailored to the fleet composition. By using real-world data from the city of Chattanooga, Tennessee, USA, we show that our approach can result in significant savings in the operating costs of the mixed transit fleets.
中文标题/摘要
标题:适应电网的混合车队公共交通充电与运营优化
随着城市人口的快速增长和对可持续交通解决方案的需求增加,公共交通系统正转向电动巴士。然而,管理由电动巴士和柴油巴士组成的混合车队带来了重大的运营挑战。一个主要挑战是应对动态电价,充电成本会随时间变化。公交机构必须根据这种动态性优化充电分配,同时考虑如座位限制等次要因素。本文提出了一种综合的混合整数线性规划(MILP)模型,通过同时优化混合(电动和柴油巴士)车队的充电时间表和行程分配,考虑动态电价、车辆容量和路线限制等因素来应对这些挑战。我们通过一种针对车队组成定制的分层方法来解决MILP公式可能带来的潜在计算不可行性问题,即使车队规模相对较小也是如此。通过使用美国田纳西州查塔努加市的实际数据,我们展示了我们的方法可以显著降低混合公交车队的运营成本。
Summary / 总结
This paper addresses the operational challenges of managing mixed fleets of electric and diesel buses in public transit systems, particularly in response to dynamic electricity pricing. It introduces a mixed-integer linear programming (MILP) model to optimize charging schedules and trip assignments, considering factors like vehicle capacity and route constraints. Using real-world data from Chattanooga, Tennessee, the study demonstrates substantial savings in operating costs for mixed transit fleets.
本文探讨了在应对动态电价时管理混合车队(电动和柴油巴士)在公共交通系统中的运营挑战。它提出了一种混合整数线性规划(MILP)模型,以优化充电时间和行程分配,同时考虑车辆容量和路线限制等因素。通过使用美国田纳西州查塔努加市的实际数据,研究展示了混合巴士车队运营成本的显著节省。
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
Authors: Changli Wu, Haodong Wang, Jiayi Ji, Yutian Yao, Chunsai Du, Jihua Kang, Yanwei Fu, Liujuan Cao
First: 2026-01-11T11:44:07+00:00 · Latest: 2026-01-13T17:29:39+00:00
Comments: Project Website: https://sosppxo.github.io/mvggt.github.io/
Abstract
Most existing 3D referring expression segmentation (3DRES) methods rely on dense, high-quality point clouds, while real-world agents such as robots and mobile phones operate with only a few sparse RGB views and strict latency constraints. We introduce Multi-view 3D Referring Expression Segmentation (MV-3DRES), where the model must recover scene structure and segment the referred object directly from sparse multi-view images. Traditional two-stage pipelines, which first reconstruct a point cloud and then perform segmentation, often yield low-quality geometry, produce coarse or degraded target regions, and run slowly. We propose the Multimodal Visual Geometry Grounded Transformer (MVGGT), an efficient end-to-end framework that integrates language information into sparse-view geometric reasoning through a dual-branch design. Training in this setting exposes a critical optimization barrier, termed Foreground Gradient Dilution (FGD), where sparse 3D signals lead to weak supervision. To resolve this, we introduce Per-view No-target Suppression Optimization (PVSO), which provides stronger and more balanced gradients across views, enabling stable and efficient learning. To support consistent evaluation, we build MVRefer, a benchmark that defines standardized settings and metrics for MV-3DRES. Experiments show that MVGGT establishes the first strong baseline and achieves both high accuracy and fast inference, outperforming existing alternatives. Code and models are publicly available at https://mvggt.github.io.
中文标题/摘要
标题:MVGGT:多模态视觉几何导向变换器在多视图3D指示表达分割中的应用
大多数现有的3D指示表达分割(3DRES)方法依赖于密集的高质量点云,而现实世界中的代理设备如机器人和手机仅能获取少量稀疏的RGB视图,并且具有严格的延迟限制。我们提出了多视图3D指示表达分割(MV-3DRES),其中模型必须直接从稀疏多视图图像中恢复场景结构并分割指示的对象。传统的两阶段管道首先重建点云,然后进行分割,通常会导致低质量的几何结构,产生粗略或退化的目标区域,并且运行速度较慢。我们提出了多模态视觉几何导向变换器(MVGGT),这是一种高效的端到端框架,通过双分支设计将语言信息整合到稀疏视图几何推理中。在这种设置下进行训练暴露出一个关键的优化障碍,称为前景梯度稀释(FGD),其中稀疏的3D信号导致监督较弱。为了解决这个问题,我们引入了视图无目标抑制优化(PVSO),它提供了更强且更平衡的梯度,使学习更加稳定和高效。为了支持一致的评估,我们构建了MVRefer基准,定义了MV-3DRES的标准设置和指标。实验表明,MVGGT建立了第一个强大的基线,并实现了高精度和快速推理,优于现有方法。代码和模型已公开发布于https://mvggt.github.io。
Summary / 总结
The research addresses the limitations of existing 3DRES methods that rely on dense point clouds, which are not practical for real-world applications with sparse RGB views and latency constraints. It introduces MVGGT, an end-to-end framework that integrates language information into sparse-view geometric reasoning. The key finding is that MVGGT resolves the optimization barrier of Foreground Gradient Dilution and provides both high accuracy and fast inference, outperforming existing methods. Training includes Per-view No-target Suppression Optimization to ensure stable and efficient learning.
研究针对现有依赖密集点云的3DRES方法在实际应用中难以处理稀疏RGB视图和延迟限制的问题。提出了MVGGT框架,该框架将语言信息整合到稀疏视图的几何推理中。关键发现是MVGGT解决了前景梯度稀释的优化障碍,并实现了高精度和快速推理,优于现有方法。训练过程中使用了视图无目标抑制优化来确保稳定和高效的训练。
To Retrieve or To Think? An Agentic Approach for Context Evolution
Authors: Rubing Chen, Jian Wang, Wenjie Li, Xiao-Yong Wei, Qing Li
First: 2026-01-13T17:25:57+00:00 · Latest: 2026-01-13T17:25:57+00:00
Abstract
Current context augmentation methods, such as retrieval-augmented generation, are essential for solving knowledge-intensive reasoning tasks. However, they typically adhere to a rigid, brute-force strategy that executes retrieval at every step. This indiscriminate approach not only incurs unnecessary computational costs but also degrades performance by saturating the context with irrelevant noise. To address these limitations, we introduce Agentic Context Evolution (ACE), a framework inspired by human metacognition that dynamically determines whether to seek new evidence or reason with existing knowledge. ACE employs a central orchestrator agent to make decisions strategically via majority voting. It aims to alternate between activating a retriever agent for external retrieval and a reasoner agent for internal analysis and refinement. By eliminating redundant retrieval steps, ACE maintains a concise and evolved context. Extensive experiments on challenging multi-hop QA benchmarks demonstrate that ACE significantly outperforms competitive baselines in accuracy while achieving efficient token consumption. Our work provides valuable insights into advancing context-evolved generation for complex, knowledge-intensive tasks.
中文标题/摘要
标题:取回还是思考?一种代理导向的上下文演化方法
当前的上下文增强方法,如检索增强生成,对于解决知识密集型推理任务至关重要。然而,它们通常遵循一种僵化的、粗暴的策略,在每一步都执行检索。这种不分青红皂白的方法不仅导致不必要的计算成本,还会通过饱和上下文以无关噪声降低性能。为解决这些局限性,我们引入了代理导向的上下文演化(ACE),这是一种受人类元认知启发的框架,能够动态决定是寻求新证据还是利用现有知识进行推理。ACE 通过中央协调代理体,通过多数投票策略进行战略性决策。它旨在交替激活检索代理体进行外部检索和推理代理体进行内部分析和优化。通过消除冗余的检索步骤,ACE 维持了一个简洁且演化的上下文。在具有挑战性的多跳 QA 基准测试中,ACE 显著优于竞争性基线,在准确性和高效地消耗标记方面均表现出色。我们的工作为复杂、知识密集型任务的上下文演化生成提供了宝贵的见解。
Summary / 总结
The paper addresses the limitations of current context augmentation methods, such as retrieval-augmented generation, which often incur unnecessary computational costs and degrade performance by including irrelevant information. To overcome these issues, the authors propose Agentic Context Evolution (ACE), a framework that uses a central agent to decide between retrieving new information or reasoning with existing knowledge. ACE alternates between a retriever and a reasoner to maintain a concise and relevant context. Experiments show that ACE outperforms existing methods in accuracy while using fewer tokens, providing a valuable approach for complex reasoning tasks.
论文针对当前上下文增强方法存在的问题,这些方法通常采用僵化的检索策略,可能导致计算成本高且引入无关噪声。为解决这些问题,作者提出了Agentic Context Evolution (ACE)框架,该框架通过中央代理决定是检索新信息还是利用现有知识进行推理。ACE在多跳问答基准测试中表现出色,相比现有方法,它在准确性和高效使用令牌方面都取得了更好的效果。
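The retrieve-or-think decision in this entry can be sketched as a small control loop. This is a rough illustration under stated assumptions, not the authors' implementation: the voters, `retriever`, and `reasoner` are hypothetical callables standing in for LLM agents, the tie-breaking rule (prefer "think") is an assumption, and the fixed `max_steps` budget replaces whatever stopping criterion the paper uses.

```python
from collections import Counter


def majority_vote(votes):
    """Pick 'retrieve' only on a strict majority; ties fall back to 'think' (an assumption)."""
    counts = Counter(votes)
    return "retrieve" if counts["retrieve"] > counts["think"] else "think"


def evolve_context(question, voters, retriever, reasoner, max_steps=4):
    """Alternate between external retrieval and internal reasoning, one vote per step."""
    context = []
    for _ in range(max_steps):
        # Each voter inspects the question and the current context.
        votes = [vote(question, context) for vote in voters]
        if majority_vote(votes) == "retrieve":
            context.append(retriever(question, context))
        else:
            context.append(reasoner(question, context))
    return context


# Usage with stand-in agents: retrieve once while the context is empty, then reason.
voters = [lambda q, c: "retrieve" if not c else "think"] * 3
trace = evolve_context(
    "who wrote X?",
    voters,
    retriever=lambda q, c: "doc",
    reasoner=lambda q, c: "note",
    max_steps=3,
)
```

With these stand-ins the loop retrieves only at the first step, which mirrors the idea of skipping redundant retrieval once the context already holds evidence.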
TableCache: Primary Foreign Key Guided KV Cache Precomputation for Low Latency Text-to-SQL
Authors: Jinbo Su, Yuxuan Hu, Cuiping Li, Hong Chen, Jia Li, Lintao Ma, Jing Zhang
First: 2026-01-13T17:20:55+00:00 · Latest: 2026-01-13T17:20:55+00:00
Abstract
In Text-to-SQL tasks, existing LLM-based methods often include extensive database schemas in prompts, leading to long context lengths and increased prefilling latency. User queries typically focus on recurrent table sets, which offers an opportunity for KV cache sharing across queries. However, current inference engines, such as SGLang and vLLM, generate redundant prefix cache copies when processing user queries with varying table orders. To address this inefficiency, we propose precomputing table representations as KV caches offline and querying the required ones online. A key aspect of our approach is the computation of table caches while preserving primary foreign key relationships between tables. Additionally, we construct a Table Trie structure to facilitate efficient KV cache lookups during inference. To enhance cache performance, we introduce a cache management system with a query reranking strategy to improve cache hit rates and a computation loading pipeline for parallelizing model inference and cache loading. Experimental results show that our proposed TableCache achieves up to a 3.62x speedup in Time to First Token (TTFT) with negligible performance degradation.
中文标题/摘要
标题:TableCache:基于主外键引导的KV缓存预计算以实现低延迟文本到SQL
在文本到SQL任务中,现有的基于LLM的方法通常在提示中包含广泛的数据库模式,导致上下文长度过长并增加预填充延迟。虽然用户查询通常集中在反复出现的表集上,提供了跨查询共享KV缓存的机会,但当前的推理引擎,如SGLang和vLLM,在处理具有不同表顺序的用户查询时会生成冗余的前缀缓存副本。为了解决这一低效率问题,我们提出在离线计算表表示作为KV缓存,并在线查询所需的缓存。我们方法的关键方面是在保持表之间主外键关系的同时计算表缓存。此外,我们构建了一个表Trie结构,以促进推理期间的高效KV缓存查找。为了提高缓存性能,我们引入了一个缓存管理系统,其中包括查询重排序策略以提高缓存命中率,以及计算加载流水线以并行化模型推理和缓存加载。实验结果表明,我们提出的TableCache在首个标记时间(TTFT)上实现了最高3.62倍的加速,且性能下降可以忽略不计。
Summary / 总结
The paper proposes TableCache, a method for precomputing table representations as KV caches offline to reduce prefilling latency in Text-to-SQL tasks. By preserving primary foreign key relationships and using a Table Trie structure for efficient cache lookups, the approach achieves up to a 3.62x speedup in Time to First Token (TTFT) with minimal performance loss.
论文旨在通过离线预计算表表示作为KV缓存并在线查询所需缓存,降低文本到SQL任务中的预填充延迟,同时保留表之间的主外键关系。该方法使用表Trie结构进行高效的缓存查找,并包含一个带有查询重排序和计算加载流水线的缓存管理系统以实现并行处理。实验结果显示,这种方法可以实现高达3.62倍的首次令牌时间(TTFT)加速,同时性能损失可以忽略不计。
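A trie keyed by table names is one way to picture the cache lookup described in this entry. The sketch below is illustrative only: `TableTrie` and `canonical_order` are hypothetical names, the cache values are placeholder strings rather than real KV tensors, and alphabetical sorting is a stand-in for the primary foreign key guided ordering the paper describes. It shows how a canonical table order lets queries that mention the same tables in different orders hit the same precomputed entry.

```python
class TableTrie:
    """Trie over table-name sequences; a node may hold a precomputed cache handle."""

    def __init__(self):
        self.children = {}
        self.cache = None  # placeholder for a precomputed KV cache handle

    def insert(self, tables, cache):
        """Store a cache handle under the given table sequence."""
        node = self
        for name in tables:
            node = node.children.setdefault(name, TableTrie())
        node.cache = cache

    def longest_prefix(self, tables):
        """Return (cache, depth) for the longest cached prefix of `tables`."""
        node, best, depth = self, None, 0
        for i, name in enumerate(tables, start=1):
            node = node.children.get(name)
            if node is None:
                break
            if node.cache is not None:
                best, depth = node.cache, i
        return best, depth


def canonical_order(tables):
    """Stand-in ordering (alphabetical); the paper's ordering is PK-FK guided."""
    return sorted(tables)


# Usage: the same two tables in a different order reuse one cached entry.
trie = TableTrie()
trie.insert(canonical_order(["orders", "customers"]), "kv_customers_orders")
hit = trie.longest_prefix(canonical_order(["customers", "orders"]))
partial = trie.longest_prefix(canonical_order(["orders", "customers", "products"]))
miss = trie.longest_prefix(canonical_order(["products"]))
```

Returning the matched depth alongside the handle tells the caller how many tables still need to be prefilled after the cached prefix.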
GSAlign: Geometric and Semantic Alignment Network for Aerial-Ground Person Re-Identification
Authors: Qiao Li, Jie Li, Yukang Zhang, Lei Tan, Jing Chen, Jiayi Ji
Venue: NeurIPS 2025
First: 2025-10-25T12:16:10+00:00 · Latest: 2026-01-13T17:19:03+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Aerial-Ground person re-identification (AG-ReID) is an emerging yet challenging task that aims to match pedestrian images captured from drastically different viewpoints, typically from unmanned aerial vehicles (UAVs) and ground-based surveillance cameras. The task poses significant challenges due to extreme viewpoint discrepancies, occlusions, and domain gaps between aerial and ground imagery. While prior works have made progress by learning cross-view representations, they remain limited in handling severe pose variations and spatial misalignment. To address these issues, we propose a Geometric and Semantic Alignment Network (GSAlign) tailored for AG-ReID. GSAlign introduces two key components to jointly tackle geometric distortion and semantic misalignment in aerial-ground matching: a Learnable Thin Plate Spline (LTPS) Module and a Dynamic Alignment Module (DAM). The LTPS module adaptively warps pedestrian features based on a set of learned keypoints, effectively compensating for geometric variations caused by extreme viewpoint changes. In parallel, the DAM estimates visibility-aware representation masks that highlight visible body regions at the semantic level, thereby alleviating the negative impact of occlusions and partial observations in cross-view correspondence. A comprehensive evaluation on CARGO with four matching protocols demonstrates the effectiveness of GSAlign, achieving significant improvements of +18.8% in mAP and +16.8% in Rank-1 accuracy over previous state-of-the-art methods on the aerial-ground setting.
中文标题/摘要
标题:GSAlign:用于空地行人再识别的几何和语义对齐网络
空地行人再识别(AG-ReID)是一项新兴且具有挑战性的任务,旨在匹配从无人机和地面监控摄像头截获的视角差异极大的行人图像。由于视角差异极大、遮挡和空地图像之间的领域差距,该任务面临重大挑战。尽管先前的工作通过学习跨视角表示取得了进展,但它们在处理严重的姿态变化和空间对齐问题方面仍然有限。为了解决这些问题,我们提出了一种针对AG-ReID的几何和语义对齐网络(GSAlign)。GSAlign引入了两个关键组件,以同时解决空地匹配中的几何失真和语义对齐问题:可学习的薄板样条(LTPS)模块和动态对齐模块(DAM)。LTPS模块根据一组学习到的关键点自适应地扭曲行人特征,有效补偿了极端视角变化引起的几何变化。同时,DAM估计了基于语义的可见性感知表示掩码,突出显示跨视角对应中的可见身体区域,从而减轻了遮挡和部分观察的负面影响。在CARGO上使用四种匹配协议进行的全面评估表明,GSAlign的有效性,相对于先前的最先进方法,在空地设置中实现了mAP提高18.8%和Rank-1精度提高16.8%。
Summary / 总结
GSAlign is designed to address the challenges of aerial-ground person re-identification by aligning geometric and semantic features. It introduces a Learnable Thin Plate Spline (LTPS) Module to handle geometric distortions and a Dynamic Alignment Module (DAM) to manage semantic misalignment. Experimental results on the CARGO dataset show that GSAlign outperforms previous methods, improving mAP by 18.8% and Rank-1 accuracy by 16.8%.
研究旨在通过提出GSAlign,一种几何和语义对齐网络,解决空地行人再识别(AG-ReID)的挑战。GSAlign 包括一个可学习的薄板样条(LTPS)模块进行几何失真校正和一个动态对齐模块(DAM)进行语义对齐。LTPS模块基于学习到的关键点适应性地调整行人特征,而DAM估计可见性感知的掩码以突出可见的身体区域。在CARGO数据集上的实验表明,GSAlign将mAP提高了18.8%,Rank-1准确率提高了16.8%,优于先前的方法。
Inferring Latent Intentions: Attributional Natural Language Inference in LLM Agents
Authors: Xin Quan, Jiafeng Xiong, Marco Valentino, André Freitas
First: 2026-01-13T17:18:38+00:00 · Latest: 2026-01-13T17:18:38+00:00
Abstract
Attributional inference, the ability to predict latent intentions behind observed actions, is a critical yet underexplored capability for large language models (LLMs) operating in multi-agent environments. Traditional natural language inference (NLI), in fact, fails to capture the nuanced, intention-driven reasoning essential for complex interactive systems. To address this gap, we introduce Attributional NLI (Att-NLI), a framework that extends NLI with principles from social psychology to assess an agent's capacity for abductive intentional inference (generating hypotheses about latent intentions), and subsequent deductive verification (drawing valid logical conclusions). We instantiate Att-NLI via a textual game, Undercover-V, experimenting with three types of LLM agents with varying reasoning capabilities and access to external tools: a standard NLI agent using only deductive inference, an Att-NLI agent employing abductive-deductive inference, and a neuro-symbolic Att-NLI agent performing abductive-deductive inference with external theorem provers. Extensive experiments demonstrate a clear hierarchy of attributional inference capabilities, with neuro-symbolic agents consistently outperforming others, achieving an average win rate of 17.08%. Our results underscore the role that Att-NLI can play in developing agents with sophisticated reasoning capabilities, highlighting, at the same time, the potential impact of neuro-symbolic AI in building rational LLM agents acting in multi-agent environments.
中文标题/摘要
标题:推断潜在意图:LLM代理中的归因自然语言推理
归因推理,预测观察到的行为背后的潜在意图的能力,是大型语言模型(LLMs)在多代理环境中运行时至关重要的但尚未充分探索的能力。事实上,传统的自然语言推理(NLI)无法捕捉到复杂交互系统中至关重要的意图驱动的推理。为了解决这一差距,我们引入了归因NLI(Att-NLI)框架,该框架通过从社会心理学中引入原则扩展了NLI,以评估代理进行溯因意图推理(生成关于潜在意图的假设)和随后的演绎验证(得出有效的逻辑结论)的能力。我们通过文本游戏Undercover-V实例化Att-NLI,实验了三种具有不同推理能力和外部工具访问权限的LLM代理:仅使用演绎推理的标准NLI代理,使用溯因-演绎推理的Att-NLI代理,以及使用外部定理证明器进行溯因-演绎推理的神经符号Att-NLI代理。广泛的实验表明,归因推理能力存在明显的层次结构,神经符号代理始终优于其他代理,平均胜率为17.08%。我们的结果强调了Att-NLI在开发具有复杂推理能力的代理中的作用,同时突显了神经符号AI在构建多代理环境中理性LLM代理方面的潜在影响。
Summary / 总结
The research aims to enhance large language models (LLMs) in predicting latent intentions behind observed actions, addressing the limitations of traditional natural language inference (NLI). It introduces Attributional NLI (Att-NLI), which extends NLI with social psychology principles to assess abductive and deductive reasoning capabilities. Through the textual game Undercover-V, three types of LLM agents were tested: a standard NLI agent, an Att-NLI agent, and a neuro-symbolic Att-NLI agent. The neuro-symbolic agent consistently outperformed others, achieving an average win rate of 17.08%, demonstrating the potential of Att-NLI in developing sophisticated reasoning capabilities in LLMs.
研究旨在通过引入 Attributional Natural Language Inference (Att-NLI) 来增强大型语言模型 (LLMs),Att-NLI 扩展了传统的 NLI,结合社会心理学原则来推断潜在意图。研究评估了三种类型的 LLM 代理:标准 NLI 代理、Att-NLI 代理和神经符号 Att-NLI 代理。实验表明,神经符号代理的表现优于其他代理,平均胜率为 17.08%,表明 Att-NLI 在开发具有复杂推理能力的 LLM 代理方面的潜力,这些代理能够在多代理环境中发挥作用。