arXiv Paper Digest

Snapshot: 20260317_0403

Visual-ERM: Reward Modeling for Visual Equivalence

Authors: Ziyu Liu, Shengyuan Ding, Xinyu Fang, Xuanlang Dai, Penghui Yang, Jianze Liang, Jiaqi Wang, Kai Chen, Dahua Lin, Yuhang Zang

First: 2026-03-13T17:58:14+00:00 · Latest: 2026-03-13T17:58:14+00:00

Comments: Project: https://github.com/InternLM/Visual-ERM

Abstract

Vision-to-code tasks require models to reconstruct structured visual inputs, such as charts, tables, and SVGs, into executable or structured representations with high visual fidelity. While recent Large Vision Language Models (LVLMs) achieve strong results via supervised fine-tuning, reinforcement learning remains challenging due to misaligned reward signals. Existing rewards either rely on textual rules or coarse visual embedding similarity, both of which fail to capture fine-grained visual discrepancies and are vulnerable to reward hacking. We propose Visual Equivalence Reward Model (Visual-ERM), a multimodal generative reward model that provides fine-grained, interpretable, and task-agnostic feedback to evaluate vision-to-code quality directly in the rendered visual space. Integrated into RL, Visual-ERM improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing (+2.7, +4.1 on average), and further strengthens test-time scaling via reflection and revision. We also introduce VisualCritic-RewardBench (VC-RewardBench), a benchmark for judging fine-grained image-to-image discrepancies on structured visual data, where Visual-ERM at 8B decisively outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models. Our results suggest that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code RL, regardless of task specificity.

中文标题/摘要

标题：Visual-ERM：视觉等价性奖励建模

视觉到代码任务要求模型将结构化的视觉输入，如图表、表格和SVG，重构为具有高视觉保真的可执行或结构化表示。虽然最近的大规模视觉语言模型（LVLM）通过监督微调取得了出色的结果，但强化学习仍然具有挑战性，因为奖励信号存在对齐问题。现有的奖励要么依赖于文本规则，要么依赖粗略的视觉嵌入相似性，这两种方法都无法捕捉细微的视觉差异，并且容易受到奖励作弊的影响。我们提出了视觉等价性奖励模型（Visual-ERM），这是一种多模态生成奖励模型，能够提供细微的、可解释的和任务无关的反馈，直接在渲染的视觉空间中评估视觉到代码的质量。将Visual-ERM集成到强化学习中，可以提高Qwen3-VL-8B-Instruct在图表到代码任务上的表现（提高8.4分），并在表格和SVG解析任务上获得一致的改进（平均提高2.7分和4.1分），并通过反思和修订进一步增强了测试时的扩展性。我们还引入了VisualCritic-RewardBench（VC-RewardBench），这是一个用于评估结构化视觉数据上细微图像差异的基准，其中Visual-ERM在8B规模下显著优于Qwen3-VL-235B-Instruct，并接近领先的企业级模型。我们的结果表明，无论任务具体性如何，细微的视觉奖励监督都是视觉到代码强化学习所必需且足够的。

Summary / 总结

Visual-ERM is a multimodal generative reward model designed to provide fine-grained, interpretable, and task-agnostic feedback for vision-to-code tasks. Integrated into RL, it improves Qwen3-VL-8B-Instruct by +8.4 on chart-to-code and yields consistent gains on table and SVG parsing, with average improvements of +2.7 and +4.1, respectively. At 8B parameters, Visual-ERM also outperforms Qwen3-VL-235B-Instruct and approaches leading closed-source models on VC-RewardBench (VisualCritic-RewardBench), which the authors take as evidence that fine-grained visual reward supervision is both necessary and sufficient for vision-to-code reinforcement learning.

Visual-ERM 是一个多模态生成奖励模型，旨在为视觉到代码任务提供细粒度、可解释且任务无关的反馈。它使 Qwen3-VL-8B-Instruct 模型在图表到代码任务上提高了 8.4 分，并在表格和 SVG 解析上保持一致的改进。Visual-ERM 在 VisualCritic-RewardBench 基准测试中也超过了更大的模型，并接近领先的企业级模型，表明细粒度的视觉奖励监督对于视觉到代码的强化学习至关重要。

Automatic In-Domain Exemplar Construction and LLM-Based Refinement of Multi-LLM Expansions for Query Expansion

Authors: Minghan Li, Ercong Nie, Siqi Zhao, Tongna Chen, Huiping Huang, Guodong Zhou

First: 2026-02-09T17:16:39+00:00 · Latest: 2026-03-13T17:55:59+00:00

Comments: Preprint. This paper is under consideration at Pattern Recognition Letters

Abstract

Query expansion with large language models is promising but often relies on hand-crafted prompts, manually chosen exemplars, or a single LLM, making it non-scalable and sensitive to domain shift. We present an automated, domain-adaptive QE framework that builds in-domain exemplar pools by harvesting pseudo-relevant passages using a BM25-MonoT5 pipeline. A training-free cluster-based strategy selects diverse demonstrations, yielding strong and stable in-context QE without supervision. To further exploit model complementarity, we introduce a two-LLM ensemble in which two heterogeneous LLMs independently generate expansions and a refinement LLM consolidates them into one coherent expansion. Across TREC DL20, DBPedia, and SciFact, the refined ensemble delivers consistent and statistically significant gains over BM25, Rocchio, zero-shot, and fixed few-shot baselines. The framework offers a reproducible testbed for exemplar selection and multi-LLM generation, and a practical, label-free solution for real-world QE.

中文标题/摘要

标题：自动领域内示例构建及基于LLM的多LLM扩展精炼用于查询扩展

使用大型语言模型进行查询扩展前景广阔，但通常依赖于手工制作的提示、手动选择的示例或单一的LLM，这使其难以扩展且对领域转移敏感。我们提出了一种自动的领域自适应查询扩展框架，通过使用BM25-MonoT5流水线采集伪相关段落来构建领域内示例池。无训练的基于聚类的策略选择多样化的演示，从而在无监督的情况下获得强大且稳定的上下文查询扩展。为了进一步利用模型互补性，我们引入了一种两LLM集成方法，在该方法中，两个异构的LLM独立生成扩展，而精炼LLM将它们整合成一个连贯的扩展。在TREC DL20、DBPedia和SciFact上，精炼的集成在BM25、Rocchio、零样本和固定少量样本基线上提供了持续且统计上显著的改进。该框架提供了一个可重复的示例选择和多LLM生成测试平台，并为实际查询扩展提供了一种实用的、无需标注的解决方案。

Summary / 总结

The research addresses the limitations of existing query expansion methods, which depend on hand-crafted prompts, manually chosen exemplars, or a single LLM, by proposing an automated, domain-adaptive framework. It uses a BM25-MonoT5 pipeline to construct in-domain exemplar pools and a training-free cluster-based strategy to select diverse demonstrations. Two heterogeneous LLMs then generate expansions independently, and a refinement LLM consolidates them into one expansion. The refined ensemble shows consistent and statistically significant improvements over BM25, Rocchio, zero-shot, and fixed few-shot baselines across TREC DL20, DBPedia, and SciFact.

研究针对依赖手工构造提示和单一LLM的方法存在的非可扩展性和领域漂移敏感性问题。提出了一种自动化框架，通过BM25-MonoT5管道构建领域内示例池，并通过基于聚类的策略选择多样化的演示。该框架使用两LLM的组合来生成和精炼查询扩展，实现了在TREC DL20、DBPedia和SciFact数据集上相对于现有方法的一致和显著改进。
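
The cluster-based demonstration selection described in this entry can be illustrated with a small sketch. The abstract does not say which clustering algorithm, embedding model, or selection rule is used, so everything below is an assumption: plain k-means over precomputed exemplar embeddings, a deterministic farthest-point initialisation, and picking the pool item closest to each centroid. The function name `select_diverse_exemplars` is hypothetical.

```python
import math


def _dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def select_diverse_exemplars(embeddings, k, iters=20):
    """Return sorted indices of k exemplars, one per k-means cluster.

    Assumption: `embeddings` are precomputed vectors for the exemplar pool;
    the paper's actual clustering and selection rule may differ.
    """
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and the pool size")

    # Deterministic farthest-point initialisation (an illustrative choice).
    centroids = [list(embeddings[0])]
    while len(centroids) < k:
        far = max(embeddings, key=lambda e: min(_dist(e, c) for c in centroids))
        centroids.append(list(far))

    # Standard Lloyd iterations: assign points, then recompute centroids.
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for e in embeddings:
            j = min(range(k), key=lambda c: _dist(e, centroids[c]))
            clusters[j].append(e)
        # An empty cluster keeps its previous centroid.
        centroids = [_mean(c) if c else centroids[j] for j, c in enumerate(clusters)]

    # Pick the not-yet-chosen pool item closest to each centroid.
    chosen = []
    for c in centroids:
        idx = min(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: _dist(embeddings[i], c),
        )
        chosen.append(idx)
    return sorted(chosen)
```

With a pool that forms three well-separated groups, calling `select_diverse_exemplars(pool, 3)` returns one index from each group, which is the diversity property the abstract attributes to cluster-based selection.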

MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation

Authors: Ronglai Zuo, Rolandos Alexandros Potamias, Qi Sun, Evangelos Ververas, Jiankang Deng, Stefanos Zafeiriou

First: 2026-01-27T13:06:47+00:00 · Latest: 2026-03-13T17:48:22+00:00

Abstract

Sign language generation (SLG) aims to translate written texts into expressive sign motions, bridging communication barriers for the Deaf and Hard-of-Hearing communities. Recent studies formulate SLG within the language modeling framework using autoregressive language models, which suffer from unidirectional context modeling and slow token-by-token inference. To address these limitations, we present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional dependencies and supports efficient parallel multi-token generation. We further introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-, and 3D physical-space objectives to leverage complementary, multi-level sign representations. To accelerate model convergence in the fine-tuning stage, we design a novel unmasking strategy with temporal checkpoints, which restructures generation in a coarse-to-fine manner and reduces the combinatorial complexity of unmasking orders by over $10^{41}$ times. In addition, a mixture-of-parts embedding layer is developed to effectively fuse information stored in different part-wise sign tokens through a learnable gate and well-optimized codebooks. Extensive experiments on CSL-Daily, Phoenix-2014T, and How2Sign demonstrate that MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while delivering 40% higher throughput. Code and models will be publicly released.

中文标题/摘要

标题：MaDiS: 控制掩码扩散语言模型进行手语生成

手语生成（SLG）旨在将书面文本翻译成富有表现力的手势动作，为聋人和听力障碍者社区架起沟通桥梁。近期研究将SLG建模为语言模型框架，使用自回归语言模型，但这些模型存在单向上下文建模和逐个标记推理速度慢的问题。为解决这些问题，我们提出了MaDiS，一种基于掩码扩散的语言模型，能够捕捉双向依赖关系并支持高效的并行多标记生成。我们还引入了一种三级跨模态预训练方案，联合学习标记级、潜在级和三维物理空间级的目标，以利用互补的多级手语表示。为了在微调阶段加速模型收敛，我们设计了一种新颖的去掩码策略，带有时间检查点，以粗到细的方式重新构建生成过程，并将去掩码顺序的组合复杂度降低了超过$10^{41}$倍。此外，我们开发了一种部分混合嵌入层，通过可学习门控和优化的码本有效融合存储在不同部分标记中的信息。在CSL-Daily、Phoenix-2014T和How2Sign上的广泛实验表明，MaDiS在多个指标上，包括DTW误差和两个新引入的指标SiBLEU和SiCLIP，均表现出优越的性能，同时提供40%更高的吞吐量。代码和模型将公开发布。

Summary / 总结

MaDiS is a masked-diffusion-based language model for sign language generation that addresses the limitations of autoregressive models by capturing bidirectional dependencies and supporting efficient parallel multi-token generation. It uses a tri-level cross-modal pretraining scheme and a novel unmasking strategy with temporal checkpoints to improve model performance and convergence. Experiments show that MaDiS outperforms existing methods on multiple metrics and achieves a 40% higher throughput.

MaDiS 是一种基于掩码扩散的语言模型，用于手语生成，通过捕获双向依赖关系和支持高效的并行多令牌生成来解决自回归模型的限制。它使用了三级跨模态预训练方案和一种具有时间检查点的新颖解掩策略，以提高性能和吞吐量。实验表明，MaDiS 在多个指标上，包括 DTW 错误和 SiBLEU，优于现有模型，吞吐量提高了 40%。
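
The abstract reports that temporal checkpoints reduce the combinatorial complexity of unmasking orders by over $10^{41}$ times but does not show the counting. The toy sketch below is only one way such a reduction can arise and is not the paper's computation. It assumes that unconstrained one-token-at-a-time unmasking of n tokens admits n! orders, and that checkpoints split the tokens into stages that must be completed in sequence, leaving the product of the per-stage factorials. The stage sizes are hypothetical.

```python
from math import factorial, prod


def unmasking_orders(stage_sizes):
    """Number of unmasking orders when stages must be completed in sequence.

    Assumption: within a stage, tokens may be unmasked in any order.
    """
    return prod(factorial(s) for s in stage_sizes)


def reduction_factor(stage_sizes):
    """Ratio of unconstrained orders (n!) to stage-constrained orders."""
    total = sum(stage_sizes)
    # The division is exact: the ratio is a multinomial coefficient.
    return factorial(total) // unmasking_orders(stage_sizes)
```

For example, `reduction_factor([2, 2])` returns 6: four tokens have 24 unconstrained orders, but only 4 remain once the first pair must be finished before the second. The ratio grows very quickly with sequence length, which is consistent with the scale of reduction the abstract reports.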

Neuron-Aware Data Selection In Instruction Tuning For Large Language Models

Authors: Xin Chen, Junchao Wu, Shu Yang, Runzhe Zhan, Zeyu Wu, Min Yang, Shujian Huang, Lidia S. Chao, Derek F. Wong

First: 2026-03-13T17:39:03+00:00 · Latest: 2026-03-13T17:39:03+00:00

Abstract

Instruction Tuning (IT) has been proven to be an effective approach to unlock the powerful capabilities of large language models (LLMs). Recent studies indicate that excessive IT data can degrade LLM performance, while carefully selecting a small subset of high-quality IT data can significantly enhance their capabilities. Therefore, identifying the most effective subset of an IT dataset to develop either specific or general abilities in LLMs has become a critical challenge. To address this, we propose a novel and efficient framework called NAIT. NAIT evaluates the impact of IT data on LLM performance by analyzing the similarity of neuron activation patterns between the IT dataset and the target domain capability. Specifically, NAIT captures neuron activation patterns from in-domain datasets of target domain capabilities to construct reusable and transferable neuron activation features. It then evaluates and selects optimal samples based on the similarity between candidate samples and the expected activation features of the target capabilities. Experimental results show that training on the 10% Alpaca-GPT4 IT data subset selected by NAIT consistently outperforms methods that rely on external advanced models or uncertainty-based features across various tasks. Our findings also reveal the transferability of neuron activation features across different capabilities of LLMs. In particular, IT data with more logical reasoning and programmatic features possesses strong general transferability, enabling models to develop stronger capabilities across multiple tasks, while a stable core subset of data is sufficient to consistently activate fundamental model capabilities and universally improve performance across diverse tasks.

中文标题/摘要

标题：神经元感知的数据选择在大型语言模型指令调优中的应用

指令调优（IT）已被证明是解锁大型语言模型（LLMs）强大能力的有效方法。近期研究表明，过度的IT数据会降低LLMs的性能，而精心选择一小部分高质量的IT数据可以显著增强其能力。因此，从IT数据集中识别出最有效的子集数据，以有效地开发LLMs的特定或通用能力，已成为一个关键挑战。为了解决这一问题，我们提出了一种新颖且高效的框架——NAIT。NAIT通过分析IT数据集与目标领域能力之间神经元激活模式的相似性来评估IT数据对LLMs性能的影响。具体而言，NAIT从目标领域能力的领域内数据集中捕获神经元激活模式，以构建可重用和可转移的神经元激活特征。然后，基于候选样本与目标能力预期激活特征之间的相似性，评估并选择最优样本。实验结果表明，使用NAIT选择的10% Alpaca-GPT4 IT数据子集进行训练，在各种任务中始终优于依赖外部高级模型或基于不确定性特征的方法。我们的研究结果还揭示了神经元激活特征在不同LLMs能力之间的可转移性。特别是，具有更多逻辑推理和编程特征的IT数据具有强大的通用可转移性，使模型能够在多个任务中发展更强的能力，而稳定的子集数据足以持续激活基本模型能力并在多种任务中普遍提高性能。

Summary / 总结

The paper addresses the challenge of selecting efficient subsets of high-quality data for Instruction Tuning (IT) in large language models (LLMs) to enhance their capabilities. It introduces NAIT, a framework that evaluates IT data by analyzing neuron activation patterns. Experiments show that training on a 10% subset of Alpaca-GPT4 IT data selected by NAIT outperforms other methods across various tasks, highlighting the transferability of neuron activation features and the importance of logical reasoning and programmatic features in IT data selection.

论文提出NAIT框架，用于选择高效的指令调优数据子集以增强大型语言模型（LLM）。NAIT通过分析神经元激活模式来评估数据的影响，并根据样本与预期激活特征的相似度选择样本。实验表明，使用NAIT选择的Alpaca-GPT4 IT数据子集（10%）在各种任务中的表现优于其他方法，突出了神经元激活特征的可转移性和该神经元感知选择方法的有效性。
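
The selection step described for NAIT, ranking candidate samples by how closely their activation pattern matches a target-capability feature and keeping the top fraction, can be sketched as follows. The abstract does not specify how activations are extracted, how the target feature is built, or which similarity measure is used. Cosine similarity, the flat activation vectors, and the function names here are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_by_activation_similarity(candidate_acts, target_feature, fraction=0.10):
    """Return indices of the top `fraction` of candidates by similarity.

    Assumptions: each candidate is summarised by one activation vector, and
    `target_feature` is a single vector representing the target capability.
    """
    keep = max(1, math.ceil(fraction * len(candidate_acts)))
    ranked = sorted(
        range(len(candidate_acts)),
        key=lambda i: cosine(candidate_acts[i], target_feature),
        reverse=True,
    )
    return ranked[:keep]
```

The default `fraction=0.10` mirrors the 10% subset reported in the abstract. Because the target feature is computed once and reused for every candidate, a sketch like this also reflects why the abstract calls the features reusable.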

Superficial Safety Alignment Hypothesis

Authors: Jianwei Li, Jung-Eun Kim

Venue: ICLR 2026

First: 2024-10-07T19:53:35+00:00 · Latest: 2026-03-13T17:29:36+00:00

Comments: ICLR 2026

Abstract

As large language models (LLMs) are increasingly integrated into various applications, ensuring they generate safe responses is a pressing need. Previous studies on alignment have largely focused on general instruction-following but have often overlooked the distinct properties of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment teaches an otherwise unsafe model to choose the correct reasoning direction (fulfill or refuse users' requests), interpreted as an implicit binary classification task. Through SSAH, we hypothesize that only a few essential components can establish safety guardrails in LLMs. We successfully identify four types of attribute-critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Similarly, we show that leveraging redundant units in the pre-trained model as an "alignment budget" can effectively minimize the alignment tax while achieving the alignment goal. All considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. We have code implementation and other information on the project website: https://ssa-h.github.io/.

中文标题/摘要

标题：表层安全性对齐假设

随着大型语言模型（LLMs）越来越多地被整合到各种应用中，确保它们生成安全的响应变得迫在眉睫。以往关于对齐的研究主要集中在通用指令遵循上，但往往忽视了安全性对齐的独特属性，如安全机制的脆弱性。为弥补这一差距，我们提出了表层安全性对齐假设（SSAH），认为安全性对齐教会了一个原本不安全的模型选择正确的推理方向——满足或拒绝用户请求，这被视为一个隐式的二元分类任务。通过SSAH，我们假设只有少数关键组件可以在LLMs中建立安全防护栏。我们成功地识别了四种类型的关键属性组件：安全关键单元（SCU）、效用关键单元（UCU）、复杂单元（CU）和冗余单元（RU）。我们的研究发现，在微调过程中冻结某些安全关键组件可以让模型保留其安全属性，同时适应新任务。同样，我们展示了利用预训练模型中的冗余单元作为“对齐预算”可以有效降低对齐税，同时实现对齐目标。总之，本文得出结论，LLMs中的原子功能单元是在神经元层面，并强调安全性对齐不应过于复杂。有关代码实现和其他信息，请参见项目网站：https://ssa-h.github.io/。

Summary / 总结

The research aims to ensure the safety of responses generated by large language models (LLMs) by proposing the Superficial Safety Alignment Hypothesis (SSAH), which treats safety alignment as an implicit binary classification task. The study identifies four types of critical components: Safety Critical Unit (SCU), Utility Critical Unit (UCU), Complex Unit (CU), and Redundant Unit (RU). The findings suggest that freezing certain safety-critical components during fine-tuning helps maintain safety attributes while adapting to new tasks. Additionally, using redundant units as an 'alignment budget' minimizes the alignment tax. The paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and emphasizes that safety alignment should not be overly complex.

本文提出了Superficial Safety Alignment Hypothesis (SSAH)，将安全对齐视为隐式的二元分类任务，以确保大型语言模型（LLMs）的安全性。作者识别了四种关键组件，并展示了在微调过程中冻结某些安全关键单元可以保持安全属性的同时适应新任务。他们还证明，使用冗余单元作为“对齐预算”可以减少对齐税。研究得出结论，LLMs 中的安全原子功能单元是神经元级别，且安全对齐不应过于复杂。

Large language models show fragile cognitive reasoning about human emotions

Authors: Sree Bhattacharyya, Evgenii Kuriabov, Lucas Craig, Tharun Dilliraj, Reginald B. Adams, Jia Li, James Z. Wang

Venue: NeurIPS 2025

First: 2025-08-07T22:19:15+00:00 · Latest: 2026-03-13T17:27:05+00:00

Comments: Under Review, a version was presented at WiML Workshop @ NeurIPS 2025

Abstract

Affective computing seeks to support the holistic development of artificial intelligence by enabling machines to engage with human emotion. Recent foundation models, particularly large language models (LLMs), have been trained and evaluated on emotion-related tasks, typically using supervised learning with discrete emotion labels. Such evaluations largely focus on surface phenomena, such as recognizing expressed or evoked emotions, leaving open whether these systems reason about emotion in cognitively meaningful ways. Here we ask whether LLMs can reason about emotions through underlying cognitive dimensions rather than labels alone. Drawing on cognitive appraisal theory, we introduce CoRE, a large-scale benchmark designed to probe the implicit cognitive structures LLMs use when interpreting emotionally charged situations. We assess alignment with human appraisal patterns, internal consistency, cross-model generalization, and robustness to contextual variation. We find that LLMs capture systematic relations between cognitive appraisals and emotions but show misalignment with human judgments and instability across contexts.

中文标题/摘要

标题：大型语言模型在人类情感认知推理方面表现脆弱

情感计算旨在通过使机器能够与人类情感互动来支持人工智能的全面发展。最近的基础模型，尤其是大型语言模型（LLMs），已被训练并用于情感相关任务，通常使用监督学习和离散情感标签。这些评估主要集中在表面现象上，如识别表达或引发的情感，而未明确这些系统是否以认知有意义的方式进行情感推理。在这里，我们探讨LLMs是否能够通过认知维度而非仅标签来推理情感。基于认知评估理论，我们引入了CoRE，这是一个大规模基准，旨在探究LLMs在解释情感化情境时所使用的隐含认知结构。我们评估了与人类评估模式的契合度、内部一致性、跨模型泛化能力和对上下文变化的稳健性。我们发现，LLMs捕捉了认知评估与情感之间的系统关系，但与人类判断存在偏差，并且在不同情境下表现出不稳定性。

Summary / 总结

This study investigates whether large language models (LLMs) can reason about emotions through cognitive dimensions rather than just labels. Using CoRE, a new benchmark based on cognitive appraisal theory, the research evaluates LLMs on their alignment with human judgments, internal consistency, cross-model generalization, and robustness to context. The key findings show that while LLMs capture systematic relations between cognitive appraisals and emotions, they often misalign with human judgments and exhibit instability across different contexts.

研究探讨了大型语言模型（LLMs）是否能够通过认知维度而非仅标签来推理情绪。通过基于认知评估理论的新基准CoRE，研究评估了LLMs在与人类判断的契合度、内部一致性、跨模型泛化能力和对上下文变化的鲁棒性方面的表现。主要发现表明，尽管LLMs能够捕捉认知评估与情绪之间的系统关系，但它们往往与人类判断不一致，并且在不同情境下表现出不稳定性。

From Experiments to Expertise: Scientific Knowledge Consolidation for AI-Driven Computational Research

Authors: Haonan Huang

First: 2026-03-13T17:25:47+00:00 · Latest: 2026-03-13T17:25:47+00:00

Abstract

While large language models (LLMs) have transformed AI agents into proficient executors of computational materials science, performing a hundred simulations does not make a researcher. What distinguishes research from routine execution is the progressive accumulation of knowledge: learning which approaches fail, recognizing patterns across systems, and applying understanding to new problems. However, the prevailing paradigm in AI-driven computational science treats each execution in isolation, largely discarding hard-won insights between runs. Here we present QMatSuite, an open-source platform that closes this gap. Agents record findings with full provenance, retrieve knowledge before new calculations, and, in dedicated reflection sessions, correct erroneous findings and synthesize observations into cross-compound patterns. In benchmarks on a six-step quantum-mechanical simulation workflow, accumulated knowledge reduces reasoning overhead by 67% and reduces deviation from literature values from 47% to 3%; when transferred to an unfamiliar material, it achieves 1% deviation with zero pipeline failures.

中文标题/摘要

标题：从实验到专长：面向AI驱动计算研究的科学知识整合

虽然大型语言模型（LLMs）已经将AI代理转变为计算材料科学的熟练执行者，但仅仅执行一百次模拟并不能造就一名研究人员。研究与常规执行的区别在于知识的逐步积累——学习哪些方法失败，识别不同系统中的模式，并将理解应用于新问题。然而，在AI驱动的计算科学中，当前的范式将每次执行视为孤立的事件，很大程度上忽略了运行之间的宝贵见解。在这里，我们介绍了QMatSuite，这是一个开源平台，填补了这一空白。代理记录完整的发现过程，检索知识以进行新的计算，在专门的反思会中纠正错误的发现并将观察结果综合为跨化合物模式。在六步量子力学模拟工作流的基准测试中，积累的知识将推理开销减少了67%，并将准确性从47%的偏差提高到3%的偏差——当转移到不熟悉的材料时，实现了1%的偏差且无管道失败。

Summary / 总结

This paper addresses the gap between AI execution and researcher expertise by introducing QMatSuite, an open-source platform that enables agents to record findings with full provenance, retrieve knowledge before new calculations, and reflect on their findings to synthesize cross-compound patterns. In benchmarks, accumulated knowledge reduced reasoning overhead by 67% and improved accuracy from 47% to 3% deviation from literature, and when transferred to an unfamiliar material, achieved 1% deviation with zero pipeline failures.

研究旨在通过积累从模拟中获得的知识来提升AI驱动的计算研究，QMatSuite是一个开源平台，允许代理记录发现、检索知识并在反思会中纠正错误。在基准测试中，积累的知识将推理开销减少了67%，并将准确性从47%的偏差提高到3%。当转移到不熟悉的材料时，它实现了1%的偏差且没有管道失败。

LLM Constitutional Multi-Agent Governance

Authors: J. de Curtò, I. de Zarzà

First: 2026-03-13T17:21:26+00:00 · Latest: 2026-03-13T17:21:26+00:00

Comments: Accepted for publication in 20th International Conference on Agents and Multi-Agent Systems: Technologies and Applications (AMSTA 2026), to appear in Springer Nature proceedings (KES Smart Innovation Systems and Technologies). The final authenticated version will be available online at Springer

Abstract

Large Language Models (LLMs) can generate persuasive influence strategies that shift cooperative behavior in multi-agent populations, but a critical question remains: does the resulting cooperation reflect genuine prosocial alignment, or does it mask erosion of agent autonomy, epistemic integrity, and distributional fairness? We introduce Constitutional Multi-Agent Governance (CMAG), a two-stage framework that interposes between an LLM policy compiler and a networked agent population, combining hard constraint filtering with soft penalized-utility optimization that balances cooperation potential against manipulation risk and autonomy pressure. We propose the Ethical Cooperation Score (ECS), a multiplicative composite of cooperation, autonomy, integrity, and fairness that penalizes cooperation achieved through manipulative means. In experiments on scale-free networks of 80 agents under adversarial conditions (70% violating candidates), we benchmark three regimes: full CMAG, naive filtering, and unconstrained optimization. While unconstrained optimization achieves the highest raw cooperation (0.873), it yields the lowest ECS (0.645) due to severe autonomy erosion (0.867) and fairness degradation (0.888). CMAG attains an ECS of 0.741, a 14.9% improvement, while preserving autonomy at 0.985 and integrity at 0.995, with only modest cooperation reduction to 0.770. The naive ablation (ECS = 0.733) confirms that hard constraints alone are insufficient. Pareto analysis shows CMAG dominates the cooperation-autonomy trade-off space, and governance reduces hub-periphery exposure disparities by over 60%. These findings establish that cooperation is not inherently desirable without governance: constitutional constraints are necessary to ensure that LLM-mediated influence produces ethically stable outcomes rather than manipulative equilibria.

中文标题/摘要

标题：大型语言模型宪法多智能体治理

大型语言模型（LLMs）可以生成有说服力的影响策略，改变多智能体群体中的合作行为，但一个关键问题仍然存在：这种合作是否反映了真正的亲社会一致性，还是掩盖了智能体自主权、知识完整性和分配公平性的侵蚀？我们提出了宪法多智能体治理（CMAG），这是一种两阶段框架，介于LLM策略编译器和网络化智能体群体之间，结合了硬约束过滤和软惩罚效用优化，平衡了合作潜力与操纵风险和自主权压力。我们提出了伦理合作评分（ECS），这是一种合作、自主权、完整性和公平性的乘积复合体，惩罚通过操纵手段实现的合作。在70%的候选者违反条件的无尺度网络中，我们对80个智能体进行了实验，基准测试了三种模式：完整CMAG、简单过滤和无约束优化。虽然无约束优化实现了最高的原始合作（0.873），但由于严重的自主权侵蚀（0.867）和公平性退化（0.888），其ECS最低（0.645）。CMAG实现了0.741的ECS，提高了14.9%，同时保持了0.985的自主权和0.995的完整性，合作减少到0.770。简单的消融实验（ECS = 0.733）证实了仅靠硬约束是不够的。帕累托分析显示，CMAG在合作-自主权权衡空间中占优，治理减少了中心-边缘暴露差异超过60%。这些发现表明，没有治理的合作并非固然是可取的：宪法约束是必要的，以确保LLM介导的影响产生伦理稳定的成果，而不是操纵性均衡。

Summary / 总结

The research addresses the ethical concerns of cooperation driven by Large Language Models (LLMs) in multi-agent systems, proposing a framework called Constitutional Multi-Agent Governance (CMAG) to balance cooperation potential with autonomy and integrity. CMAG uses a two-stage process combining hard constraints and soft optimization. Experiments on scale-free networks show that while unconstrained optimization maximizes raw cooperation, it severely erodes autonomy and fairness. CMAG, with an Ethical Cooperation Score (ECS) of 0.741, outperforms both naive filtering and unconstrained optimization by preserving autonomy and integrity while maintaining a high cooperation level of 0.770.

研究旨在解决使用大型语言模型（LLMs）影响多智能体系统中合作行为时的伦理问题，重点关注保护自主权、知识完整性和公平性。研究引入了宪法多智能体治理（CMAG）框架，结合硬约束过滤和软惩罚效用优化。实验表明，虽然无约束优化实现了最高的原始合作，但它严重侵蚀了自主权和公平性。CMAG 的伦理合作得分为 0.741，更好地平衡了合作与自主权，同时保持了高知识完整性和自主权，仅适度减少了合作。研究结果强调了宪法约束的必要性，以确保LLM介导的影响产生伦理稳定的成果，而不是操纵性均衡。
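
The abstract describes the Ethical Cooperation Score (ECS) as a multiplicative composite of cooperation, autonomy, integrity, and fairness. The sketch below assumes the simplest reading, an unweighted product of the four factors. The paper may use weights or exponents that the abstract does not state. It also reproduces the reported 14.9% figure as the relative gain of CMAG's ECS (0.741) over the unconstrained regime's ECS (0.645).

```python
def ethical_cooperation_score(cooperation, autonomy, integrity, fairness):
    """Unweighted product of the four factors (an assumed form of ECS)."""
    return cooperation * autonomy * integrity * fairness


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Reported ECS values from the abstract: CMAG 0.741, unconstrained 0.645.
gain = relative_improvement(0.741, 0.645)  # about 14.88, i.e. 14.9% after rounding
```

A product has the property the abstract emphasises: high raw cooperation cannot compensate for a low autonomy or fairness factor. Under this assumed form, a regime with cooperation 0.95 and autonomy 0.8 scores lower than one with cooperation 0.9 and autonomy 1.0, with the other factors held at 1.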

Learnability and Privacy Vulnerability are Entangled in a Few Critical Weights

Authors: Xingli Fang, Jung-Eun Kim

Venue: ICLR 2026

First: 2026-03-13T17:20:12+00:00 · Latest: 2026-03-13T17:20:12+00:00

Comments: ICLR 2026

Abstract

Prior approaches to membership privacy preservation usually update or retrain all weights in a neural network, which is costly and can lead to unnecessary utility loss or even more serious misalignment in predictions between training data and non-training data. In this work, we make three observations: i) privacy vulnerability is concentrated in a very small fraction of weights; ii) however, most of those weights also critically impact utility; iii) the importance of weights stems from their locations rather than their values. Based on these observations, to preserve privacy, we score critical weights and, instead of discarding the corresponding neurons, rewind only those weights for fine-tuning. Through extensive experiments, we show that this mechanism achieves superior resilience against Membership Inference Attacks in most cases while maintaining utility.

中文标题/摘要

标题：可学习性和隐私漏洞在少数关键权重中交织

以往用于成员隐私保护的方法通常会更新或重新训练神经网络中的所有权重，这既昂贵又可能导致不必要的性能损失，甚至在训练数据和非训练数据之间产生更严重的预测偏差。在本研究中，我们观察到三个见解：i) 隐私漏洞存在于权重的极小部分中；ii) 然而，大多数这些权重也对性能表现有关键影响；iii) 权重的重要性源自其位置而非其值。根据这些见解，为了保护隐私，我们对关键权重进行评分，并不是丢弃这些神经元，而是仅回滚权重进行微调。我们通过大量实验表明，这种机制在大多数情况下表现出色，能够抵御成员推断攻击，同时保持性能。

Summary / 总结

This work addresses the trade-off between learnability and privacy in neural networks by focusing on a small fraction of critical weights. The authors observed that privacy vulnerabilities are concentrated in a few weights that also significantly impact utility performance. By scoring these critical weights and selectively rewinding them for fine-tuning, the proposed method demonstrates superior resilience against Membership Inference Attacks while preserving utility.

该研究关注神经网络中一小部分关键权重的隐私与实用性权衡问题。作者发现，隐私漏洞集中在少数对实用性有重大影响的权重上。他们提出了一种方法来评估这些关键权重，并仅对它们进行微调，展示了在大多数情况下对成员推断攻击具有更强的抵抗力，同时保持了实用性。
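
The "score, then rewind" idea in this entry can be sketched in a few lines. The abstract does not state how weights are scored, what values they are rewound to, or which weights are updated afterwards. The sketch assumes that per-weight privacy scores are already available, that rewinding means restoring the initial (pre-training) value, and that only the rewound positions are then fine-tuned. The function name and the flat-list representation are hypothetical.

```python
import math


def rewind_critical_weights(weights, initial_weights, scores, fraction):
    """Rewind the highest-scoring fraction of weights to their initial values.

    Returns (new_weights, trainable_mask). Assumption: a higher score means a
    weight is more privacy-critical; the real scoring rule is not given here.
    """
    if not (len(weights) == len(initial_weights) == len(scores)):
        raise ValueError("all inputs must have the same length")
    k = max(1, math.ceil(fraction * len(weights)))
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    # Critical positions are kept (not pruned) but reset to their initial value.
    new_weights = [initial_weights[i] if i in top else w for i, w in enumerate(weights)]
    # Only the rewound positions are marked for subsequent fine-tuning.
    trainable_mask = [i in top for i in range(len(weights))]
    return new_weights, trainable_mask
```

Keeping the positions while resetting the values matches the abstract's third observation, that a weight's importance comes from its location rather than its value.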

Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos

Authors: Rohith Peddi, Saurabh, Shravan Shanmugam, Likhitha Pallapothula, Yu Xiang, Parag Singla, Vibhav Gogate

First: 2026-03-13T17:18:03+00:00 · Latest: 2026-03-13T17:18:03+00:00

Comments: https://github.com/rohithpeddi/WorldSGG

Abstract

Spatio-temporal scene graphs provide a principled representation for modeling evolving object interactions, yet existing methods remain fundamentally frame-centric: they reason only about currently visible objects, discard entities upon occlusion, and operate in 2D. To address this, we first introduce ActionGenome4D, a dataset that upgrades Action Genome videos into 4D scenes via feed-forward 3D reconstruction, world-frame oriented bounding boxes for every object involved in actions, and dense relationship annotations including for objects that are temporarily unobserved due to occlusion or camera motion. Building on this data, we formalize World Scene Graph Generation (WSGG), the task of constructing a world scene graph at each timestamp that encompasses all interacting objects in the scene, both observed and unobserved. We then propose three complementary methods, each exploring a different inductive bias for reasoning about unobserved objects: PWG (Persistent World Graph), which implements object permanence via a zero-order feature buffer; MWAE (Masked World Auto-Encoder), which reframes unobserved-object reasoning as masked completion with cross-view associative retrieval; and 4DST (4D Scene Transformer), which replaces the static buffer with differentiable per-object temporal attention enriched by 3D motion and camera-pose features. We further design and evaluate the performance of strong open-source Vision-Language Models on the WSGG task via a suite of Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. WSGG thus advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.

中文标题/摘要

标题：从单目视频生成时空世界场景图

时空场景图提供了一种原理性的表示方法，用于建模不断变化的对象交互，但现有方法仍然主要基于帧：它们仅考虑当前可见的对象，遮挡时丢弃实体，并在二维空间中操作。为解决这一问题，我们首先引入了ActionGenome4D数据集，该数据集通过前馈3D重建、面向世界框架的对象边界框以及密集的关系注释（包括由于遮挡或摄像机运动而暂时未观察到的对象之间的关系），将Action Genome视频升级为4D场景。基于此数据，我们定义了世界场景图生成（WSGG）任务，即在每个时间戳构建一个包含场景中所有交互对象（包括已观察和未观察的对象）的世界场景图。然后，我们提出了三种互补的方法，每种方法探索不同的归纳偏置来处理未观察到的对象：PWG（持久世界图），通过零阶特征缓冲区实现对象持久性；MWAE（掩码世界自编码器），将未观察到的对象推理重新定义为掩码完成与跨视图关联检索；4DST（4D场景变换器），用具有3D运动和摄像机姿态特征的可微分的逐对象时空注意力替换静态缓冲区。我们进一步设计并评估了强大的开源视觉-语言模型在WSGG任务上的性能，通过一系列基于图RAG的方法建立了未定位关系预测的基线。因此，WSGG推动了视频场景理解向以世界为中心、时间持久和可解释的场景推理方向发展。

Summary / 总结

The paper addresses the limitations of existing frame-centric spatio-temporal scene graph generation methods by introducing ActionGenome4D, a 4D dataset built from Action Genome videos with feed-forward 3D reconstruction, world-frame oriented bounding boxes, and dense relationship annotations that also cover temporarily unobserved objects. The authors formalize the World Scene Graph Generation (WSGG) task and propose three methods, PWG, MWAE, and 4DST, each with a different inductive bias for reasoning about unobserved objects. They also evaluate open-source Vision-Language Models on WSGG through Graph RAG-based approaches, establishing baselines for unlocalized relationship prediction. The work advances video scene understanding toward world-centric, temporally persistent, and interpretable scene reasoning.

研究旨在通过解决基于帧的方法的局限性，改进时空场景图生成。引入了ActionGenome4D数据集，该数据集包含3D重建和密集的关系注释。研究正式提出了世界场景图生成（WSGG）任务，并提出了三种方法：PWG、MWAE和4DST，每种方法都以不同的方式处理未观察到的对象。实验表明，这些方法优于现有基线，有助于更全面的场景理解。该工作推动了视频场景理解向以世界为中心、时间持久和可解释的场景推理框架的发展。
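
PWG is described as implementing object permanence via a zero-order feature buffer. A minimal reading of "zero-order" is a zero-order hold: when an object is not observed, its most recent feature is carried forward unchanged. The class below sketches that reading only. The class name, the dictionary-based interface, and the `observed` flag are assumptions, not the paper's implementation.

```python
class PersistentFeatureBuffer:
    """Zero-order-hold buffer: unobserved objects keep their last feature."""

    def __init__(self):
        self._last = {}  # object id -> most recently observed feature

    def update(self, observations):
        """Ingest one timestamp of observations and return the world state.

        `observations` maps object id -> feature for currently visible objects.
        The result maps every object seen so far to (feature, observed_now).
        """
        for object_id, feature in observations.items():
            self._last[object_id] = feature
        return {
            object_id: (feature, object_id in observations)
            for object_id, feature in self._last.items()
        }
```

If a cup is visible at the first timestamp and occluded at the second, the second call still returns the cup with its earlier feature and `observed_now` set to `False`. This is the frame-centric failure the abstract says existing methods exhibit when they discard entities upon occlusion.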

SpaceControl: Introducing Test-Time Spatial Control to 3D Generative Modeling

Authors: Elisabetta Fedele, Francis Engelmann, Ian Huang, Or Litany, Marc Pollefeys, Leonidas Guibas

First: 2025-12-05T00:54:48+00:00 · Latest: 2026-03-13T17:13:29+00:00

Comments: Project page: https://spacecontrol3d.github.io/

Abstract

Generative methods for 3D assets have recently achieved remarkable progress, yet providing intuitive and precise control over the object geometry remains a key challenge. Existing approaches predominantly rely on text or image prompts, which often fall short in geometric specificity: language can be ambiguous, and images are difficult to manipulate. In this work, we introduce SpaceControl, a training-free test-time method for explicit spatial control of 3D asset generation. Our approach accepts a wide range of geometric inputs, from coarse primitives to detailed meshes, and integrates seamlessly with modern generative models without requiring any additional training. A control parameter lets users trade off between geometric fidelity and output realism. Extensive quantitative evaluation and user studies demonstrate that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while preserving high visual quality. Finally, we present an interactive interface for real-time superquadric editing and direct 3D asset generation, enabling seamless use in creative workflows. Project page: https://spacecontrol3d.github.io/.

中文标题/摘要

标题：SpaceControl：引入测试时空间控制到3D生成建模

用于3D资产的生成方法最近取得了显著进展，但在提供对对象几何形状直观和精确控制方面仍面临关键挑战。现有方法主要依赖于文本或图像提示，这在几何精确性方面往往不够：语言可能含糊不清，而图像难以操作。在本工作中，我们引入了SpaceControl，这是一种无需训练的测试时方法，用于显式控制3D资产生成的空间。我们的方法接受从粗略的基本体到详细的网格等各种几何输入，并无缝集成到现代生成模型中，无需额外训练。一个控制参数让用户可以在几何保真度和输出现实性之间进行权衡。广泛的定量评估和用户研究证明，SpaceControl在几何保真度方面优于基于训练和基于优化的基线，同时保持高质量的视觉效果。最后，我们提供了一个交互式界面，用于实时超二次元体编辑和直接3D资产生成，使其能够无缝地用于创意工作流程中。项目页面：https://spacecontrol3d.github.io/

Summary / 总结

SpaceControl introduces a training-free test-time method for explicit spatial control in 3D generative modeling, accepting geometric inputs that range from coarse primitives to detailed meshes. The approach integrates with modern generative models without additional training and offers a control parameter to trade off geometric fidelity against output realism. Quantitative evaluation and user studies show that SpaceControl outperforms both training-based and optimization-based baselines in geometric faithfulness while maintaining high visual quality.

SpaceControl 提出了一种无需训练的测试时方法，用于 3D 生成建模中的显式空间控制。该方法接受多种几何输入，并且可以无缝集成到现代生成模型中，无需额外训练。该方法允许用户在几何保真度和输出真实感之间进行权衡。实验结果表明，SpaceControl 在几何保真度方面优于基于训练和基于优化的基线方法，同时保持高质量的视觉效果。还提供了一个用于实时超二次曲面（superquadric）编辑和直接 3D 资产生成的交互界面。

Perceive What Matters: Relevance-Driven Scheduling for Multimodal Streaming Perception

Authors: Dingcheng Huang, Xiaotong Zhang, Kamal Youcef-Toumi

Venue: ICRA 2026

First: 2026-03-13T17:11:20+00:00 · Latest: 2026-03-13T17:11:20+00:00

Comments: Accepted to ICRA 2026

Abs · PDF · Code1 · Code2

Abstract

In modern human-robot collaboration (HRC) applications, multiple perception modules jointly extract visual, auditory, and contextual cues to achieve comprehensive scene understanding, enabling the robot to provide appropriate assistance to human agents intelligently. While executing multiple perception modules on a frame-by-frame basis enhances perception quality in offline settings, it inevitably accumulates latency, leading to a substantial decline in system performance in streaming perception scenarios. Recent work in scene understanding, termed Relevance, has established a solid foundation for developing efficient methodologies in HRC. However, modern perception pipelines still face challenges related to information redundancy and suboptimal allocation of computational resources. Drawing inspiration from the Relevance concept and the information sparsity in HRC events, we propose a novel lightweight perception scheduling framework that efficiently leverages output from previous frames to estimate and schedule necessary perception modules in real-time based on scene context. The experimental results demonstrate that the proposed perception scheduling framework effectively reduces computational latency by up to 27.52% compared to conventional parallel perception pipelines, while also achieving a 72.73% improvement in MMPose activation recall. Additionally, the framework demonstrates high keyframe accuracy, achieving rates of up to 98%. The results validate the framework's capability to enhance real-time perception efficiency without significantly compromising accuracy. The framework shows potential as a scalable and systematic solution for multimodal streaming perception systems in HRC.

中文标题/摘要

标题：感知关键信息：多模态流式感知的关联驱动调度

在现代人机协作（HRC）应用中，多个感知模块联合提取视觉、听觉和上下文线索，以实现全面的场景理解，使机器人能够智能地为人类代理提供适当的协助。虽然在帧级执行多个感知模块在离线设置中提高了感知质量，但在流式感知场景中不可避免地累积了延迟，导致系统性能显著下降。场景理解领域的最新工作，称为关联性，为HRC开发高效方法奠定了坚实基础。然而，现代感知流水线仍然面临信息冗余和计算资源分配不优化的挑战。借鉴关联性概念和HRC事件中的信息稀疏性，我们提出了一种新的轻量级感知调度框架，该框架能够高效利用前一帧的输出，基于场景上下文实时估计和调度必要的感知模块。实验结果表明，与传统的并行感知流水线相比，所提出的感知调度框架可将计算延迟降低高达27.52%，同时MMPose激活召回率提高72.73%。此外，该框架展示了高关键帧准确性，达到98%。结果验证了该框架在不显著牺牲准确性的情况下提高实时感知效率的能力。该框架展示了在HRC中的多模态流式感知系统中具有可扩展性和系统性的潜力。

Summary / 总结

This paper addresses the challenge of reducing latency in real-time multimodal perception for human-robot collaboration (HRC) by proposing a relevance-driven scheduling framework. The method leverages previous frame outputs to estimate and schedule necessary perception modules in real-time based on scene context. Experimental results show a reduction in computational latency of up to 27.52% and a 72.73% improvement in MMPose activation recall compared to conventional parallel pipelines, with keyframe accuracy of up to 98%. The framework enhances real-time perception efficiency without significantly compromising accuracy.

论文提出了一种针对人机协作（HRC）应用中多模态流式感知的关联驱动调度框架。该框架利用上一帧的输出，基于场景上下文实时估计并调度必要的感知模块。与传统的并行感知流水线相比，该框架最多可将计算延迟降低27.52%，同时将MMPose激活召回率提高72.73%，并保持高达98%的关键帧准确性。这验证了该框架在不显著牺牲准确性的前提下提高实时感知效率的有效性。
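
The scheduling idea described above can be illustrated with a minimal sketch. The abstract does not specify the scheduling rule, so everything here is an assumption made for illustration: the module names, the cue names, the `RULES` table, and the keyframe policy are hypothetical and are not taken from the paper.

```python
from typing import Dict, Set

# Hypothetical relevance rules: module name -> cue that must have appeared
# in the previous frame's output for the module to be worth running.
RULES: Dict[str, str] = {
    "pose_estimation": "person",
    "speech_recognition": "voice_activity",
}

# Hypothetical lightweight module that runs on every frame.
ALWAYS_ON: Set[str] = {"object_detection"}


def schedule(previous_cues: Set[str], is_keyframe: bool) -> Set[str]:
    """Pick the perception modules to run on the current frame."""
    if is_keyframe:
        # Assumed policy: keyframes refresh every module.
        return ALWAYS_ON | set(RULES)
    # Otherwise run only modules whose triggering cue was seen last frame.
    relevant = {module for module, cue in RULES.items() if cue in previous_cues}
    return ALWAYS_ON | relevant


if __name__ == "__main__":
    print(sorted(schedule({"person"}, is_keyframe=False)))
    print(sorted(schedule(set(), is_keyframe=True)))
```

Skipping modules whose cue was absent in the previous frame is what saves computation in this toy version; the cost is a possible one-frame delay before a newly relevant module is activated.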

Semantic Invariance in Agentic AI

Authors: I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, Carlos T. Calafate

First: 2026-03-13T17:08:44+00:00 · Latest: 2026-03-13T17:08:44+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Language Models (LLMs) increasingly serve as autonomous reasoning agents in decision support, scientific problem-solving, and multi-agent coordination systems. However, deploying LLM agents in consequential applications requires assurance that their reasoning remains stable under semantically equivalent input variations, a property we term semantic invariance. Standard benchmark evaluations, which assess accuracy on fixed, canonical problem formulations, fail to capture this critical reliability dimension. To address this shortcoming, in this paper we present a metamorphic testing framework for systematically assessing the robustness of LLM reasoning agents, applying eight semantic-preserving transformations (identity, paraphrase, fact reordering, expansion, contraction, academic context, business context, and contrastive formulation) across seven foundation models spanning four distinct architectural families: Hermes (70B, 405B), Qwen3 (30B-A3B, 235B-A22B), DeepSeek-R1, and gpt-oss (20B, 120B). Our evaluation encompasses 19 multi-step reasoning problems across eight scientific domains. The results reveal that model scale does not predict robustness: the smaller Qwen3-30B-A3B achieves the highest stability (79.6% invariant responses, semantic similarity 0.91), while larger models exhibit greater fragility.

中文标题/摘要

标题：代理型AI的语义不变性

大型语言模型（LLMs）越来越多地作为自主推理代理，在决策支持、科学问题解决和多代理协调系统中发挥作用。然而，在关键应用中部署LLM代理需要确保其推理在语义等效输入变化下保持稳定，这一特性我们称为语义不变性。标准基准评估仅评估固定的标准问题表述的准确性，未能捕捉到这种关键的可靠性维度。为解决这一不足，本文提出了一种元形测试框架，系统评估LLM推理代理的鲁棒性，应用了八种语义保持变换（身份、同义词替换、事实重排序、扩展、收缩、学术背景、商业背景和对比表述）跨越七个基础模型，涵盖四种不同的架构家族：Hermes（70B，405B）、Qwen3（30B-A3B，235B-A22B）、DeepSeek-R1和gpt-oss（20B，120B）。评估涵盖了八个科学领域中的19个多步推理问题。结果表明，模型规模并不能预测鲁棒性：较小的Qwen3-30B-A3B实现了最高的稳定性（79.6%不变响应，语义相似度0.91），而较大的模型则表现出更大的脆弱性。

Summary / 总结

This paper addresses the need for semantic invariance in autonomous reasoning agents, particularly large language models (LLMs), by presenting a metamorphic testing framework. Eight semantic-preserving transformations were applied to seven foundation models across four architectural families, evaluating their robustness on 19 multi-step reasoning problems. Results indicate that model size does not correlate with robustness, with the smaller Qwen3-30B-A3B achieving the highest stability (79.6% invariant responses, semantic similarity 0.91).

本文探讨了自主推理代理（如LLM）在关键应用中的语义不变性需求，这是其可靠性的关键。作者提出了一种使用八种语义保持变换的元测试框架，并在七个基础模型上进行了评估。在八个科学领域的19个多步推理问题上，评估结果显示模型大小与鲁棒性无关，较小的Qwen3-30B-A3B模型表现出最高的稳定性，不变响应率为79.6%，语义相似度为0.91。
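
A metamorphic test of the kind described above can be sketched as follows. This is a toy harness under stated assumptions, not the authors' framework: `toy_agent` stands in for an LLM call, only three of the eight transformations are imitated, and exact string match replaces the semantic-similarity comparison mentioned in the abstract.

```python
from typing import Callable, Dict, List

Transform = Callable[[str], str]


def invariance_rate(
    agent: Callable[[str], str],
    problems: List[str],
    transforms: Dict[str, Transform],
    same_answer: Callable[[str, str], bool],
) -> Dict[str, float]:
    """Fraction of problems whose answer is unchanged under each transform."""
    rates: Dict[str, float] = {}
    for name, transform in transforms.items():
        unchanged = 0
        for problem in problems:
            baseline = agent(problem)
            variant = agent(transform(problem))
            if same_answer(baseline, variant):
                unchanged += 1
        rates[name] = unchanged / len(problems)
    return rates


def toy_agent(prompt: str) -> str:
    """Hypothetical agent: sums the integers in the prompt, but is
    deliberately brittle when a business framing is prepended."""
    if prompt.startswith("In a business setting"):
        return "unsure"
    tokens = prompt.replace(".", " ").replace(",", " ").split()
    return str(sum(int(tok) for tok in tokens if tok.isdigit()))


def reorder_facts(prompt: str) -> str:
    """Reverse the order of the sentences (facts) in the prompt."""
    sentences = [s.strip() for s in prompt.split(".") if s.strip()]
    return ". ".join(reversed(sentences)) + "."


TRANSFORMS: Dict[str, Transform] = {
    "identity": lambda p: p,
    "fact_reordering": reorder_facts,
    "business_context": lambda p: "In a business setting, " + p,
}

PROBLEMS = ["Alice has 3 apples. Bob has 4 apples."]

if __name__ == "__main__":
    print(invariance_rate(toy_agent, PROBLEMS, TRANSFORMS, lambda a, b: a == b))
```

The metamorphic relation being checked is that a meaning-preserving rewrite of the input should leave the answer unchanged, so no ground-truth label is needed to flag a failure.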

Developing and evaluating a chatbot to support maternal health care

Authors: Smriti Jha, Vidhi Jain, Jianyu Xu, Grace Liu, Sowmya Ramesh, Jitender Nagpal, Gretchen Chapman, Benjamin Bellows, Siddhartha Goyal, Aarti Singh, Bryan Wilder

Venue: IJCAI 2026

First: 2026-03-13T17:02:05+00:00 · Latest: 2026-03-13T17:02:05+00:00

Comments: 17 pages; submitted to IJCAI 2026 AI and Social Good Track

Abs · PDF · Code1 · Code2

Abstract

The ability to provide trustworthy maternal health information using phone-based chatbots can have a significant impact, particularly in low-resource settings where users have low health literacy and limited access to care. However, deploying such systems is technically challenging: user queries are short, underspecified, and code-mixed across languages, answers require regional context-specific grounding, and partial or missing symptom context makes safe routing decisions difficult. We present a chatbot for maternal health in India developed through a partnership between academic researchers, a health tech company, a public health nonprofit, and a hospital. The system combines (1) stage-aware triage, routing high-risk queries to expert templates, (2) hybrid retrieval over curated maternal/newborn guidelines, and (3) evidence-conditioned generation from an LLM. Our core contribution is an evaluation workflow for high-stakes deployment under limited expert supervision. Targeting both component-level and end-to-end testing, we introduce: (i) a labeled triage benchmark (N=150) achieving 86.7% emergency recall, explicitly reporting the missed-emergency vs. over-escalation trade-off; (ii) a synthetic multi-evidence retrieval benchmark (N=100) with chunk-level evidence labels; (iii) LLM-as-judge comparison on real queries (N=781) using clinician-codesigned criteria; and (iv) expert validation. Our findings show that trustworthy medical assistants in multilingual, noisy settings require defense-in-depth design paired with multi-method evaluation, rather than reliance on any single model or evaluation method.

中文标题/摘要

标题：开发和支持孕产妇保健聊天机器人的研究

通过手机聊天机器人提供可信的孕产妇健康信息可以在资源匮乏的地区产生重大影响，特别是在用户健康素养低且难以获得医疗服务的情况下。然而，部署此类系统在技术上具有挑战性：用户查询简短、不明确且跨语言混杂，答案需要特定区域背景，不完整或缺失的症状信息使安全路由决策变得困难。我们介绍了一个在印度开发的孕产妇健康聊天机器人，该系统由学术研究人员、健康科技公司、公共卫生非营利组织和医院合作开发。该系统结合了（1）阶段感知的分诊，将高风险查询路由到专家模板，（2）混合检索，涵盖孕产妇/新生儿指南，以及（3）基于LLM的证据条件生成。我们的核心贡献是在有限专家监督下进行高风险部署的评估流程。我们针对组件级和端到端测试引入了：（i）带有150个标注分诊基准，紧急召回率为86.7%，明确报告了漏报紧急情况与过度升级之间的权衡；（ii）带有片段级证据标签的合成多证据检索基准，数量为100；（iii）使用临床设计标准的LLM评判真实查询（数量为781）；以及（iv）专家验证。我们的研究结果表明，在多语言和嘈杂的环境中，值得信赖的医疗助手需要多层次设计与多方法评估相结合，而不仅仅是单一模型和评估方法的选择。

Summary / 总结

This study addresses the challenge of providing trustworthy maternal health information via phone-based chatbots in low-resource settings. The research developed a chatbot for India that uses stage-aware triage, hybrid retrieval, and evidence-conditioned generation. The evaluation includes a labeled triage benchmark, a synthetic retrieval benchmark, an LLM-as-judge comparison on real queries, and expert validation, demonstrating the need for a multi-method approach to ensure safety and effectiveness in multilingual, noisy environments.

研究旨在开发一个聊天机器人，在资源匮乏地区提供孕产妇健康信息，解决诸如用户查询简短、上下文不明确、语言混杂等问题。系统结合了阶段感知的分诊、混合检索和证据条件生成。关键发现包括开发了一个分诊基准，召回率为86.7%，一个合成检索基准，以及一个基于临床设计标准的LLM评估，表明在多语言、嘈杂环境中，信任的医疗助手需要多层次设计和多方法评估。
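
The missed-emergency versus over-escalation trade-off mentioned above can be made concrete with a small sketch. The scores, labels, and thresholds below are invented toy values, not data from the paper, and the threshold-based triage rule is an assumption used only to show how the two quantities move against each other.

```python
from typing import List, Tuple


def triage_metrics(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Return (emergency recall, over-escalation rate) at a risk threshold.

    labels: 1 = true emergency, 0 = non-emergency.
    A query is escalated when its risk score is >= threshold.
    """
    escalated = [score >= threshold for score in scores]
    emergencies = sum(labels)
    non_emergencies = len(labels) - emergencies
    caught = sum(1 for esc, y in zip(escalated, labels) if esc and y == 1)
    false_alarms = sum(1 for esc, y in zip(escalated, labels) if esc and y == 0)
    return caught / emergencies, false_alarms / non_emergencies


# Toy data (hypothetical): three emergencies followed by five routine queries.
SCORES = [0.9, 0.7, 0.4, 0.6, 0.45, 0.2, 0.1, 0.05]
LABELS = [1, 1, 1, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    for threshold in (0.5, 0.35):
        recall, over = triage_metrics(SCORES, LABELS, threshold)
        print(f"threshold={threshold}: recall={recall:.2f}, over-escalation={over:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.35 catches the third emergency but also escalates one more routine query, which is the trade-off a triage benchmark has to report explicitly.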

Towards Faithful Multimodal Concept Bottleneck Models

Authors: Pierre Moreau, Emeline Pineau Ferrand, Yann Choho, Benjamin Wong, Annabelle Blangero, Milan Bhan

First: 2026-03-13T16:56:08+00:00 · Latest: 2026-03-13T16:56:08+00:00

Abs · PDF · Code1 · Code2

Abstract

Concept Bottleneck Models (CBMs) are interpretable models that route predictions through a layer of human-interpretable concepts. While widely studied in vision and, more recently, in NLP, CBMs remain largely unexplored in multimodal settings. For their explanations to be faithful, CBMs must satisfy two conditions: concepts must be properly detected, and concept representations must encode only their intended semantics, without smuggling extraneous task-relevant or inter-concept information into final predictions, a phenomenon known as leakage. Existing approaches treat concept detection and leakage mitigation as separate problems, and typically improve one at the expense of predictive accuracy. In this work, we introduce f-CBM, a faithful multimodal CBM framework built on a vision-language backbone that jointly targets both aspects through two complementary strategies: a differentiable leakage loss to mitigate leakage, and a Kolmogorov-Arnold Network prediction head that provides sufficient expressiveness to improve concept detection. Experiments demonstrate that f-CBM achieves the best trade-off between task accuracy, concept detection, and leakage reduction, while applying seamlessly to both image and text or text-only datasets, making it versatile across modalities.

中文标题/摘要

标题：迈向忠实的多模态概念瓶颈模型

概念瓶颈模型（CBMs）是可解释的模型，通过一层人类可解释的概念来传递预测。尽管在视觉领域广泛研究，并且最近在自然语言处理（NLP）中也有所研究，但CBMs在多模态设置中仍然很少被探索。为了使CBMs的解释忠实，它们必须满足两个条件：概念必须被正确检测，概念表示必须仅包含其预期的语义，而不将任何额外的任务相关信息或概念间信息带入最终预测，这种现象称为泄漏。现有方法将概念检测和泄漏缓解视为两个独立的问题，并且通常在提高一个方面的同时会牺牲预测准确性。在本工作中，我们引入了f-CBM，这是一种基于视觉-语言骨干的忠实多模态CBM框架，通过两种互补策略同时针对这两个方面：可微泄漏损失以缓解泄漏，以及Kolmogorov-Arnold网络预测头以提供足够的表达能力以提高概念检测。实验表明，f-CBM在任务准确性、概念检测和泄漏减少之间实现了最佳权衡，同时可以无缝应用于图像和文本或纯文本数据集，使其在不同模态中具有通用性。

Summary / 总结

The research aims to develop faithful multimodal concept bottleneck models (f-CBMs) that address the limitations of existing approaches by jointly targeting concept detection and leakage mitigation. The method employs a vision-language backbone with a differentiable leakage loss to reduce leakage and a Kolmogorov-Arnold Network prediction head to enhance concept detection. Experiments show that f-CBM achieves the best balance between task accuracy, concept detection, and leakage reduction, and is versatile across image and text datasets.

研究旨在开发忠实的多模态概念瓶颈模型（f-CBM），通过将预测路由到可解释的概念中来提高模型的可解释性。方法结合了可微泄漏损失以减少泄漏，并使用柯尔莫哥洛夫-阿诺德网络预测头以增强概念检测。实验表明，f-CBM在任务准确性、概念检测和泄漏减少之间实现了最佳平衡，并且可以在图像和文本数据集上无缝应用，具有跨模态的灵活性。

DiT-IC: Aligned Diffusion Transformer for Efficient Image Compression

Authors: Junqi Shi, Ming Lu, Xingchen Li, Anle Ke, Ruiqi Zhang, Zhan Ma

First: 2026-03-13T16:56:00+00:00 · Latest: 2026-03-13T16:56:00+00:00

Abs · PDF · Code1 · Code2

Abstract

Diffusion-based image compression has recently shown outstanding perceptual fidelity, yet its practicality is hindered by prohibitive sampling overhead and high memory usage. Most existing diffusion codecs employ U-Net architectures, where hierarchical downsampling forces diffusion to operate in shallow latent spaces (typically with only 8x spatial downscaling), resulting in excessive computation. In contrast, conventional VAE-based codecs work in much deeper latent domains (16x - 64x downscaled), motivating a key question: Can diffusion operate effectively in such compact latent spaces without compromising reconstruction quality? To address this, we introduce DiT-IC, an Aligned Diffusion Transformer for Image Compression, which replaces the U-Net with a Diffusion Transformer capable of performing diffusion in latent space entirely at 32x downscaled resolution. DiT-IC adapts a pretrained text-to-image multi-step DiT into a single-step reconstruction model through three key alignment mechanisms: (1) a variance-guided reconstruction flow that adapts denoising strength to latent uncertainty for efficient reconstruction; (2) a self-distillation alignment that enforces consistency with encoder-defined latent geometry to enable one-step diffusion; and (3) a latent-conditioned guidance that replaces text prompts with semantically aligned latent conditions, enabling text-free inference. With these designs, DiT-IC achieves state-of-the-art perceptual quality while offering up to 30x faster decoding and drastically lower memory usage than existing diffusion-based codecs. Remarkably, it can reconstruct 2048x2048 images on a 16 GB laptop GPU.

中文标题/摘要

标题：DiT-IC：高效图像压缩的对齐扩散变换器

基于扩散的图像压缩最近展示了出色的感知保真度，但其实用性受到采样开销巨大和高内存使用率的阻碍。大多数现有的扩散编解码器采用U-Net架构，其中分层下采样迫使扩散在浅层潜在空间（通常只有8倍空间下采样）中操作，导致过多的计算。相比之下，传统的基于VAE的编解码器工作在更深的潜在域（16倍至64倍下采样），促使一个关键问题：扩散是否可以在如此紧凑的潜在空间中有效操作而不牺牲重建质量？为了解决这个问题，我们提出了DiT-IC，一种用于图像压缩的对齐扩散变换器，它用能够在32倍下采样分辨率的潜在空间中完全执行扩散的扩散变换器替代了U-Net。DiT-IC 通过三种关键对齐机制将预训练的多步文本到图像DiT 调整为单步重建模型：（1）一种基于方差的重建流程，根据潜在不确定性调整去噪强度以实现高效的重建；（2）一种自我蒸馏对齐，强制一致性以使潜在几何结构与编码器定义的潜在几何结构一致，从而实现一步扩散；（3）一种潜在条件指导，用语义对齐的潜在条件替换文本提示，实现无文本推理。通过这些设计，DiT-IC 达到了最先进的感知质量，同时提供了比现有基于扩散的编解码器高达30倍更快的解码速度和大幅降低的内存使用率。令人惊讶的是，它可以在16 GB笔记本GPU上重建2048x2048图像。

Summary / 总结

DiT-IC is an Aligned Diffusion Transformer for image compression that addresses the high computational cost of diffusion-based methods by operating at 32x downscaled resolution. It uses three alignment mechanisms: variance-guided reconstruction, self-distillation alignment, and latent-conditioned guidance. DiT-IC achieves state-of-the-art perceptual quality, up to 30x faster decoding, and lower memory usage compared to existing diffusion-based codecs, and can reconstruct 2048x2048 images on a 16 GB laptop GPU.

DiT-IC 是一种用于图像压缩的对齐扩散变换器，通过在32x下采样分辨率下操作来解决现有基于扩散的编解码器的高计算成本问题。它使用三种对齐机制：方差引导的重建流程、自我蒸馏对齐和潜在条件引导。DiT-IC 达到了最先进的感知质量，解码速度最高可提升30倍，并且内存使用量低于现有基于扩散的编解码器，能够在16 GB 的笔记本电脑 GPU 上重建2048x2048 的图像。
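
To see why the downscaling factor matters, the following sketch counts spatial latent positions for a square image. It is plain arithmetic on the figures quoted above (8x versus 32x downscaling, 2048x2048 images); the helper name `latent_positions` is hypothetical, and the actual token count of the model also depends on details such as patch size that the abstract does not state.

```python
def latent_positions(image_side: int, downscale: int) -> int:
    """Number of spatial positions in the latent grid of a square image."""
    if image_side % downscale != 0:
        raise ValueError("image_side must be divisible by downscale")
    side = image_side // downscale
    return side * side


if __name__ == "__main__":
    shallow = latent_positions(2048, 8)   # 256 x 256 grid
    deep = latent_positions(2048, 32)     # 64 x 64 grid
    print(shallow, deep, shallow // deep)
```

Going from 8x to 32x downscaling shrinks each side by a factor of 4, so the latent grid has 16 times fewer positions. For attention layers, whose cost grows roughly quadratically with sequence length, the saving is correspondingly larger.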

ESG-Bench: Benchmarking Long-Context ESG Reports for Hallucination Mitigation

Authors: Siqi Sun, Ben Peng Wu, Mali Jin, Peizhen Bai, Hanpei Zhang, Xingyi Song

Venue: AAAI 2026

First: 2026-03-13T16:48:05+00:00 · Latest: 2026-03-13T16:48:05+00:00

Comments: To be published in the AAAI 2026 proceedings

Abs · PDF · Code1 · Code2

Abstract

As corporate responsibility increasingly incorporates environmental, social, and governance (ESG) criteria, ESG reporting is becoming a legal requirement in many regions and a key channel for documenting sustainability practices and assessing firms' long-term and ethical performance. However, the length and complexity of ESG disclosures make them difficult to interpret and make reliable automated analysis hard to achieve. To support scalable and trustworthy analysis, this paper introduces ESG-Bench, a benchmark dataset for ESG report understanding and hallucination mitigation in large language models (LLMs). ESG-Bench contains human-annotated question-answer (QA) pairs grounded in real-world ESG report contexts, with fine-grained labels indicating whether model outputs are factually supported or hallucinated. Framing ESG report analysis as a QA task with verifiability constraints enables systematic evaluation of LLMs' ability to extract and reason over ESG content and provides a new use case: mitigating hallucinations in socially sensitive, compliance-critical settings. We design task-specific Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench using CoT-annotated rationales. Our experiments show that these CoT-based methods substantially outperform standard prompting and direct fine-tuning in reducing hallucinations, and that the gains transfer to existing QA benchmarks beyond the ESG domain.

中文标题/摘要

标题：ESG-Bench：评估长篇ESG报告以减轻幻觉

随着企业责任越来越多地纳入环境、社会和治理（ESG）标准，ESG报告在许多地区已成为一项法律要求，并成为记录可持续实践和评估公司长期和道德表现的关键渠道。然而，ESG披露的长度和复杂性使其难以解读，并可靠地自动化分析。为了支持可扩展和可信赖的分析，本文介绍了ESG-Bench，这是一个用于ESG报告理解和大型语言模型（LLM）幻觉减轻的基准数据集。ESG-Bench包含基于真实ESG报告背景的人工标注问答（QA）对，并细粒度地标注模型输出是否得到事实支持或幻觉。将ESG报告分析作为具有可验证性约束的问答任务，使我们能够系统地评估LLM提取和推理ESG内容的能力，并提供了一个新的应用场景：在社会敏感、合规关键的环境中减轻幻觉。我们设计了特定任务的思维链（CoT）提示策略，并在ESG-Bench上使用CoT标注的推理对多个最先进的LLM进行微调。我们的实验表明，这些基于CoT的方法在减少幻觉方面显著优于标准提示和直接微调，并且这些改进在ESG领域之外的现有问答基准中也有所体现。

Summary / 总结

This paper addresses the challenge of interpreting and automating the analysis of long and complex ESG reports, which are becoming increasingly important for corporate sustainability. ESG-Bench is introduced as a benchmark dataset for evaluating large language models (LLMs) in understanding and mitigating hallucinations in ESG reports. The dataset includes human-annotated question-answer pairs from real-world ESG reports, with labels indicating factual support. The authors use Chain-of-Thought (CoT) prompting strategies and fine-tune multiple state-of-the-art LLMs on ESG-Bench, demonstrating that these methods significantly reduce hallucinations compared to standard prompting and direct fine-tuning, with benefits extending to other QA benchmarks.

ESG-Bench 是一个基准数据集，用于评估大型语言模型理解ESG报告并减轻其中幻觉问题的能力；ESG报告对于企业的可持续性日益重要。该数据集包含来自真实ESG报告的人工标注问答对，并标注了模型输出是否有事实支持。研究设计了特定任务的思维链（Chain-of-Thought）提示策略，并在ESG-Bench上对多个最先进的LLM进行了微调。结果显示，这些方法在减少幻觉方面显著优于标准提示和直接微调，且其收益可迁移到ESG领域之外的其他问答基准。

Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque

Authors: Oscar Sainz, Naiara Perez, Julen Etxaniz, Joseba Fernandez de Landa, Itziar Aldabe, Iker García-Ferrero, Aimar Zabala, Ekhi Azurmendi, German Rigau, Eneko Agirre, Mikel Artetxe, Aitor Soroa

Venue: EMNLP 2025

First: 2025-06-09T09:54:47+00:00 · Latest: 2026-03-13T16:44:41+00:00

Comments: Accepted at EMNLP 2025 Main Conference

Abs · PDF · Code1 · Code2 · Code3

Abstract

Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. In this paper, we explore alternatives to conventional instruction adaptation pipelines in low-resource scenarios. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone. We present a comprehensive set of experiments for Basque that systematically study different combinations of these components, evaluated on benchmarks and on human preferences from 1,680 participants. Our conclusions show that target language corpora are essential, with synthetic instructions yielding robust models, and, most importantly, that using an instruction-tuned model as the backbone outperforms using a base non-instructed model. Scaling up to Llama 3.1 Instruct 70B as backbone, our model comes near frontier models of much larger sizes for Basque, without using any Basque instructions. We release code, models, instruction datasets, and human preferences to support full reproducibility in future research on low-resource language adaptation. https://github.com/hitz-zentroa/latxa-instruct

中文标题/摘要

标题：低资源语言的大型语言模型指导：关于巴斯克语的系统研究

使用用户意图指导语言模型需要大量指令数据集，但这些数据集仅对少数几种语言可用。本文探讨了在低资源场景中替代传统指令适应管道的方案。我们假设一种低资源语言的现实场景，其中仅可用以下资源：目标语言语料库、现有的开放权重多语言基础和指导性主干大语言模型，以及从指导性主干模型中合成生成的指令样本。我们为巴斯克语进行了全面的实验，系统地研究了这些组件的不同组合在基准测试和1,680名参与者的人类偏好评估中的表现。我们的结论表明，目标语言语料库是必不可少的，合成指令可以生成稳健的模型，最重要的是，使用指导调优模型作为主干优于使用未指导的基础模型。将主干扩展到Llama 3.1 Instruct 70B，我们的模型接近更大规模模型的前沿，而无需使用任何巴斯克语指令。我们发布了代码、模型、指令数据集和人类偏好，以支持未来低资源语言适应研究的完全可重复性。https://github.com/hitz-zentroa/latxa-instruct

Summary / 总结

This study investigates instructing large language models for low-resource languages, focusing on Basque. It explores the use of target language corpora, existing multilingual models, and synthetic instructions. Experiments evaluated on benchmarks and on human preferences from 1,680 participants show that target language corpora are crucial, synthetic instructions lead to robust models, and using an instruction-tuned backbone model outperforms a non-instructed base model. The model using Llama 3.1 Instruct 70B as backbone comes close to much larger frontier models for Basque without using any Basque instructions.

本文研究了在低资源语言场景下指导大型语言模型的方法，重点是巴斯克语。它探索了使用目标语言语料库、现有的多语言基模型和指令调优的骨干模型以及合成指令的替代传统指令适应管道的方法。1680名参与者的研究表明，目标语言语料库至关重要，合成指令能够生成稳健的模型，使用指令调优的骨干模型比使用非指令调优的基础模型效果更好。将模型扩展到Llama 3.1 Instruct 70B后，其性能接近更大规模模型的前沿，而无需使用任何巴斯克语指令。

Distributional Regression with Tabular Foundation Models: Evaluating Probabilistic Predictions via Proper Scoring Rules

Authors: Jonas Landsgesell, Pascal Knoll

First: 2026-03-09T10:38:01+00:00 · Latest: 2026-03-13T16:39:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet the benchmarks used to evaluate them (TabArena, TALENT, and others) still rely almost exclusively on point-estimate metrics (RMSE, $R^2$). This mismatch implicitly rewards models that elicit a good conditional mean while ignoring the quality of the predicted distribution. We make two contributions. First, we propose supplementing standard point metrics with proper scoring rules (CRPS, CRLS, and the Interval Score) and provide a head-to-head comparison of realTabPFNv2.5 and TabICLv2 with regard to some proper scoring rules across 20 OpenML regression datasets. Second, we show analytically and empirically that different proper scoring rules induce different model rankings and different inductive biases during training, even though each rule is individually minimized by the true distribution. Fine-tuning realTabPFNv2.5 with scoring rules not seen during pretraining (CRLS, $β=1.8$ energy score) yields consistent improvements on the corresponding metrics, confirming that the training loss shapes the model beyond what propriety alone guarantees. Together, these findings argue for (i) reporting distributional metrics in tabular regression benchmarks and (ii) making the training objective of foundation models adaptable (via fine-tuning or task-token conditioning) to the scoring rule relevant to the downstream decision problem.

中文标题/摘要

标题：基于表格基础模型的分布回归：通过合适的评分规则评估概率预测

诸如TabPFN和TabICL之类的表格基础模型已经生成了完整的预测分布，但用于评估它们的基准（TabArena、TALENT等）仍然几乎完全依赖于点估计指标（均方根误差、$R^2$）。这种不匹配隐式地奖励了那些产生良好条件均值但忽略预测分布质量的模型。我们做出了两项贡献。首先，我们建议用合适的评分规则（CRPS、CRLS和区间评分）补充标准的点估计指标，并在20个OpenML回归数据集上对真实TabPFNv2.5和TabICLv2在某些合适的评分规则下的表现进行了直接对比。其次，我们通过分析和实验证明，不同的合适的评分规则会诱导不同的模型排名和训练倾向，尽管每个规则都由真实分布最小化。通过预训练中未见过的评分规则（CRLS，$β=1.8$能量评分）微调真实TabPFNv2.5，可以一致地提高相应的指标，这证实了训练损失不仅由适当性保证，还塑造了模型。这些发现表明，（i）在表格回归基准中报告分布性指标，并且（ii）使基础模型的训练目标（通过微调或任务标记条件）适应下游决策问题相关的评分规则是必要的。

Summary / 总结

The paper addresses the issue of evaluating tabular foundation models like TabPFN and TabICL, which produce full predictive distributions, using benchmarks that focus on point-estimate metrics. It proposes using proper scoring rules such as CRPS, CRLS, and the Interval Score to evaluate model performance more comprehensively. The study compares realTabPFNv2.5 and TabICLv2 on 20 OpenML regression datasets and finds that different scoring rules can induce different model rankings and inductive biases, even though each rule is individually minimized by the true distribution. Fine-tuning realTabPFNv2.5 with scoring rules not seen during pretraining improves the corresponding metrics, suggesting that the training objective should be adaptable to the scoring rule relevant to the downstream task.

该研究针对评估像TabPFN和TabICL这样的表格式基础模型时存在的问题，这些模型能够生成完整的预测分布，但现有的基准测试仍主要依赖点估计指标。作者提出使用适当的评分规则（CRPS、CRLS和区间评分）来评估模型性能，并在20个OpenML回归数据集上比较了realTabPFNv2.5和TabICLv2的表现。研究发现，不同的评分规则可以导致不同的模型排名和归纳偏见。通过在预训练中未见过的评分规则对realTabPFNv2.5进行微调，可以改善相应的指标性能，这表明训练目标应该根据下游任务相关的评分规则进行调整。
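
Two of the scoring rules named above can be sketched with the standard textbook formulas. This is a minimal sample-based illustration, not the paper's evaluation code: the function names are hypothetical, the predictive distribution is assumed to be available as a list of samples, and the energy-score form used is E|X - y|^β - 0.5 E|X - X'|^β, which reduces to the usual sample estimate of CRPS at β = 1. CRLS is not sketched because the abstract does not define it.

```python
from typing import List


def energy_score(samples: List[float], y: float, beta: float = 1.0) -> float:
    """Sample estimate of the univariate energy score (lower is better).

    With beta=1.0 this is the usual sample-based CRPS estimate.
    The rule is proper for 0 < beta < 2.
    """
    n = len(samples)
    accuracy = sum(abs(x - y) ** beta for x in samples) / n
    spread = sum(abs(a - b) ** beta for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


def interval_score(lower: float, upper: float, y: float, alpha: float) -> float:
    """Interval score for a central (1 - alpha) prediction interval."""
    score = upper - lower                      # width: narrow intervals score better
    if y < lower:
        score += (2.0 / alpha) * (lower - y)   # penalty for missing below
    elif y > upper:
        score += (2.0 / alpha) * (y - upper)   # penalty for missing above
    return score


if __name__ == "__main__":
    forecast = [0.0, 1.0, 2.0]
    print(energy_score(forecast, y=1.0))            # CRPS estimate
    print(energy_score(forecast, y=1.0, beta=1.8))  # energy score with beta = 1.8
    print(interval_score(0.0, 2.0, y=3.0, alpha=0.1))
```

Two forecasts with the same mean but different spread receive identical RMSE yet different values under these rules, which is the mismatch between point metrics and distributional quality described above.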

RobotArena $\infty$: Scalable Robot Benchmarking via Real-to-Sim Translation

Authors: Yash Jangir, Yidi Zhang, Pang-Chi Lo, Kashu Yamazaki, Chenyu Zhang, Kuan-Hsun Tu, Tsung-Wei Ke, Lei Ke, Yonatan Bisk, Katerina Fragkiadaki

First: 2025-10-27T17:41:38+00:00 · Latest: 2026-03-13T16:29:34+00:00

Comments: Website: https://robotarenainf.github.io

Abs · PDF · Code1 · Code2 · Project1

Abstract

The pursuit of robot generalists, agents capable of performing diverse tasks across diverse environments, demands rigorous and scalable evaluation. Yet real-world testing of robot policies remains fundamentally constrained: it is labor-intensive, slow, unsafe at scale, and difficult to reproduce. As policies expand in scope and complexity, these barriers only intensify, since defining "success" in robotics often hinges on nuanced human judgments of execution quality. We introduce RobotArena Infinity, a new benchmarking framework that overcomes these challenges by shifting vision-language-action (VLA) evaluation into large-scale simulated environments augmented with online human feedback. Leveraging advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering, our approach automatically converts video demonstrations from widely used robot datasets into simulated counterparts. Within these digital twins, we assess VLA policies using both automated vision-language-model-guided scoring and scalable human preference judgments collected from crowdworkers, transforming human involvement from tedious scene setup, resetting, and safety supervision into lightweight preference comparisons. To measure robustness, we systematically perturb simulated environments along multiple axes, including textures and object placements, stress-testing policy generalization under controlled variation. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies, addressing a critical missing capability in today's robotics landscape.

中文标题/摘要

标题：RobotArena $\infty$：通过实到模拟转换实现可扩展的机器人基准测试

机器人通才，即能够在多种环境中执行多种任务的代理，的追求需要严格的可扩展评估。然而，机器人策略的现实世界测试仍然受到根本限制：它劳动密集型、速度慢、在大规模下不安全且难以重现。随着策略的范围和复杂性扩大，这些障碍只会加剧，因为机器人成功往往依赖于细微的人类判断。我们引入了RobotArena Infinity，这是一种新的基准测试框架，通过将视觉-语言-动作（VLA）评估转移到增强有人工反馈的大规模模拟环境中来克服这些挑战。利用视觉-语言模型、2D到3D生成建模和可微渲染的进展，我们的方法自动将广泛使用的机器人数据集中的视频演示转换为模拟对应物。在这些数字孪生中，我们使用自动化的视觉-语言模型指导评分和大规模的人类偏好判断来评估VLA策略，将人类参与从繁琐的场景设置、重置和安全监督转变为轻量级的偏好比较。为了衡量鲁棒性，我们系统地沿多个轴线对模拟环境进行扰动，包括纹理和物体放置，对策略在受控变化下的泛化能力进行压力测试。结果是一个不断演进、可重现且可扩展的基准测试，用于现实世界训练的机器人操作策略，填补了当今机器人领域的一项关键缺失能力。

Summary / 总结

RobotArena $\infty$ addresses the need for scalable evaluation of robot policies by converting real-world video demonstrations into simulated environments using advances in vision-language models, 2D-to-3D generative modeling, and differentiable rendering. It assesses vision-language-action policies through automated scoring and human preference judgments collected from crowdworkers, and evaluates policy robustness by systematically perturbing simulated environments. The result is a continuously evolving, reproducible, and scalable benchmark for real-world-trained robot manipulation policies.

RobotArena Infinity 通过将真实世界的视频演示转换为模拟环境，并结合在线的人类反馈来评估机器人通用性。该方法利用视觉语言模型和可微渲染技术来自动化生成模拟副本。关键发现包括系统地扰动模拟环境以测试策略的鲁棒性和泛化能力，从而形成一个可扩展且可重复的机器人操作策略基准。

When Right Meets Wrong: Bilateral Context Conditioning with Reward-Confidence Correction for GRPO

Authors: Yu Li, Tian Lan, Zhengling Qi

First: 2026-03-13T16:25:02+00:00 · Latest: 2026-03-13T16:25:02+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages relative to the group mean, GRPO treats each output as an independent sample during optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group. It thereby ignores the rich comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. To capitalize on this, we present a contrastive reformulation of GRPO, showing that the GRPO objective implicitly maximizes the margin between the policy ratios of correct and incorrect samples. Building on this insight, we propose Bilateral Context Conditioning (BICC), a mechanism that allows the model to cross-reference successful and failed reasoning traces during optimization, enabling a direct information flow across samples. We further introduce Reward-Confidence Correction (RCC) to stabilize training by dynamically adjusting the advantage baseline in GRPO using reward-confidence covariance derived from the first-order approximation of the variance-minimizing estimator. Both mechanisms require no additional sampling or auxiliary models and can be adapted to all GRPO variants. Experiments on mathematical reasoning benchmarks demonstrate consistent improvements across a comprehensive set of models and algorithms. Code is available at https://github.com/Skylanding/BiCC.

中文标题/摘要

标题：当正确遇到错误：基于奖励-信心校正的双边上下文条件对GRPO的改进

组相对策略优化（GRPO）已成为训练推理模型的有效方法。虽然它基于组均值计算优势，但在优化过程中将每个输出视为独立样本，忽略了同一组内正确与错误解之间自然对比这一重要结构信号，从而忽视了通过明确对比成功推理轨迹与失败轨迹所能利用的丰富比较数据。为利用这一信号，我们提出了GRPO的对比重构，表明GRPO目标隐式最大化了正确和错误样本策略比率之间的差距。基于这一洞察，我们提出了双边上下文条件（BICC）机制，允许模型在优化过程中交叉参考成功和失败的推理轨迹，实现样本间的直接信息流。我们还引入了奖励-信心校正（RCC），通过从一阶近似方差最小化估计器导出的奖励-信心协方差动态调整GRPO的优势基线，以稳定训练。这两种机制无需额外采样或辅助模型，并可适应所有GRPO变体。在数学推理基准测试上的实验表明，这两种机制在综合模型和算法中均能实现一致改进。代码可在https://github.com/Skylanding/BiCC 获取。

Summary / 总结

The research aims to enhance Group Relative Policy Optimization (GRPO) by leveraging the contrast between correct and incorrect reasoning traces. The method introduces Bilateral Context Conditioning (BICC) to allow models to cross-reference successful and failed reasoning traces during optimization, and Reward-Confidence Correction (RCC) to stabilize training. Experiments show consistent improvements across various mathematical reasoning benchmarks.

研究旨在通过利用正确和错误推理痕迹之间的对比来增强Group Relative Policy Optimization (GRPO)。方法引入了Bilateral Context Conditioning (BICC)，使其在优化过程中能够交叉参考成功的和失败的推理痕迹，并引入了Reward-Confidence Correction (RCC)来稳定训练。实验结果显示，在各种模型和算法上，特别是在数学推理基准测试中，均取得了持续改进。
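
For context, the group-mean baseline that the abstract refers to can be sketched as below. This shows only the commonly described GRPO advantage normalization (reward minus group mean, divided by group standard deviation); it is not an implementation of BICC or RCC, whose details are not given here. The function name and the `eps` stabilizer are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Advantage of each sampled output relative to its own group.

    rewards: one scalar reward per output sampled for the same prompt.
    eps: small constant (an assumption) that avoids division by zero
         when every output in the group receives the same reward.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(reward - baseline) / scale for reward in rewards]


if __name__ == "__main__":
    # Two correct (reward 1) and two incorrect (reward 0) solutions.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group with no contrast yields zero advantages everywhere.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Correct and incorrect samples end up with advantages of opposite sign, but each sample is still scored only against a scalar group statistic. The abstract's point is that the content of the opposing traces is never shown to the model, which is the gap BICC is described as addressing.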

Steve-Evolving: Open-World Embodied Self-Evolution via Fine-Grained Diagnosis and Dual-Track Knowledge Distillation

Authors: Zhengwei Xie, Zhisheng Chen, Ziyan Weng, Tingyu Wu, Chenglong Li, Vireo Zhang, Kun Wang

First: 2026-03-13T16:23:34+00:00 · Latest: 2026-03-13T16:23:34+00:00

Abs · PDF · Code1 · Code2

Abstract

Open-world embodied agents must solve long-horizon tasks where the main bottleneck is not single-step planning quality but how interaction experience is organized and evolved. To this end, we present Steve-Evolving, a non-parametric self-evolving framework that tightly couples fine-grained execution diagnosis with dual-track knowledge distillation in a closed loop. The method follows three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. In detail, Experience Anchoring solidifies each subgoal attempt into a structured experience tuple with a fixed schema (pre-state, action, diagnosis-result, and post-state) and organizes it in a three-tier experience space with multi-dimensional indices (e.g., condition signatures, spatial hashing, and semantic tags) plus rolling summarization for efficient and auditable recall. To ensure sufficient information density for attribution, the execution layer provides compositional diagnosis signals beyond binary outcomes, including state-difference summaries, enumerated failure causes, continuous indicators, and stagnation/loop detection. In Experience Distillation, successful trajectories are generalized into reusable skills with explicit preconditions and verification criteria, while failures are distilled into executable guardrails that capture root causes and forbid risky operations at both subgoal and task granularities. In Knowledge-Driven Closed-Loop Control, retrieved skills and guardrails are injected into an LLM planner, and diagnosis-triggered local replanning updates the active constraints online, forming a continual evolution process without any model parameter updates. Experiments on the long-horizon suite of Minecraft MCU demonstrate consistent improvements over static-retrieval baselines.

中文标题/摘要

标题：Steve-Evolving: 开放世界具身自我进化框架通过细粒度执行诊断和双轨知识蒸馏

开放世界具身智能体必须解决长期任务，其中主要瓶颈不是单步规划质量，而是交互经验的组织和进化。为此，我们提出了Steve-Evolving，一种非参数化的自我进化框架，该框架将细粒度执行诊断与双轨知识蒸馏紧密耦合在一个闭环中。该方法分为三个阶段：经验锚定、经验蒸馏和知识驱动的闭环控制。具体而言，经验锚定将每个子目标尝试固化为具有固定模式的结构化经验元组（前状态、动作、诊断结果和后状态），并将其组织在具有多维索引（例如，条件签名、空间哈希和语义标签）的三层经验空间中，加上滚动总结，以实现高效和可审计的检索。为了确保足够的信息密度以进行归因，执行层提供了超出二元结果的组合诊断信号，包括状态差异总结、列举的失败原因、连续指标以及停滞/循环检测。此外，经验蒸馏成功的轨迹被泛化为可重用的技能，带有明确的前提条件和验证标准，而失败则被蒸馏为可执行的护栏，捕捉根本原因并禁止子目标和任务粒度上的风险操作。此外，知识驱动的闭环控制检索技能和护栏，并将其注入LLM规划器中，诊断触发的局部重规划在线更新活动约束，形成一个持续进化过程，而无需任何模型参数更新。在Minecraft MCU的长期任务套件上的实验表明，与静态检索基线相比，该方法具有一致的改进。

Summary / 总结

Steve-Evolving is a non-parametric self-evolving framework for open-world embodied agents that addresses the challenge of organizing and evolving interaction experience through fine-grained execution diagnosis and dual-track knowledge distillation. It consists of three phases: Experience Anchoring, Experience Distillation, and Knowledge-Driven Closed-Loop Control. The framework solidifies each subgoal attempt into structured experience tuples and organizes them in a three-tier space for efficient recall. It also provides compositional diagnosis signals and generalizes successful trajectories into reusable skills while distilling failures into executable guardrails. The final phase injects these skills and guardrails into an LLM planner for continual evolution. Experiments show consistent improvements over static-retrieval baselines in long-horizon tasks in Minecraft MCU.

Steve-Evolving 是一个非参数自演化框架，旨在通过细粒度执行诊断和双轨知识蒸馏来解决开放世界体态代理组织和演化交互经验的挑战。该框架包含三个阶段：经验锚定、经验蒸馏和知识驱动的闭环控制。它将每个子目标尝试固化为结构化经验元组，并组织在一个三层空间中以实现高效召回。此外，它还提供组成诊断信号，并将成功的轨迹泛化为可重用技能，同时将失败蒸馏为可执行的护栏。最终阶段将这些技能和护栏注入到LLM规划器中，以实现持续演化。实验结果表明，在Minecraft MCU的长期任务中，该方法比静态检索基线表现出持续改进。

Developing the PsyCogMetrics AI Lab to Evaluate Large Language Models and Advance Cognitive Science -- A Three-Cycle Action Design Science Study

Authors: Zhiye Jin, Yibai Li, K. D. Joshi, Xuefei Deng, Xiaobing Li

Venue: Proceedings of the 59th Hawaii International Conference on System Sciences (HICSS), January 2026, pp. 6952-6961

First: 2026-03-13T16:17:45+00:00 · Latest: 2026-03-13T16:17:45+00:00

Comments: 10 pages. Prepared: April 2025; submitted: June 15, 2025; accepted: August 2025. In: Proceedings of the 59th Hawaii International Conference on System Sciences (HICSS 2026), January 2026

Abs · PDF · Code1 · Code2

Abstract

This study presents the development of the PsyCogMetrics AI Lab (psycogmetrics.ai), an integrated, cloud-based platform that operationalizes psychometric and cognitive-science methodologies for Large Language Model (LLM) evaluation. Framed as a three-cycle Action Design Science study, the Relevance Cycle identifies key limitations in current evaluation methods and unfulfilled stakeholder needs. The Rigor Cycle draws on kernel theories such as Popperian falsifiability, Classical Test Theory, and Cognitive Load Theory to derive deductive design objectives. The Design Cycle operationalizes these objectives through nested Build-Intervene-Evaluate loops. The study contributes a novel IT artifact, a validated design for LLM evaluation, benefiting research at the intersection of AI, psychology, cognitive science, and the social and behavioral sciences.
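
As one concrete instance of the Classical Test Theory mentioned above, the sketch below computes Cronbach's alpha, a standard internal-consistency reliability coefficient. Whether the PsyCogMetrics platform reports this particular statistic is an assumption; the abstract does not say. Treating repeated LLM runs as "respondents" is likewise only an illustrative framing.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    scores: list of rows, one per respondent (or model run), each a list
    of k item scores. Uses sample variances throughout.
    """
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least 2 respondents and 2 items")
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

When every item ranks respondents identically, the item variances account for only a fraction 1/k of the total-score variance and alpha reaches 1.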

中文标题/摘要

标题：开发PsyCogMetrics AI实验室以评估大型语言模型并推进认知科学——一项三轮行动设计科学研究

本研究介绍了PsyCogMetrics AI实验室（psycogmetrics.ai）的开发，这是一个集成的、基于云的平台，将心理测量学和认知科学方法应用于大型语言模型（LLM）评估。本研究作为三轮行动设计科学研究进行，相关性循环识别当前评估方法的关键局限性和未满足的利益相关者需求。严谨性循环借鉴了如波普尔可证伪性、经典测试理论和认知负荷理论等核心理论，推导出演绎设计目标。设计循环通过嵌套的构建-干预-评估循环将这些目标具体化。本研究贡献了一个新颖的IT工具，一个验证过的LLM评估设计，为人工智能、心理学、认知科学以及社会和行为科学的交叉研究提供益处。

LLMs Can Infer Political Alignment from Online Conversations

Authors: Byunghwee Lee, Sangyeon Kim, Filippo Menczer, Yong-Yeol Ahn, Haewoon Kwak, Jisun An

First: 2026-03-11T19:26:04+00:00 · Latest: 2026-03-13T16:15:23+00:00

Comments: 56 pages; 4 figures in the main text and 18 supplementary figures, 11 supplementary tables

Abs · PDF · Code1 · Code2

Abstract

Due to the correlational structure in our traits such as identities, cultures, and political attitudes, seemingly innocuous preferences like following a band or using a specific slang can reveal private traits. This possibility, especially when combined with massive, public social data and advanced computational methods, poses a fundamental privacy risk. As our data exposure online and the rapid advancement of AI are increasing the risk of misuse, it is critical to understand the capacity of large language models (LLMs) to exploit such potential. Here, using online discussions on DebateOrg and Reddit, we show that LLMs can reliably infer hidden political alignment, significantly outperforming traditional machine learning models. Prediction accuracy further improves as we aggregate multiple text-level inferences into a user-level prediction, and as we use more politics-adjacent domains. We demonstrate that LLMs leverage words that are highly predictive of political alignment while not being explicitly political. Our findings underscore the capacity and risks of LLMs for exploiting socio-cultural correlates.
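
The step of aggregating text-level inferences into a user-level prediction could look like the sketch below. It averages per-text log-odds for each user; this pooling rule, the function name `aggregate_user_level`, and the `"left"`/`"right"` labels are assumptions for illustration, since the abstract does not specify how the aggregation is done.

```python
import math
from collections import defaultdict


def aggregate_user_level(text_preds):
    """Pool text-level probabilities into one label per user.

    text_preds: iterable of (user_id, p_left), where p_left is the
    text-level probability of a left-leaning alignment.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for user, p in text_preds:
        p = min(max(p, 1e-6), 1 - 1e-6)       # avoid log(0)
        sums[user] += math.log(p / (1 - p))   # log-odds of this text
        counts[user] += 1
    return {
        user: "left" if sums[user] / counts[user] > 0 else "right"
        for user in sums
    }
```

Averaging in log-odds space lets a few confident texts outweigh several near-chance ones, which a simple majority vote would not do.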

中文标题/摘要

标题：大语言模型可以从在线对话中推断出政治倾向

由于我们的特质如身份、文化以及政治态度之间存在相关结构，看似无害的偏好如跟随某个乐队或使用特定俚语可以揭示出私人特质。当这种可能性与大量公开的社会数据和先进的计算方法结合时，会构成根本性的隐私风险。随着我们在线数据的暴露增加以及人工智能的快速发展，理解大语言模型（LLMs）利用这种潜在性的能力变得至关重要。在这里，我们使用DebateOrg和Reddit上的在线讨论，展示了LLMs可以可靠地推断出隐藏的政治倾向，显著优于传统的机器学习模型。预测准确性随着我们从文本层面的推断聚合到用户层面的预测，以及使用更多与政治相关的领域而提高。我们证明LLMs利用了高度预测政治倾向的词汇，但这些词汇并不是明确的政治词汇。我们的研究结果强调了LLMs利用社会文化相关性的能力和风险。

Summary / 总结

The research explores the ability of large language models (LLMs) to infer political alignment from online conversations, highlighting the privacy risks associated with the correlational structure of personal traits. Using online discussions from DebateOrg and Reddit, the study shows that LLMs can reliably predict political alignment better than traditional machine learning models, with accuracy improving when aggregating multiple inferences and using more politics-related data. The findings indicate that LLMs use non-political words to make these predictions, underscoring the potential risks of exploiting socio-cultural correlates through advanced computational methods.

研究探讨了大型语言模型（LLMs）从在线对话中推断政治倾向的能力，突显了个人特质相关性结构带来的隐私风险。通过来自DebateOrg和Reddit的在线讨论，研究显示LLMs在预测政治倾向方面比传统机器学习模型表现更好，准确性在聚合多个推断和使用更多政治相关数据时提高。研究结果表明，LLMs使用非政治词汇进行这些预测，强调了通过先进计算方法利用社会文化相关性的潜在风险。

Geometry-Guided Camera Motion Understanding in VideoLLMs

Authors: Haoan Feng, Sri Harsha Musunuri, Guan-Ming Su

First: 2026-03-13T16:13:09+00:00 · Latest: 2026-03-13T16:13:09+00:00

Comments: 10 pages, 7 figures, supplementary included

Abs · PDF · Code1 · Code2

Abstract

Camera motion is a fundamental geometric signal that shapes visual perception and cinematic style, yet current video-capable vision-language models (VideoLLMs) rarely represent it explicitly and often fail on fine-grained motion primitives. We address this gap with a framework of $\textbf{benchmarking}$, $\textbf{diagnosis}$, and $\textbf{injection}$. We curate $\textbf{CameraMotionDataset}$, a large-scale synthetic dataset with explicit camera control, formulate camera motion as constraint-aware multi-label recognition, and construct a VQA benchmark--$\textbf{CameraMotionVQA}$. Across diverse off-the-shelf VideoLLMs, we observe substantial errors in recognizing camera motion primitives. Probing experiments on a Qwen2.5-VL vision encoder suggest that camera motion cues are weakly represented, especially in deeper ViT blocks, helping explain the observed failure modes. To bridge this gap without costly training or fine-tuning, we propose a lightweight, model-agnostic pipeline that extracts geometric camera cues from 3D foundation models (3DFMs), predicts constrained motion primitives with a temporal classifier, and injects them into downstream VideoLLM inference via structured prompting. Experiments demonstrate improved motion recognition and more camera-aware model responses, highlighting geometry-driven cue extraction and structured prompting as practical steps toward a camera-aware VideoLLM and VLA system. The dataset and benchmark are publicly available at https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark.
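
The last two pipeline stages, constrained multi-label prediction and injection via structured prompting, can be sketched as follows. The primitive names, the mutual-exclusion pairs in `EXCLUSIVE`, the 0.5 threshold, and the prompt template are all hypothetical; the paper's actual label set and prompt format are not given in the abstract.

```python
# Hypothetical pairs of primitives that cannot both be active at once.
EXCLUSIVE = [
    ("pan_left", "pan_right"),
    ("tilt_up", "tilt_down"),
    ("zoom_in", "zoom_out"),
]


def constrained_labels(scores, threshold=0.5):
    """Threshold per-primitive scores, then drop the weaker label of any exclusive pair."""
    keep = {name for name, s in scores.items() if s >= threshold}
    for a, b in EXCLUSIVE:
        if a in keep and b in keep:
            keep.discard(a if scores[a] < scores[b] else b)
    return sorted(keep)


def build_prompt(question, labels):
    """Prepend the predicted motion cues to the user question as a structured header."""
    cue = ", ".join(labels) if labels else "static"
    return f"[Camera motion cues: {cue}]\n{question}"
```

Because the cues travel in the prompt text, the downstream VideoLLM needs no retraining, which matches the training-free framing of the abstract.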

中文标题/摘要

标题：视频LLMs中的摄像机运动几何引导理解

摄像机运动是塑造视觉感知和电影风格的基本几何信号，但当前的视频能力视觉语言模型（VideoLLMs）很少明确表示它，并且经常在细微的运动原语上失败。我们通过一个框架来解决这一差距，该框架包括基准测试、诊断和注入。我们编纂了一个名为$\textbf{CameraMotionDataset}$的大规模合成数据集，其中包含明确的摄像机控制，将摄像机运动形式化为约束感知的多标签识别，并构建了一个VQA基准——$\textbf{CameraMotionVQA}$。在多种现成的VideoLLMs中，我们观察到在识别摄像机运动原语方面存在大量错误。对Qwen2.5-VL视觉编码器的探针实验表明，摄像机运动提示在视觉编码器中弱表示，尤其是在更深层次的ViT块中，这有助于解释观察到的失败模式。为了在不进行昂贵的训练或微调的情况下弥合这一差距，我们提出了一种轻量级、模型无关的管道，该管道从3D基础模型（3DFMs）中提取几何摄像机提示，使用时间分类器预测受限运动原语，并通过结构化提示将它们注入下游VideoLLM推理中。实验表明，运动识别得到改善，模型响应更加摄像机意识，突出了几何驱动提示提取和结构化提示作为朝着摄像机意识VideoLLM和VLA系统实践步骤的重要性。数据集和基准可在https://hf.co/datasets/fengyee/camera-motion-dataset-and-benchmark/公开获取。

Summary / 总结

The paper addresses the lack of explicit representation of camera motion in VideoLLMs by introducing a framework of benchmarking, diagnosis, and injection. It curates the CameraMotionDataset, formulates camera motion as constraint-aware multi-label recognition, and constructs the CameraMotionVQA benchmark. Experiments show that VideoLLMs have significant errors in recognizing camera motion primitives, and a lightweight, model-agnostic pipeline is proposed to extract geometric camera cues and inject them into VideoLLMs via structured prompting, improving motion recognition and model responses. The dataset and benchmark are publicly available.

论文通过引入基准测试、诊断和注入的框架，解决了当前VideoLLMs中缺乏显式摄像机运动表示的问题。它构建了一个大规模合成数据集CameraMotionDataset，并构造了一个基于VQA的基准测试CameraMotionVQA来评估摄像机运动识别。研究发现，VideoLLMs在识别摄像机运动基本元素时存在显著错误，并指出摄像机运动线索在更深的ViT块中表示较弱。为了改进这一点，作者提出了一种轻量级、模型无关的管道，从3D基础模型中提取几何摄像机线索，使用时间分类器预测约束运动基本元素，并通过结构化提示将它们注入到下游VideoLLM推理中，从而提高了运动识别并使模型更具摄像机意识。

AdaBoN: Adaptive Best-of-N Alignment

Authors: Vinod Raman, Hilal Asi, Satyen Kale

First: 2025-05-17T15:24:48+00:00 · Latest: 2026-03-13T16:07:57+00:00

Comments: 25 pages

Abs · PDF · Code1 · Code2

Abstract

Recent advances in test-time alignment methods, such as Best-of-N sampling, offer a simple and effective way to steer language models (LMs) toward preferred behaviors using reward models (RM). However, these approaches can be computationally expensive, especially when applied uniformly across prompts without accounting for differences in alignment difficulty. In this work, we propose a prompt-adaptive strategy for Best-of-N alignment that allocates inference-time compute more efficiently. Motivated by latency concerns, we develop a two-stage algorithm: an initial exploratory phase estimates the reward distribution for each prompt using a small exploration budget, and a second stage adaptively allocates the remaining budget using these estimates. Our method is simple, practical, and compatible with any LM-RM combination. Empirical results on prompts from the AlpacaEval, HH-RLHF, and PKU-SafeRLHF datasets for 12 LM/RM pairs and 50 different batches of prompts show that our adaptive strategy outperforms the uniform allocation with the same inference budget. Moreover, we show that our adaptive strategy remains competitive against uniform allocations with 20 percent larger inference budgets and improves in performance as the batch size grows.
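
The two-stage structure described above can be sketched as below. The abstract does not state the allocation rule, so the second stage here uses a stand-in heuristic (extra samples in proportion to the spread of the explored rewards); it is not the paper's estimator. The function name, the `sample_reward` callback, and the default exploration budget are likewise assumptions.

```python
from statistics import pstdev


def adaptive_best_of_n(prompts, sample_reward, total_budget, explore_per_prompt=4):
    """Two-stage Best-of-N over a batch of prompts.

    sample_reward(prompt) draws one response and returns its reward-model score.
    Returns (best reward per prompt, number of samples spent per prompt).
    """
    if total_budget < explore_per_prompt * len(prompts):
        raise ValueError("budget too small for the exploration stage")

    # Stage 1: spend a small, equal exploration budget on every prompt.
    rewards = {p: [sample_reward(p) for _ in range(explore_per_prompt)] for p in prompts}
    remaining = total_budget - explore_per_prompt * len(prompts)

    # Stage 2 (heuristic stand-in): prompts whose explored rewards vary more
    # receive a larger share of the remaining budget.
    spread = {p: pstdev(rewards[p]) for p in prompts}
    total_spread = sum(spread.values())
    for p in prompts:
        share = spread[p] / total_spread if total_spread > 0 else 1 / len(prompts)
        extra = int(remaining * share)  # floor; any leftover budget goes unused
        rewards[p].extend(sample_reward(p) for _ in range(extra))

    best = {p: max(r) for p, r in rewards.items()}
    spent = {p: len(r) for p, r in rewards.items()}
    return best, spent
```

A prompt whose exploratory rewards are all identical gets no extra samples under this heuristic, while a high-variance prompt absorbs the rest of the budget.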

中文标题/摘要

标题：AdaBoN: 自适应最佳-of-N对齐

最近在测试时对齐方法方面的进展，例如最佳-of-N采样，提供了一种简单而有效的方法，通过奖励模型（RM）引导语言模型（LMs）向期望的行为靠拢。然而，这些方法在应用时可能会非常耗计算资源，尤其是在没有考虑提示之间对齐难度差异的情况下，进行均匀应用时尤为如此。在本文中，我们提出了一种针对最佳-of-N对齐的提示自适应策略，以更有效地分配推理时间计算资源。受延迟问题的启发，我们开发了一种两阶段算法：初始探索阶段使用少量的探索预算估计每个提示的奖励分布，第二阶段则根据这些估计值自适应地分配剩余预算。我们的方法简单、实用，并且与任何LM-RM组合兼容。在AlpacaEval、HH-RLHF和PKU-SafeRLHF数据集上的12个LM/RM对和50个不同批次的提示的实验结果表明，我们的自适应策略在相同的推理预算下优于均匀分配。此外，我们还展示了我们的自适应策略在推理预算大20%的情况下仍具有竞争力，并且随着批次大小的增加，性能有所提升。

Summary / 总结

This paper proposes AdaBoN, an adaptive Best-of-N alignment method that allocates inference-time compute more efficiently by using a two-stage algorithm. The initial stage estimates the reward distribution for each prompt with a small exploration budget, and the second stage adaptively allocates the remaining budget based on these estimates. Experiments on prompts from AlpacaEval, HH-RLHF, and PKU-SafeRLHF across 12 LM/RM pairs show that AdaBoN outperforms uniform allocation under the same inference budget, remains competitive against uniform allocation with a 20% larger budget, and improves as the batch size grows.

本文提出AdaBoN，一种适应性策略，通过根据提示的对齐难度分配推理时间计算，解决了均匀Best-of-N对齐方法的计算效率问题。该方法包括一个探索阶段来估计奖励分布和一个自适应阶段来分配剩余预算。实验结果表明，AdaBoN在相同预算下优于均匀分配，并且在预算增加20%的情况下仍具有竞争力，尤其是在批次大小增加时表现出色。

Panoramic Multimodal Semantic Occupancy Prediction for Quadruped Robots

Authors: Guoqiang Zhao, Zhe Yang, Sheng Wu, Fei Teng, Mengfei Duan, Yuanfan Zheng, Kai Luo, Kailun Yang

First: 2026-03-13T16:04:33+00:00 · Latest: 2026-03-13T16:04:33+00:00

Comments: The dataset and code will be publicly released at https://github.com/SXDR/PanoMMOcc

Abs · PDF · Code1 · Code2 · Code3

Abstract

Panoramic imagery provides holistic 360° visual coverage for perception in quadruped robots. However, existing occupancy prediction methods are mainly designed for wheeled autonomous driving and rely heavily on RGB cues, limiting their robustness in complex environments. To bridge this gap, (1) we present PanoMMOcc, the first real-world panoramic multimodal occupancy dataset for quadruped robots, featuring four sensing modalities across diverse scenes. (2) We propose a panoramic multimodal occupancy perception framework, VoxelHound, tailored for legged mobility and spherical imaging. Specifically, we design (i) a Vertical Jitter Compensation (VJC) module to mitigate severe viewpoint perturbations caused by body pitch and roll during mobility, enabling more consistent spatial reasoning, and (ii) an effective Multimodal Information Prompt Fusion (MIPF) module that jointly leverages panoramic visual cues and auxiliary modalities to enhance volumetric occupancy prediction. (3) We establish a benchmark based on PanoMMOcc and provide detailed data analysis to enable systematic evaluation of perception methods under challenging embodied scenarios. Extensive experiments demonstrate that VoxelHound achieves state-of-the-art performance on PanoMMOcc (+4.16% in mIoU). The dataset and code will be publicly released to facilitate future research on panoramic multimodal 3D perception for embodied robotic systems at https://github.com/SXDR/PanoMMOcc, along with the calibration tools released at https://github.com/losehu/CameraLiDAR-Calib.
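
For readers unfamiliar with the reported metric, a minimal sketch of mean Intersection-over-Union (mIoU) over a flattened voxel grid follows. The `ignore_index` value of 255 and the choice to skip classes absent from both prediction and ground truth are common conventions assumed here, not details confirmed by the abstract.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """mIoU over per-voxel class ids.

    pred, target: flat sequences of integer class ids, one per voxel.
    Voxels whose target equals ignore_index are skipped (assumed convention).
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

With `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12.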

中文标题/摘要

标题：全景多模态语义占用预测技术在四足机器人中的应用

全景图像为四足机器人提供了360°的整体视觉覆盖，用于感知。然而，现有的占用预测方法主要针对轮式自主驾驶设计，高度依赖RGB信息，这限制了它们在复杂环境中的鲁棒性。为解决这一问题，(1) 我们提出了PanoMMOcc，这是首个针对四足机器人的全景多模态占用数据集，包含四种传感模态，适用于多种场景。(2) 我们提出了一种全景多模态占用感知框架VoxelHound，专为腿足移动和球形成像设计。具体来说，我们设计了(i) 一种垂直抖动补偿模块(VJC)，以减轻移动过程中由身体俯仰和滚动引起的严重视角偏差，从而实现更一致的空间推理；(ii) 一种有效的多模态信息提示融合模块(MIPF)，该模块联合利用全景视觉线索和辅助模态，以增强体素占用预测。(3) 我们基于PanoMMOcc建立了基准，并提供了详细的数据分析，以在具有挑战性的实体场景中系统评估感知方法。广泛的实验表明，VoxelHound在PanoMMOcc上的表现优于现有方法(+4.16%的mIoU)。数据集和代码将在https://github.com/SXDR/PanoMMOcc公开发布，以促进未来对全景多模态3D感知在实体机器人系统中的研究，同时发布的还有校准工具https://github.com/losehu/CameraLiDAR-Calib。

Summary / 总结

The research aims to improve occupancy prediction for quadruped robots using panoramic multimodal sensing. The authors introduce PanoMMOcc, a new dataset with four sensing modalities for diverse scenes, and propose VoxelHound, a framework that includes a VJC module for viewpoint compensation and an MIPF module for multimodal information fusion. Experiments show that VoxelHound outperforms existing methods by 4.16% in mIoU on PanoMMOcc.

研究旨在通过全景图像提高四足机器人的占用率预测，解决现有方法依赖RGB线索的局限性。研究引入了PanoMMOcc数据集，包含四种传感模态，并提出了VoxelHound框架，该框架包含用于视点补偿的VJC模块和用于多模态信息融合的MIPF模块。实验结果显示，VoxelHound在PanoMMOcc上的mIoU表现优于现有方法，提高了4.16%。

FoV-Net: Rotation-Invariant CAD B-rep Learning via Field-of-View Ray Casting

Authors: Matteo Ballegeer, Dries F. Benoit

Venue: CVPR 2026

First: 2026-02-27T15:21:52+00:00 · Latest: 2026-03-13T16:04:29+00:00

Comments: Manuscript accepted at CVPR 2026

Abs · PDF · Code1 · Code2

Abstract

Learning directly from boundary representations (B-reps) has significantly advanced 3D CAD analysis. However, state-of-the-art B-rep learning methods rely on absolute coordinates and normals to encode global context, making them highly sensitive to rotations. Our experiments reveal that models achieving over 95% accuracy on aligned benchmarks can collapse to as low as 10% under arbitrary $\mathbf{SO}(3)$ rotations. To address this, we introduce FoV-Net, the first B-rep learning framework that captures both local surface geometry and global structural context in a rotation-invariant manner. Each face is represented by a Local Reference Frame (LRF) UV-grid that encodes its local surface geometry, and by Field-of-View (FoV) grids that capture the surrounding 3D context by casting rays and recording intersections with neighboring faces. Lightweight CNNs extract per-face features, which are propagated over the B-rep graph using a graph attention network. FoV-Net achieves state-of-the-art performance on B-rep classification and segmentation benchmarks, demonstrating robustness to arbitrary rotations while also requiring less training data to achieve strong results.
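
The sensitivity described above comes from feeding absolute coordinates to the model. The sketch below shows the underlying geometric fact with plain point sets: coordinates change under rotation, while relative quantities such as pairwise distances do not. It is not FoV-Net's LRF or FoV-grid encoding, and for brevity it rotates only about the z-axis rather than sampling arbitrary SO(3) rotations.

```python
import math


def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = point
    return (c * x - s * y, s * x + c * y, z)


def pairwise_distances(points):
    """All pairwise Euclidean distances; unchanged by any rigid rotation."""
    return [
        math.dist(a, b)
        for i, a in enumerate(points)
        for b in points[i + 1:]
    ]
```

Features built from such relative measurements, for example distances to ray hits expressed in a face's own local frame, inherit this invariance, which is the general idea behind rotation-invariant encodings.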

中文标题/摘要

标题：FoV-Net：通过视场射线投射学习旋转不变的CAD B-rep

直接从边界表示（B-reps）学习显著推进了3D CAD分析。然而，最先进的B-rep学习方法依赖绝对坐标和法线来编码全局上下文，使其对旋转非常敏感。我们的实验表明，在对齐基准上达到95%以上准确率的模型，在任意$\mathbf{SO}(3)$旋转下可能会下降到10%以下。为了解决这一问题，我们提出了FoV-Net，这是第一个以旋转不变的方式同时捕捉局部表面几何和全局结构上下文的B-rep学习框架。每个面由一个局部参考框架（LRF）UV网格表示，编码其局部表面几何，以及通过投射射线并记录与相邻面的交点来捕捉周围3D上下文的视场（FoV）网格。轻量级CNN提取每个面的特征，通过图注意力网络在B-rep图上进行传播。FoV-Net在B-rep分类和分割基准上达到了最先进的性能，展示了对任意旋转的鲁棒性，同时需要较少的训练数据即可获得良好的结果。

Summary / 总结

The research aims to improve the robustness of 3D CAD analysis by addressing the sensitivity of existing B-rep learning methods to rotations. FoV-Net, a novel B-rep learning framework, captures both local surface geometry and global structural context in a rotation-invariant manner. It uses Local Reference Frame UV-grids for local geometry and Field-of-View grids for global context through ray casting. FoV-Net outperforms existing methods on B-rep classification and segmentation benchmarks and shows robustness to arbitrary rotations while requiring less training data.

研究旨在通过解决现有B-rep学习方法对旋转敏感的问题来改进3D CAD分析。FoV-Net作为一种旋转不变框架，同时捕捉局部表面几何和全局结构上下文。它使用局部参考框架UV网格和视场网格来编码面及其周围的环境。FoV-Net在B-rep分类和分割基准测试中表现出色，并且在获得良好结果时需要较少的训练数据，同时显示出对任意旋转的鲁棒性。

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Authors: Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger

First: 2026-02-26T16:27:24+00:00 · Latest: 2026-03-13T15:55:32+00:00

Comments: First two authors contributed equally

Abs · PDF · Code1 · Code2

Abstract

Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, \textbf{decision-theoretic view of steganography}. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the \textbf{steganographic gap} -- a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
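
The intuition behind the steganographic gap, comparing downstream utility for an agent that can decode the hidden content against one that cannot, can be illustrated with a toy estimate. This is a deliberate simplification: it does not reproduce the paper's generalised $\mathcal{V}$-information definition, and the function name, agent callbacks, and punctuation-based encoding in the usage note are hypothetical.

```python
from statistics import mean


def steganographic_gap(samples, decoder_agent, sentinel_agent, utility):
    """Average utility difference between a decoding and a non-decoding agent.

    samples: list of (signal, secret) pairs; agents see only the signal.
    utility(secret, action) scores how useful the agent's action was.
    """
    u_decoder = mean(utility(secret, decoder_agent(signal)) for signal, secret in samples)
    u_sentinel = mean(utility(secret, sentinel_agent(signal)) for signal, secret in samples)
    return u_decoder - u_sentinel
```

As a usage example, suppose a hidden bit is encoded in whether a message ends with a period. A decoder that knows the scheme recovers the bit every time, a sentinel that always guesses 0 is right only when the bit happens to be 0, and the gap is the difference in their average utilities. A gap near zero would suggest the signal carries no extra usable information for the decoder.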

中文标题/摘要

标题：基于决策理论的隐写术形式化及其在大语言模型监控中的应用

大型语言模型开始展现出隐写术能力。这种能力可能使不一致的模型逃避监管机制。然而，缺乏对这种行为进行原理性的检测和量化的方法。经典的隐写术定义和基于它们的检测方法需要已知的非隐写术信号的参考分布。对于大语言模型中的隐写术推理情况，知道这样的参考分布是不可行的；这使得这些方法不适用。我们提出了一种替代的隐写术的\textbf{决策理论视角}。我们的核心见解是，隐写术在能够和不能解码隐写内容的代理之间创造了可用信息的不对称性，而这种潜在的不对称性可以从代理的可观察行为中推断出来。为了形式化这一观点，我们引入了广义的$\mathcal{V}$-信息：一种衡量输入中可用信息量的功利主义框架。我们用它来定义\textbf{隐写术差距}——一种通过比较能够和不能解码隐写内容的代理的下游效用来量化隐写术的度量标准。我们通过实验证明了该形式化方法，并展示了它可以用于检测、量化和缓解大语言模型中的隐写术推理。

Summary / 总结

The paper addresses the challenge of detecting steganographic capabilities in large language models (LLMs) by proposing a decision-theoretic approach. It introduces a new concept called the steganographic gap, which measures the difference in utility between agents who can and cannot decode hidden content within a steganographic signal. The authors validate their method through empirical experiments, demonstrating its effectiveness in detecting and quantifying steganographic reasoning in LLMs.

论文提出了一种决策理论方法来检测大型语言模型（LLM）的隐写术能力。它引入了广义$\mathcal{V}$-信息的概念来衡量可用信息，并定义了隐写术差距来量化能够和不能解码隐藏内容的代理之间的信息不对称。实验表明，这种方法可以有效检测、量化和缓解LLM中的隐写术推理。

EvoLMM: Self-Evolving Large Multimodal Models with Continuous Rewards

Authors: Omkar Thawakar, Shravan Venkatraman, Ritesh Thawkar, Abdelrahman Shaker, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Khan

Venue: CVPR 2026

First: 2025-11-20T18:59:54+00:00 · Latest: 2026-03-13T15:53:44+00:00

Comments: CVPR 2026 (findings)

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent advances in large multimodal models (LMMs) have enabled impressive reasoning and perception abilities, yet most existing training pipelines still depend on human-curated data or externally verified reward models, limiting their autonomy and scalability. In this work, we strive to improve LMM reasoning capabilities in a purely unsupervised fashion (without any annotated data or reward distillation). To this end, we propose a self-evolving framework, named EvoLMM, that instantiates two cooperative agents from a single backbone model: a Proposer, which generates diverse, image-grounded questions, and a Solver, which solves them through internal consistency, where learning proceeds through a continuous self-rewarding process. This dynamic feedback encourages both the generation of informative queries and the refinement of structured reasoning without relying on ground-truth or human judgments. When using the popular Qwen2.5-VL as the base model, our EvoLMM yields consistent gains up to $\sim$3\% on multimodal math-reasoning benchmarks, including ChartQA, MathVista, and MathVision, using only raw training images. We hope our simple yet effective approach will serve as a solid baseline easing future research in self-improving LMMs in a fully-unsupervised fashion. Our code and models are available at https://github.com/mbzuai-oryx/EvoLMM.
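
One way to picture a continuous reward derived from internal consistency is the agreement rate among several sampled Solver answers, sketched below. The abstract does not give the reward formula, so this majority-agreement score, the function name, and the string normalization are assumptions rather than EvoLMM's actual definition.

```python
from collections import Counter


def consistency_reward(answers):
    """Fraction of sampled answers that match the most common answer.

    Returns a value in (0, 1]; 1.0 means every sample agrees.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    top_count = counts.most_common(1)[0][1]
    return top_count / len(answers)
```

Unlike a binary majority-vote signal, this score varies smoothly with the level of agreement, so partially consistent questions still provide a graded learning signal without any ground-truth labels.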

中文标题/摘要

标题：EvoLMM：自我进化的大型多模态模型及其连续奖励

近年来，大型多模态模型（LMMs）的发展使其实现了令人印象深刻的推理和感知能力，但大多数现有的训练管道仍然依赖于人工标注的数据或外部验证的奖励模型，这限制了它们的自主性和可扩展性。在本工作中，我们旨在以完全无监督的方式（无需任何标注数据或奖励蒸馏）提高LMM的推理能力。为此，我们提出了一种自我进化的框架，名为EvoLMM，该框架从单一骨干模型中实例化两个协作的代理：一个提案者，它生成多样化的、基于图像的问题；一个解决者，它通过内部一致性解决这些问题，其中学习过程通过连续的自我奖励机制进行。这种动态反馈机制鼓励生成信息性的问题并改进结构化的推理，而无需依赖真实数据或人工判断。当使用流行的Qwen2.5-VL作为基础模型时，我们的EvoLMM在仅使用原始训练图像的情况下，在多模态数学推理基准测试中，包括ChartQA、MathVista和MathVision，取得了高达约3%的一致性收益。我们希望我们的简单而有效的方法能够为完全无监督的自我改进LMMs的研究提供一个坚实的基线。我们的代码和模型可在https://github.com/mbzuai-oryx/EvoLMM上获得。

Summary / 总结

EvoLMM is a self-evolving framework for large multimodal models (LMMs) that improves reasoning capabilities in an unsupervised manner. It uses a single backbone model to instantiate two agents: a Proposer that generates diverse image-grounded questions, and a Solver that solves them through internal consistency. This continuous self-rewarding process enhances both query generation and structured reasoning. When tested on multimodal math-reasoning benchmarks, EvoLMM achieved consistent gains of up to 3% using only raw training images, without relying on annotated data or human judgments.

EvoLMM 是一种自进化的大型多模态模型框架，不依赖于标注数据或外部奖励来提升推理能力。它使用单一的骨干模型来实例化两个代理：生成器生成多样化的图像相关问题，解决器通过内部一致性解决这些问题。这一持续的自我奖励过程增强了查询生成和结构化推理。在多模态数学推理基准测试中，EvoLMM 使用仅有的原始训练图像实现了高达 3% 的一致改进。
