Comparing Developer and LLM Biases in Code Evaluation
Authors: Aditya Mittal, Ryan Shar, Zichu Wu, Shyam Agarwal, Tongshuang Wu, Chris Donahue, Ameet Talwalkar, Wayne Chi, Valerie Chen
First: 2026-03-25T17:56:55+00:00 · Latest: 2026-03-25T17:56:55+00:00
Abstract
As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three modalities -- chat-based programming, IDE autocompletion, and instructed code editing -- we use TRACE to measure how well LLM judges align with developer preferences. Among 13 different models, the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across interaction modalities, the majority of which correspond to existing software engineering code quality criteria. For example, in chat-based coding, judges are biased towards longer code explanations while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, showing alignment gaps between LLM judges and human preference in realistic coding applications.
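The two measurements described above, judge-human agreement and per-item bias, can be sketched in a few lines. This is a toy illustration only: the function names, the "A"/"B" labels, and the data are hypothetical, and the abstract does not specify how TRACE computes either quantity.

```python
def agreement(human, judge):
    """Fraction of response pairs where the judge picks the same side as the human."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def preference_rate(choices, has_item):
    """Fraction of pairs where the chosen response is the one exhibiting a rubric item."""
    return sum(c == x for c, x in zip(choices, has_item)) / len(choices)

# Hypothetical data: each position is one pair of candidate responses, "A" vs "B".
human = ["A", "B", "A", "B"]    # response preferred by the developer
judge = ["A", "A", "B", "B"]    # response preferred by the LLM judge
longer = ["B", "A", "B", "A"]   # response with the longer code explanation

judge_human_agreement = agreement(human, judge)
human_longer_rate = preference_rate(human, longer)
judge_longer_rate = preference_rate(judge, longer)
```

A gap between `human_longer_rate` and `judge_longer_rate` is the kind of signal that would indicate the judge weighs explanation length differently from developers.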
中文标题/摘要
标题:开发者与LLM在代码评估中的偏见比较
随着LLM在代码应用中越来越多地作为评判者使用,它们应该在能够捕捉部分上下文和模糊意图的现实交互环境中进行评估。我们提出了TRACE(代码评估评分标准分析工具),这是一种框架,用于评估LLM评判者预测人类偏好的能力,并自动提取评分标准项目以揭示人类和模型在评估每个项目时的系统性偏见。在基于聊天的编程、IDE自动完成和指令代码编辑三种模式下,我们使用TRACE来衡量LLM评判者与开发人员偏好的契合度。在13种不同模型中,最佳评判者的表现比人类注释者低12-23%。TRACE识别出35种在不同交互模式下人类与评判者之间显著的不一致来源,其中大多数与现有的软件工程代码质量标准相对应。例如,在基于聊天的编程中,评判者倾向于更长的代码解释,而人类则更偏好简短的解释。我们发现大多数现有的代码质量维度存在显著的不一致,表明在现实的编码应用中,LLM评判者与人类偏好之间存在显著的契合度差距。
Summary / 总结
This study uses TRACE, a framework that measures how well LLM judges predict human preferences and extracts rubric items to expose systematic biases, to compare LLM judges with developers in code evaluation. Across three interaction modalities (chat-based programming, IDE autocompletion, and instructed code editing), the best of 13 judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment, most of which correspond to existing software engineering code quality criteria; for example, in chat-based coding, judges favor longer code explanations while humans prefer shorter ones.
研究使用名为TRACE的框架评估LLM和开发人员在代码评估中的偏见。TRACE衡量LLM裁判在三种编码场景下的表现:基于聊天的编程、IDE自动完成和指令式代码编辑。研究发现,最佳LLM裁判的表现比人类注释员低12-23%,并识别出35个显著的不一致来源,大多数与现有的软件工程代码质量标准相符,例如在基于聊天的编程场景中,裁判偏好更长的代码解释,而人类更偏好较短的解释。
The Stochastic Gap: A Markovian Framework for Pre-Deployment Reliability and Oversight-Cost Auditing in Agentic Artificial Intelligence
Authors: Biplab Pal, Santanu Bhattacharya
First: 2026-03-25T17:56:11+00:00 · Latest: 2026-03-25T17:56:11+00:00
Comments: 22 pages, 5 figures, submitted to Engineering Applications of Artificial Intelligence
Abstract
Agentic artificial intelligence (AI) in organizations is a sequential decision problem constrained by reliability and oversight cost. When deterministic workflows are replaced by stochastic policies over actions and tool calls, the key question is not whether a next step appears plausible, but whether the resulting trajectory remains statistically supported, locally unambiguous, and economically governable. We develop a measure-theoretic Markov framework for this setting. The core quantities are state blind-spot mass B_n(tau), state-action blind mass B^SA_{pi,n}(tau), an entropy-based human-in-the-loop escalation gate, and an expected oversight-cost identity over the workflow visitation measure.
We instantiate the framework on the Business Process Intelligence Challenge 2019 purchase-to-pay log (251,734 cases, 1,595,923 events, 42 distinct workflow actions) and construct a log-driven simulated agent from a chronological 80/20 split of the same process. The main empirical finding is that a large workflow can appear well supported at the state level while retaining substantial blind mass over next-step decisions: refining the operational state to include case context, economic magnitude, and actor class expands the state space from 42 to 668 and raises state-action blind mass from 0.0165 at tau=50 to 0.1253 at tau=1000. On the held-out split, m(s) = max_a pi-hat(a|s) tracks realized autonomous step accuracy within 3.4 percentage points on average.
The same quantities that delimit statistically credible autonomy also determine expected oversight burden. The framework is demonstrated on a large-scale enterprise procurement workflow and is designed for direct application to engineering processes for which operational event logs are available.
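A rough sketch of the quantities named in the abstract follows. It assumes that blind mass is the share of logged visits falling on states (or state-action pairs) observed fewer than tau times, and that the escalation gate hands a step to a human when next-action entropy exceeds a threshold. Both readings, every function name, and the toy log are assumptions; the abstract gives no formal definitions.

```python
import math
from collections import Counter, defaultdict

def blind_mass(counts, tau):
    """Share of total visits that fall on keys observed fewer than tau times."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for c in counts.values() if c < tau) / total

def policy_from_log(transitions):
    """Empirical next-action distribution pi_hat(a|s) from (state, action) pairs."""
    by_state = defaultdict(Counter)
    for state, action in transitions:
        by_state[state][action] += 1
    return {
        s: {a: n / sum(c.values()) for a, n in c.items()}
        for s, c in by_state.items()
    }

def max_confidence(pi_s):
    """m(s) = max_a pi_hat(a|s)."""
    return max(pi_s.values())

def entropy(pi_s):
    """Shannon entropy (in nats) of the next-action distribution."""
    return -sum(p * math.log(p) for p in pi_s.values() if p > 0)

def should_escalate(pi_s, max_entropy):
    """Hypothetical gate: escalate to a human when the next step is too ambiguous."""
    return entropy(pi_s) > max_entropy

# Toy event log of (state, action) pairs; not taken from the BPI 2019 data.
log = [("create", "approve")] * 3 + [("create", "reject")] + [("rare", "approve")]
state_counts = Counter(s for s, _ in log)
state_action_counts = Counter(log)
pi_hat = policy_from_log(log)

state_blind = blind_mass(state_counts, tau=2)
state_action_blind = blind_mass(state_action_counts, tau=2)
```

On this toy log the state-action blind mass is larger than the state blind mass at the same tau, which mirrors the abstract's point that a workflow can look well supported at the state level while next-step decisions are not.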
Summary / 总结
The paper addresses the challenge of ensuring reliability and oversight in agentic AI workflows, which are inherently stochastic. It develops a measure-theoretic Markov framework to evaluate the statistical support and economic governability of AI decisions. Key findings include that a large workflow can appear well supported at the state level but still retain significant blind mass in next-step decisions, and that refining the operational state increases the state-action blind mass. The framework shows that the quantities that define statistically credible autonomy also determine the expected oversight burden.
论文针对代理型人工智能工作流中的可靠性和监督挑战,提出了一种测度论马尔可夫框架来评估AI决策的统计支持和经济可治理性。主要发现包括:一个大型工作流在状态层面可能看起来支持良好,但在下一步决策上仍可能保留大量盲区;通过细化操作状态,状态-动作盲区质量会增加。该框架还表明,定义统计上可信自主性的相同量度也决定了预期的监督负担。
Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA
Authors: Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam
First: 2026-03-25T17:54:39+00:00 · Latest: 2026-03-25T17:54:39+00:00
Abstract
Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
中文标题/摘要
标题:检索改进未必保证更好答案:RAG在AI政策问答中的研究
增强检索生成(RAG)系统在分析复杂政策文件方面越来越普遍,但在以密集法律语言和不断演变、重叠的监管框架为特征的领域中,实现专家级使用的足够可靠性仍然具有挑战性。我们使用AI治理和监管档案库(AGORA)语料库,该语料库包含947份AI政策文件,研究RAG在AI治理和政策分析中的应用。我们的系统结合了基于ColBERT的检索器,该检索器通过对比学习进行了微调,以及通过直接偏好优化(DPO)与人类偏好对齐的生成器。我们构建了合成查询并收集了成对偏好,以使系统适应政策领域。通过评估检索质量、答案相关性和忠实性实验,我们发现领域特定的微调可以提高检索指标,但并不一致地提高端到端的问答性能。在某些情况下,更强的检索反而会导致更自信的虚构,当相关文件未包含在语料库中时。这些结果突显了构建政策导向RAG系统的关键问题:单个组件的改进未必转化为更可靠的答案。我们的研究结果为设计动态监管语料库上的基于问题的问答系统提供了实用见解。
Summary / 总结
The study investigates the application of RAG systems to AI governance and policy analysis using the AGORA corpus. It combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences with DPO. Experiments show that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance, and that stronger retrieval sometimes leads to more confident hallucinations when relevant documents are missing from the corpus. This indicates that improvements in individual components may not ensure more reliable answers in policy-focused RAG systems.
研究使用AGORA语料库探讨了检索增强生成(RAG)系统在AI治理和政策分析中的应用。系统结合了通过对比学习微调的基于ColBERT的检索器和通过直接偏好优化(DPO)与人类偏好对齐的生成器。实验表明,尽管领域特定的微调可以提高检索指标,但并不一定能提升端到端的问答性能。在某些情况下,当相关文档不在语料库中时,更强的检索反而会导致更自信的幻觉。这些结果表明,在构建政策导向的RAG系统时,单个组件的改进并不一定能保证更可靠的答案。
MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination
Authors: Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
First: 2026-03-25T17:54:10+00:00 · Latest: 2026-03-25T17:54:10+00:00
Abstract
Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is at https://github.com/Qwen-Applications/MARCH.
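The information asymmetry described above can be illustrated with a minimal sketch of the control flow. The three agents here are hypothetical toy functions rather than LLM calls, the function names are assumptions, and the multi-agent reinforcement learning step is omitted entirely.

```python
def march_pipeline(question, evidence, solver, proposer, checker):
    """Run Solver -> Proposer -> Checker; the Checker never sees the Solver's answer."""
    answer = solver(question, evidence)
    claims = proposer(answer)
    # Each claim is checked against the evidence alone: `answer` is not passed in.
    verdicts = [(claim, checker(claim, evidence)) for claim in claims]
    return answer, verdicts

# Toy stand-ins for the three agents (assumptions for illustration only).
def toy_solver(question, evidence):
    return "Paris is the capital of France. It has 90 million residents."

def toy_proposer(answer):
    # Split the answer into sentence-level claims.
    return [s.strip() for s in answer.split(".") if s.strip()]

def toy_checker(claim, evidence):
    # Naive substring support test; a real checker would be a trained model.
    return any(claim.lower() in passage.lower() for passage in evidence)

evidence = ["Paris is the capital of France and its largest city."]
answer, verdicts = march_pipeline(
    "What is the capital of France?", evidence, toy_solver, toy_proposer, toy_checker
)
```

Because the checker's signature takes only a claim and the evidence, it cannot inherit the solver's framing, which is the property the abstract credits with breaking self-confirmation.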
中文标题/摘要
标题:MARCH:多智能体强化自我检查以应对LLM幻觉
幻觉仍然是大型语言模型(LLM)的关键瓶颈,削弱了它们在实际应用中的可靠性,尤其是在检索增强生成(RAG)系统中。虽然现有的幻觉检测方法使用LLM作为裁判来验证LLM输出与检索证据的一致性,但它们会受到固有的确认偏见的影响,即验证者无意中复制了原始生成的错误。为了解决这一问题,我们提出了多智能体强化自我检查以应对幻觉(MARCH)框架,该框架通过利用故意的信息不对称来确保严格的事实一致性。MARCH 组织了一个协作的三个专门智能体的管道:求解器、提议者和检查者。求解器生成初始的RAG响应,提议者将其分解为可验证的原子命题。关键的是,检查者在孤立的情况下验证这些命题与检索到的证据,不接触求解器的原始输出。这种精心设计的信息不对称方案打破了自我确认偏见的循环。通过使用多智能体强化学习(MARL)训练此管道,我们使智能体能够共同进化并优化事实一致性。在幻觉基准测试中的广泛实验表明,MARCH 显著降低了幻觉率。值得注意的是,配备MARCH的8B参数LLM在性能上与强大的闭源模型相当。MARCH 为LLM通过共同进化实现事实自我改进铺平了可扩展的道路。代码位于https://github.com/Qwen-Applications/MARCH。
Summary / 总结
MARCH addresses hallucination in large language models (LLMs) with a framework trained by multi-agent reinforcement learning. It uses a collaborative pipeline of a Solver, a Proposer, and a Checker to enforce factual alignment. The Checker validates claims against retrieved evidence without seeing the Solver's output, which reduces self-confirmation bias. Experiments show that MARCH substantially reduces hallucination rates and that an 8B-parameter model equipped with MARCH performs competitively with powerful closed-source models.
MARCH 是一个框架,旨在通过多智能体强化学习方法减少大型语言模型(LLMs)中的幻觉。它引入了一个协作管道,包含三个专门的智能体:Solver、Proposer 和 Checker。Checker 独立于 Solver 的输出验证声明,打破自我确认偏见。实验表明,MARCH 显著降低了幻觉率,一个 8B 参数的 LLM 达到了与闭源模型相当的性能。
Vision-Language Models vs Human: Perceptual Image Quality Assessment
Authors: Imran Mehmood, Imad Ali Shah, Ming Ronnier Luo, Brian Deegan
First: 2026-03-25T17:54:07+00:00 · Latest: 2026-03-25T17:54:07+00:00
Abstract
Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness than to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clearly expressed.
中文标题/摘要
标题:视觉语言模型与人类:感知图像质量评估
心理物理实验仍然是最可靠的感知图像质量评估(IQA)方法,但其成本和有限的可扩展性促使了自动化方法的发展。我们研究视觉语言模型(VLMs)是否能在对比度、色彩饱和度和总体偏好三个图像质量尺度上逼近人类的感知判断。六种VLMs(四种专有模型和两种开源模型)被与心理物理数据进行基准测试。本研究通过与人类心理物理数据的比较,系统地评估了VLMs在感知IQA中的表现。结果表明,色彩饱和度的属性依赖性变异模型与人类高度一致(ρ最高可达0.93),但在对比度上表现较差,反之亦然。属性加权分析进一步显示,大多数VLMs在评估总体偏好时,对色彩饱和度的权重高于对比度,这与心理物理数据相似。模型内部一致性分析揭示了一个反直觉的权衡:最一致的模型未必最接近人类,表明响应变异反映了对场景依赖性感知线索的敏感性。此外,人类与VLMs的一致性随着感知可分辨性的增加而提高,表明当刺激差异明显时,VLMs更为可靠。
Is Multilingual LLM Watermarking Truly Multilingual? Scaling Robustness to 100+ Languages via Back-Translation
Authors: Asim Mohamed, Martin Gubri
First: 2025-10-20T18:51:20+00:00 · Latest: 2026-03-25T17:52:45+00:00
Abstract
Multilingual watermarking aims to make large language model (LLM) outputs traceable across languages, yet current methods still fall short. Despite claims of cross-lingual robustness, they are evaluated only on high-resource languages. We show that existing multilingual watermarking methods are not truly multilingual: they fail to remain robust under translation attacks in medium- and low-resource languages. We trace this failure to semantic clustering, which fails when the tokenizer vocabulary contains too few full-word tokens for a given language. To address this, we introduce STEAM, a detection method that uses Bayesian optimisation to search among 133 candidate languages for the back-translation that best recovers the watermark strength. It is compatible with any watermarking method, robust across different tokenizers and languages, non-invasive, and easily extendable to new languages. With average gains of +0.23 AUC and +37% TPR@1%, STEAM provides a scalable approach toward fairer watermarking across the diversity of languages.
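The detection idea above, picking the back-translation that best recovers watermark strength, can be sketched as a search over candidate languages. Note the substitution: this sketch uses a plain exhaustive loop instead of the Bayesian optimisation the abstract describes, and the translator, detector, and token set are hypothetical toy stand-ins.

```python
def best_back_translation(text, languages, back_translate, watermark_score):
    """Return the candidate language whose back-translation scores highest."""
    best_lang, best_score = None, float("-inf")
    for lang in languages:
        candidate = back_translate(text, lang)
        score = watermark_score(candidate)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang, best_score

# Toy detector: share of tokens in a hypothetical "green list".
GREEN_TOKENS = {"the", "quick", "fox"}

def toy_watermark_score(text):
    tokens = text.split()
    return sum(t in GREEN_TOKENS for t in tokens) / len(tokens) if tokens else 0.0

# Toy translator: canned outputs keyed by the assumed source language.
CANNED = {"fr": "a fast fox", "de": "the quick fox", "sw": "some animal"}

def toy_back_translate(text, lang):
    return CANNED[lang]

lang, score = best_back_translation(
    "suspect text", ["fr", "de", "sw"], toy_back_translate, toy_watermark_score
)
```

With 133 candidate languages and a real translation model, each evaluation is expensive, which is presumably why the authors search with Bayesian optimisation rather than scoring every language as this loop does.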
中文标题/摘要
标题:多语言LLM水印真能适用于100多种语言吗?通过反向翻译扩展鲁棒性
多语言水印旨在使大型语言模型(LLM)输出在多种语言中可追溯,但当前方法仍存在不足。尽管声称具有跨语言鲁棒性,但它们仅在高资源语言上进行评估。我们表明,现有的多语言水印方法并非真正多语言:它们在中低资源语言下的翻译攻击中无法保持鲁棒性。我们追踪这一失败的原因在于语义聚类,当分词词汇表中包含的完整单词令牌数量不足以表示给定语言时,语义聚类会失效。为解决这一问题,我们引入了STEAM,这是一种使用贝叶斯优化在133种候选语言中搜索最佳反向翻译以恢复水印强度的检测方法。它与任何水印方法兼容,跨不同分词器和语言鲁棒,非侵入性且易于扩展到新语言。通过平均AUC提高0.23和TPR@1%提高37%,STEAM提供了一种针对语言多样性进行公平水印的可扩展方法。
Summary / 总结
The research aims to enhance the robustness of multilingual watermarking methods across a wide range of languages, addressing their limitations in medium- and low-resource languages. The study introduces STEAM, a method using Bayesian optimization to find the best back-translation among 133 languages, which improves watermark recovery by an average of 0.23 AUC and 37% TPR@1%. This approach is compatible with various watermarking techniques and languages, offering a scalable solution for fairer watermarking.
研究旨在增强多语言水印方法在多种语言中的鲁棒性,解决其在中低资源语言中的局限性。研究引入了STEAM方法,使用贝叶斯优化在133种语言中寻找最佳反向翻译,平均提高水印恢复的AUC 0.23和TPR@1% 37%。该方法兼容多种水印技术及语言,提供了一种针对语言多样性的公平水印的可扩展解决方案。
Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
Authors: Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, Jianfei Yang
First: 2026-03-25T17:52:43+00:00 · Latest: 2026-03-25T17:52:43+00:00
Comments: Code is available at https://github.com/gxyes/MARS_Chameleon
Abstract
Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.
中文标题/摘要
标题:变色龙:长时程机器人操作的记忆
机器人操作通常需要记忆:遮挡和状态变化可能会使决策时的观察在感知上产生混淆,使得在观察层面的动作选择非马尔可夫,因为相同的观察可能来自不同的交互历史。大多数具身智能体通过语义压缩轨迹和基于相似性的检索来实现记忆,这会丢弃区分性的细粒度感知线索,并可能返回感知上相似但决策无关的事件。受人类事件记忆的启发,我们提出了变色龙,它将几何基础的多模态令牌写入以保留区分性上下文,并通过可微分的记忆堆栈实现目标导向的检索。我们还引入了Camo-数据集,这是一个跨越事件回忆、空间跟踪和感知混淆下的顺序操作的UR5e真实机器人数据集。在各种任务中,变色龙在感知上混淆的环境中始终能够提高决策可靠性和长时程控制,优于强大的基线。
Summary / 总结
The research aims to address the non-Markovian nature of robotic manipulation due to perceptual aliasing caused by occlusions and state changes. Chameleon, a system inspired by human episodic memory, uses geometry-grounded multimodal tokens to preserve disambiguating context and a differentiable memory stack for goal-directed recall. Experiments show that Chameleon improves decision reliability and long-horizon control in perceptually confusable settings compared to strong baselines.
论文解决了机器人操作中由于感知混叠导致决策非马尔可夫性的记忆问题。提出了Chameleon,使用几何导向的多模态令牌来保存区分性上下文,并通过可微记忆堆栈实现目标导向的回忆。实验表明,Chameleon在感知混叠的场景中相比强基线提高了决策可靠性和长时控制能力。
VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
Authors: Qijia He, Xunmei Liu, Hammaad Memon, Ziang Li, Zixian Ma, Jaemin Cho, Jason Ren, Daniel S Weld, Ranjay Krishna
First: 2026-03-25T17:52:23+00:00 · Latest: 2026-03-25T17:52:23+00:00
Abstract
Scalable Vector Graphics (SVG) are an essential format for technical illustration and digital design, offering precise resolution independence and flexible semantic editability. In practice, however, original vector source files are frequently lost or inaccessible, leaving only "flat" rasterized versions (e.g., PNG or JPEG) that are difficult to modify or scale. Manually reconstructing these figures is a prohibitively labor-intensive process, requiring specialized expertise to recover the original geometric intent. To bridge this gap, we propose VFIG, a family of Vision-Language Models trained for complex and high-fidelity figure-to-SVG conversion. While this task is inherently data-driven, existing datasets are typically small-scale and lack the complexity of professional diagrams. We address this by introducing VFIG-DATA, a large-scale dataset of 66K high-quality figure-SVG pairs, curated from a diverse mix of real-world paper figures and procedurally generated diagrams. Recognizing that SVGs are composed of recurring primitives and hierarchical local structures, we introduce a coarse-to-fine training curriculum that begins with supervised fine-tuning (SFT) to learn atomic primitives and transitions to reinforcement learning (RL) refinement to optimize global diagram fidelity, layout consistency, and topological edge cases. Finally, we introduce VFIG-BENCH, a comprehensive evaluation suite with novel metrics designed to measure the structural integrity of complex figures. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, achieving a VLM-Judge score of 0.829 on VFIG-BENCH.
中文标题/摘要
标题:VFIG:使用视觉语言模型在SVG中矢量化复杂图形
可缩放矢量图形(SVG)是技术插图和数字设计中不可或缺的格式,提供精确的分辨率独立性和灵活的语义可编辑性。然而,在实践中,原始矢量源文件经常丢失或不可访问,只剩下难以修改或缩放的“扁平”位图版本(例如,PNG或JPEG)。手动重建这些图形是一个劳动密集型过程,需要专门的技能来恢复原始的几何意图。为了解决这一问题,我们提出了VFIG,一种用于复杂和高保真图形到SVG转换的视觉语言模型家族。尽管这项任务本质上是数据驱动的,但现有数据集通常规模较小且缺乏专业图表的复杂性。为此,我们引入了VFIG-DATA,这是一个包含66,000个高质量图形-SVG配对的大规模数据集,这些配对是从各种真实世界论文图形和程序生成的图表中精心挑选出来的。认识到SVG由反复出现的基本元素和分层局部结构组成,我们引入了一种从监督微调(SFT)开始的粗到细的训练课程,学习基本元素,然后过渡到强化学习(RL)优化以优化全局图表保真度、布局一致性以及拓扑边缘情况。最后,我们引入了VFIG-BENCH,这是一个全面的评估套件,包含新的度量标准,用于衡量复杂图形的结构完整性。VFIG在开源模型中达到了最先进的性能,并且与GPT-5.2的表现相当,在VFIG-BENCH上的VLM-Judge得分为0.829。
Summary / 总结
VFIG is a family of Vision-Language Models for converting complex rasterized figures into Scalable Vector Graphics (SVG). It is trained on VFIG-DATA, a large-scale dataset of 66K figure-SVG pairs, with a coarse-to-fine curriculum that starts with supervised fine-tuning and transitions to reinforcement learning. VFIG achieves state-of-the-art performance among open-source models and performs on par with GPT-5.2, with a VLM-Judge score of 0.829 on VFIG-BENCH, which evaluates the structural integrity of complex figures.
VFIG 是一个视觉-语言模型系列,用于将复杂的栅格化图形转换为可缩放矢量图形 (SVG)。它使用包含 66,000 个图形-SVG 对的大规模数据集 VFIG-DATA,并采用由粗到细的训练课程:先通过监督微调学习基本图元,再过渡到强化学习以优化全局保真度和布局一致性。VFIG 在开源模型中达到最先进水平,与 GPT-5.2 表现相当,在 VFIG-BENCH 上的 VLM-Judge 得分为 0.829。
Towards Training-Free Scene Text Editing
Authors: Yubo Li, Xugong Qin, Peng Zhang, Hailun Lin, Gangyan Zeng, Kexin Zhang
Venue: CVPR 2026
First: 2026-03-25T17:50:31+00:00 · Latest: 2026-03-25T17:50:31+00:00
Comments: Accepted by CVPR 2026
Abstract
Scene text editing seeks to modify textual content in natural images while maintaining visual realism and semantic consistency. Existing methods often require task-specific training or paired data, limiting their scalability and adaptability. In this paper, we propose TextFlow, a training-free scene text editing framework that integrates the strengths of Attention Boost (AttnBoost) and Flow Manifold Steering (FMS) to enable flexible, high-fidelity text manipulation without additional training. Specifically, FMS preserves the structural and style consistency by modeling the visual flow of characters and background regions, while AttnBoost enhances the rendering of textual content through attention-based guidance. By jointly leveraging these complementary modules, our approach performs end-to-end text editing through semantic alignment and spatial refinement in a plug-and-play manner. Extensive experiments demonstrate that our framework achieves visual quality and text accuracy comparable to or superior to those of training-based counterparts, generalizing well across diverse scenes and languages. This study advances scene text editing toward a more efficient, generalizable, and training-free paradigm. Code is available at https://github.com/lyb18758/TextFlow
中文标题/摘要
标题:朝向无需训练的场景文本编辑
场景文本编辑旨在修改自然图像中的文本内容,同时保持视觉真实性和语义一致性。现有方法通常需要特定任务的训练或配对数据,限制了其可扩展性和适应性。在本文中,我们提出了一种无需训练的场景文本编辑框架TextFlow,该框架结合了Attention Boost (AttnBoost) 和Flow Manifold Steering (FMS) 的优点,能够在无需额外训练的情况下实现灵活、高保真的文本操作。具体而言,FMS通过建模字符和背景区域的视觉流来保持结构和风格的一致性,而AttnBoost则通过基于注意力的指导增强文本内容的渲染。通过联合利用这些互补模块,我们的方法以语义对齐和空间细化的方式实现端到端的文本编辑。大量实验表明,我们的框架在视觉质量和文本准确性方面与基于训练的方法相当或更优,并且在多种场景和语言中具有良好的泛化能力。这项研究推动了场景文本编辑向更高效、更具通用性和无需训练的范式发展。代码可在https://github.com/lyb18758/TextFlow 获取
Summary / 总结
The paper addresses the challenge of scene text editing by proposing TextFlow, a training-free framework that combines Attention Boost and Flow Manifold Steering to maintain visual realism and semantic consistency. The method enables flexible and high-fidelity text manipulation without additional training, achieving comparable or superior visual quality and text accuracy to training-based methods across various scenes and languages. Experiments show its effectiveness and generalizability.
论文提出了一种无需训练的场景文本编辑框架TextFlow,结合了AttnBoost和FMS。AttnBoost通过注意力机制增强文本渲染,而FMS通过建模视觉流来保持结构和风格的一致性。实验表明,TextFlow在视觉质量和文本准确性方面达到了与或优于基于训练的方法的水平,并且在各种场景和语言中具有良好的泛化能力。
Anti-I2V: Safeguarding your photos from malicious image-to-video generation
Authors: Duc Vu, Anh Nguyen, Chi Tran, Anh Tran
Venue: CVPR 2026
First: 2026-03-25T17:48:10+00:00 · Latest: 2026-03-25T17:48:10+00:00
Comments: Accepted to CVPR 2026 (Main Conference)
Abstract
Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the L*a*b* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.
中文标题/摘要
标题:Anti-I2V:保护照片免受恶意图像到视频生成的侵害
基于扩散的视频生成模型的进步虽然显著提高了人类动画的质量,但也带来了通过特定人的照片和文本提示生成假视频的滥用风险。最近的努力集中在对抗性攻击上,通过引入精心构造的扰动来保护图像免受扩散模型的影响。然而,大多数现有方法针对的是图像生成,而相对较少的方法明确地针对图像到视频扩散模型(VDMs),并且大多数主要集中在基于UNet的架构上。因此,它们对扩散变换器(DiT)模型的有效性仍然很大程度上未被探索,因为这些模型由于更大的容量和先进的注意力机制,表现出改进的特征保留和更强的时间一致性。在本文中,我们引入了Anti-I2V,这是一种针对恶意人类图像到视频生成的新型防御方法,适用于各种扩散基础架构。Anti-I2V 不仅在 RGB 空间中更新噪声,还在 L*a*b* 和频域中操作,提高了鲁棒性并集中在显著像素上。然后,我们确定了在去噪过程中捕捉到最独特语义特征的网络层,设计了适当的训练目标,以最大化时间连贯性和生成保真度的降级。通过广泛的验证,Anti-I2V 在多种视频扩散模型中展示了最先进的防御性能,提供了一个有效的问题解决方案。
Summary / 总结
The research aims to protect individuals from the misuse of their photos in malicious video generation by diffusion-based models. Anti-I2V is a novel defense method that operates in both the L*a*b* and frequency domains, focusing on salient pixels to enhance robustness. It identifies the network layers that capture the most distinct semantic features and uses them to design objectives that degrade temporal coherence and generation fidelity, achieving state-of-the-art defense performance against diverse video diffusion models.
研究动机是防止基于扩散的视频生成模型被滥用,这些模型可以从一个人的照片生成假视频。主要方法是Anti-I2V,它在$L$*$a$*$b$*和频域中操作,并专注于显著像素。关键实验发现表明,Anti-I2V在对抗各种视频扩散模型时表现出色,提供了一个有效的防恶意图像到视频生成问题的解决方案。
POLY-SIM: Polyglot Speaker Identification with Missing Modality Grand Challenge 2026 Evaluation Plan
Authors: Marta Moscati, Muhammad Saad Saeed, Marina Zanoni, Mubashir Noman, Rohan Kumar Das, Monorama Swain, Yufang Hou, Elisabeth Andre, Khalid Mahmood Malik, Markus Schedl, Shah Nawaz
Venue: ACM MM 2026
First: 2026-03-25T17:47:00+00:00 · Latest: 2026-03-25T17:47:00+00:00
Comments: Grand challenge at ACM MM 2026
Abstract
Multimodal speaker identification systems typically assume the availability of complete and homogeneous audio-visual modalities during both training and testing. However, in real-world applications, such assumptions often do not hold. Visual information may be missing due to occlusions, camera failures, or privacy constraints, while multilingual speakers introduce additional complexity due to linguistic variability across languages. These challenges significantly affect the robustness and generalization of multimodal speaker identification systems. The POLY-SIM Grand Challenge 2026 aims to advance research in multimodal speaker identification under missing-modality and cross-lingual conditions. Specifically, the Grand Challenge encourages the development of robust methods that can effectively leverage incomplete multimodal inputs while maintaining strong performance across different languages. This report presents the design and organization of the POLY-SIM Grand Challenge 2026, including the dataset, task formulation, evaluation protocol, and baseline model. By providing a standardized benchmark and evaluation framework, the challenge aims to foster progress toward more robust and practical multimodal speaker identification systems.
中文标题/摘要
标题:POLY-SIM:多语种缺失模态演讲者识别2026年挑战计划
多模态演讲者识别系统通常假设在训练和测试过程中音频-视觉模态是完整且一致的。然而,在实际应用中,这种假设往往不成立。视觉信息可能由于遮挡、摄像机故障或隐私限制而缺失,而多语种演讲者则由于语言间的差异性增加了额外的复杂性。这些挑战显著影响了多模态演讲者识别系统的鲁棒性和泛化能力。POLY-SIM 2026 大挑战旨在推进在缺失模态和跨语言条件下多模态演讲者识别的研究。具体而言,该挑战鼓励开发能够有效利用不完整多模态输入并保持在不同语言中强大性能的稳健方法。本报告介绍了POLY-SIM 2026 大挑战的设计和组织,包括数据集、任务定义、评估协议和基线模型。通过提供标准化的基准和评估框架,该挑战旨在促进更稳健和实用的多模态演讲者识别系统的发展。
Summary / 总结
The POLY-SIM Grand Challenge 2026 addresses the robustness of multimodal speaker identification systems under missing-modality and cross-lingual conditions. The challenge encourages methods that can effectively use incomplete multimodal inputs and perform well across different languages. The report presents the design and organization of the challenge, including the dataset, task formulation, evaluation protocol, and baseline model, with the aim of providing a standardized benchmark for research in this area.
POLY-SIM 2026 大挑战旨在解决在缺失模态和跨语言条件下多模态说话人识别系统的鲁棒性问题。它评估能够有效利用不完整多模态输入并在不同语言下保持良好性能的方法。该挑战包括标准化的数据集、任务定义和评估协议,以推动该领域的研究进展。
Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction
Authors: Haresh Rengaraj Rajamohan, Xiang Gao, Weicheng Zhu, Shih-Lun Huang, Long Chen, Gabe Schulman, Huizhen Jin, Shengduo Li, Yixuan Wang, Huidi Yang, Kyunghyun Cho, Cem M. Deniz, Narges Razavian
First: 2026-03-25T17:42:47+00:00 · Latest: 2026-03-25T17:42:47+00:00
Abstract
While large-scale pretraining has revolutionized language modeling, its potential remains underexplored in healthcare with structured electronic health records (EHRs). We present RAVEN, a novel generative pretraining strategy for sequential EHR data based on Recurrence-Aware next-Visit EveNt prediction. Leveraging a dataset of over one million unique individuals, our model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. We introduce regularization on predicting repeated events and highlight a key pitfall in EHR-based foundation model evaluations: repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences. Furthermore, we empirically investigate the scaling behaviors in a data-constrained, compute-saturated regime, showing that simply increasing model size is suboptimal without commensurate increases in data volume. We evaluate our model via zero-shot prediction for forecasting the incidence of a diverse set of diseases, where it rivals fully fine-tuned representation-based Transformer models and outperforms widely used simulation-based next-token approaches. Finally, without additional parameter updates, we show that RAVEN can generalize to an external patient cohort under lossy clinical code mappings and feature coverage gaps.
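The evaluation pitfall described above can be made concrete with a small sketch that splits next-visit targets into new onsets and repeats, then scores a trivial "repeat the history" predictor on each. The function names and the toy codes are assumptions, and recall is used here only as a simple stand-in for whatever metrics the paper reports.

```python
def split_onset_and_repeat(history_visits, next_visit):
    """Split next-visit event codes into new onsets and repeats of earlier events."""
    seen = set().union(*history_visits)
    target = set(next_visit)
    return target - seen, target & seen

def recall(predicted, targets):
    """Fraction of target codes that were predicted; None if there are no targets."""
    if not targets:
        return None
    return len(set(predicted) & set(targets)) / len(targets)

# Toy patient record; the codes are illustrative only.
history = [{"E11", "I10"}, {"I10"}]
next_visit = {"I10", "N18"}

onsets, repeats = split_onset_and_repeat(history, next_visit)

# Trivial baseline: predict that every previously seen code recurs.
baseline_prediction = set().union(*history)

overall_recall = recall(baseline_prediction, next_visit)
onset_recall = recall(baseline_prediction, onsets)
```

The baseline earns a nonzero overall score purely from the repeated code while predicting no new onset, which is how repeated event tokens can inflate metrics when the two cases are pooled.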
中文标题/摘要
标题:基于下次就诊预测的复发感知基础模型在临床记录中的扩展
尽管大规模预训练已经革新了语言模型,但在医疗保健领域,特别是在结构化的电子健康记录(EHRs)中,其潜力尚未得到充分探索。我们提出了RAVEN,一种基于复发感知下次就诊事件预测的新颖生成预训练策略,用于序贯EHR数据。利用超过一百万独特个体的数据集,我们的模型学习在患者历史的基础上自回归生成标记化的临床事件,以预测下次就诊。我们引入了预测重复事件的正则化,并指出基于EHR的基础模型评估中的一个关键问题:重复事件标记可以因未能区分初次发病和后续发生而虚增性能指标。此外,我们在数据受限、计算饱和的环境中实证研究了扩展行为,表明单纯增加模型规模在没有相应增加数据量的情况下是次优的。我们通过零样本预测评估了该模型,用于预测多种疾病发病率,其性能与完全微调的基于表示的Transformer模型相当,并优于广泛使用的基于模拟的下一个标记方法。最后,在没有额外参数更新的情况下,我们展示了RAVEN在临床代码映射和特征覆盖缺口下可以泛化到外部患者群体。
Summary / 总结
RAVEN is a novel generative pretraining strategy for sequential electronic health records (EHRs) based on recurrence-aware next-visit event prediction. Trained on a dataset of over one million unique individuals, the model learns to autoregressively generate tokenized clinical events for the next visit conditioned on patient history. The paper introduces regularization for repeated events, shows that repeated event tokens can inflate performance metrics when new onsets are not distinguished from subsequent occurrences, and demonstrates that simply increasing model size without commensurate increases in data volume is suboptimal. In zero-shot disease forecasting, RAVEN rivals fully fine-tuned representation-based Transformer models and outperforms simulation-based next-token approaches, and it generalizes to an external patient cohort under lossy clinical code mappings and feature coverage gaps.
RAVEN 是一种基于复发感知下次访问预测的临床记录生成预训练策略。它利用患者历史信息自回归地生成下一个访问的临床事件。关键发现包括在性能指标中区分新发事件和重复事件的重要性,以及单纯增加模型规模而没有更多数据的次优扩展行为。RAVEN 在零样本疾病预测中优于基于模拟的方法,并且在临床代码映射和特征覆盖存在差距的情况下也能很好地泛化。
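The evaluation pitfall named in the abstract (repeated event tokens inflating metrics) can be illustrated with a small sketch. It assumes each visit is a set of event codes and scores a trivial "copy the past" predictor. The function names and toy codes are hypothetical and are not taken from RAVEN's evaluation code.

```python
def split_new_and_repeated(history, next_visit):
    """Partition the true next-visit events into new onsets and repeats."""
    seen = set().union(*history) if history else set()
    new_onsets = set(next_visit) - seen
    repeats = set(next_visit) & seen
    return new_onsets, repeats


def recall(predicted, targets):
    """Fraction of target events that were predicted; None if there are no targets."""
    if not targets:
        return None
    return len(set(predicted) & set(targets)) / len(targets)


# Toy patient: two past visits and one true next visit (codes are made up).
history = [{"E11", "I10"}, {"E11", "I10", "N18"}]
next_visit = {"E11", "I10", "J45"}

# A trivial baseline that simply predicts every code seen so far.
predicted = set().union(*history)

new_onsets, repeats = split_new_and_repeated(history, next_visit)
overall_recall = recall(predicted, next_visit)      # counts repeats as hits
new_onset_recall = recall(predicted, new_onsets)    # counts only first occurrences
```

On this toy case the copy baseline recovers two of three next-visit events overall but none of the new onsets, which is the kind of gap that appears when the two are not reported separately.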
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Authors: Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao
First: 2026-02-18T14:19:01+00:00 · Latest: 2026-03-25T17:42:42+00:00
Comments: 8 pages
Abstract
Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures. We propose Team-of-Thoughts, a heterogeneous MAS framework that treats diverse models as specialized tools within an orchestrator-driven paradigm. Team-of-Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment, a protocol where tool agents profile their own domain-specific strengths to guide selection. At inference, the orchestrator dynamically activates the most compatible agents based on these profiles to maximize capability coverage. Across five mathematical reasoning and code generation benchmarks, Team-of-Thoughts consistently outperforms individual models and existing MAS baselines. Notably, on AIME24 and LiveCodeBench, Team-of-Thoughts achieves 96.00% and 77.91% accuracy, respectively, significantly improving over homogeneous role-play baselines (80.00% and 65.93%).
中文标题/摘要
标题:思想团队:通过协调工具调用实现代理系统测试时高效扩展
现有的多代理系统(MAS)通常依赖于同质模型配置,未能充分利用不同后训练架构中固有的多样化专业知识。我们提出了一种异构MAS框架——思想团队,该框架将不同的模型视为在协调器驱动范式下的专门工具。思想团队引入了两个新颖的组件:(1)协调器校准,用于识别具有更优协调和综合能力的模型;(2)代理自我评估,一种工具代理自我评估其领域特定优势的协议,以指导选择。在推理时,协调器根据这些评估动态激活最兼容的代理,以最大化能力覆盖。在五个数学推理和代码生成基准测试中,思想团队始终优于单一模型和现有MAS基线。值得注意的是,在AIME24和LiveCodeBench上,思想团队分别达到了96.00%和77.91%的准确率,显著优于同质角色扮演基线(80.00%和65.93%)。
Summary / 总结
The research aims to enhance the efficiency and performance of Multi-Agent Systems (MAS) by leveraging the diverse expertise of different post-trained models. It introduces Team-of-Thoughts, a heterogeneous MAS framework that includes an Orchestrator Calibration component to identify models with superior coordination and synthesis capabilities, and an Agent Self-Assessment protocol for agents to profile their strengths. During inference, the orchestrator dynamically selects the most compatible agents to maximize capability coverage. The framework consistently outperforms individual models and existing MAS baselines across five benchmarks, particularly achieving 96.00% and 77.91% accuracy on AIME24 and LiveCodeBench, respectively, surpassing homogeneous role-play baselines by significant margins.
研究旨在通过利用不同后训练模型的多样化专长来提升多代理系统(MAS)的表现。提出了Team-of-Thoughts,这是一种异构MAS框架,包括一个 orchestrator Calibration 组件来识别具有更强协调和综合能力的模型,以及一个代理自评估协议,让代理评估自己的领域专长以指导选择。在推理时,orchestrator 根据这些评估动态选择最兼容的代理以最大化能力覆盖。该框架在五个基准测试中均优于单一模型和现有MAS基线,分别在AIME24和LiveCodeBench上达到96.00%和77.91%的准确率,显著超越同质角色扮演基线。
The Free-Market Algorithm: Self-Organizing Optimization for Open-Ended Complex Systems
Authors: Martin Jaraiz
First: 2026-03-25T17:41:25+00:00 · Latest: 2026-03-25T17:41:25+00:00
Comments: 26 pages, 3 figures, 2 tables, draft
Abstract
We introduce the Free-Market Algorithm (FMA), a novel metaheuristic inspired by free-market economics. Unlike Genetic Algorithms, Particle Swarm Optimization, and Simulated Annealing -- which require prescribed fitness functions and fixed search spaces -- FMA uses distributed supply-and-demand dynamics where fitness is emergent, the search space is open-ended, and solutions take the form of hierarchical pathway networks. Autonomous agents discover rules, trade goods, open and close firms, and compete for demand with no centralized controller.
FMA operates through a three-layer architecture: a universal market mechanism (supply, demand, competition, selection), pluggable domain-specific behavioral rules, and domain-specific observation. The market mechanism is identical across applications; only the behavioral rules change.
FMA is validated in two unrelated domains. In prebiotic chemistry, starting from 900 bare atoms (C, H, O, N), FMA discovers all 12 feasible amino acid formulas, all 5 nucleobases, the formose sugar chain, and Krebs cycle intermediates in under 5 minutes on a laptop -- with up to 240 independent synthesis routes per product. In macroeconomic forecasting, reading a single input-output table with zero estimated parameters, FMA achieves a Mean Absolute Error of 0.42 percentage points for non-crisis GDP prediction, comparable to professional forecasters, and is portable to 33 countries.
Assembly Theory alignment shows that FMA provides the first explicit, tunable mechanism for the selection signatures described by Sharma et al. (Nature, 2023). The event-driven assembly dynamics resonate with foundational programs in physics -- causal set theory, relational quantum mechanics, constructor theory -- suggesting that Darwinian market dynamics may reflect a deeper organizational principle that leads to the unfolding of Nature itself.
中文标题/摘要
标题:自由市场算法:自组织优化方法论用于开放复杂系统
我们引入了自由市场算法(FMA),这是一种受自由市场经济启发的新型元启发式算法。与遗传算法、粒子群优化和模拟退火不同,FMA 不需要预设的适应度函数和固定的搜索空间,而是利用分布式供需动态,其中适应度是涌现的,搜索空间是开放的,解决方案表现为分层路径网络。自主代理发现规则、交易商品、开设和关闭企业,并在没有中央控制器的情况下竞争需求。
FMA 通过三层架构运行:通用市场机制(供应、需求、竞争、选择),可插拔的领域特定行为规则,以及领域特定观察。市场机制在所有应用中都是相同的;只有行为规则会改变。
在两个不相关的领域中得到了验证。在前生物化学中,从900个裸原子(C、H、O、N)开始,FMA 在不到5分钟的时间内(使用笔记本电脑)发现了所有12种可行的氨基酸公式、所有5种核苷酸、福尔马林糖链和克氏循环中间体,每种产品最多有240条独立合成路线。在宏观经济预测中,仅读取一个投入产出表,零估计参数,FMA 实现了非危机GDP预测的平均绝对误差为0.42个百分点,与专业预测者相当,并且可以移植到33个国家。
装配理论对齐表明,FMA 提供了 Sharma 等人(Nature, 2023)描述的选择特征的第一个明确、可调机制。事件驱动的装配动力学与物理学中的基础程序——因果集理论、关系量子力学、构造理论——相呼应,表明达尔文市场动力学可能反映了更深层次的组织原则,这种原则导致了自然本身的展开。
Summary / 总结
The Free-Market Algorithm (FMA) is a novel metaheuristic inspired by free-market economics, designed for open-ended complex systems. Unlike Genetic Algorithms, Particle Swarm Optimization, and Simulated Annealing, which require prescribed fitness functions and fixed search spaces, FMA uses distributed supply-and-demand dynamics in which autonomous agents trade and compete with no centralized controller. FMA was validated in two domains. In prebiotic chemistry, starting from 900 bare atoms (C, H, O, N), it discovered all 12 feasible amino acid formulas, all 5 nucleobases, the formose sugar chain, and Krebs cycle intermediates in under 5 minutes on a laptop. In macroeconomic forecasting, it achieved a Mean Absolute Error of 0.42 percentage points for non-crisis GDP prediction with zero estimated parameters, comparable to professional forecasters. The authors relate FMA to Assembly Theory and suggest that it may reflect a deeper organizational principle in nature.
自由市场算法(FMA)是一种受自由市场经济启发的新颖元启发式算法,适用于开放复杂系统。不同于传统算法需要固定的目标函数和搜索空间,FMA 使用分布式供需动态,通过自主代理进行交易和竞争,无需中央控制器。FMA 在两个领域得到了验证:从 900 个原子(C、H、O、N)合成了 12 种氨基酸、5 种核苷酸和代谢中间体,仅需 5 分钟;并在宏观经济预测中实现了与专业预测者相当的准确性,且参数极少。这与装配理论相契合,表明 FMA 可能反映了自然深层次的组织原则。
Knot-10: A Tightness-Stratified Benchmark for Real-World Knot Classification with Topological Difficulty Analysis
Authors: Shiheng Nie, Yunguang Yue
First: 2026-03-24T14:50:34+00:00 · Latest: 2026-03-25T17:39:26+00:00
Comments: 48 pages, 12 figures, 10 supplementary sections
Abstract
Physical knot classification is a fine-grained visual classification (FGVC) scenario in which appearance cues are deliberately suppressed: different classes share the same rope material, color, and background, and class identity resides primarily in crossing structure. We introduce the Knots-10 benchmark, comprising 1,440 images with a deployment-oriented split that trains on loosely tied knots and tests on tightly dressed ones. Swin-T and TransFG both average 97.2% accuracy; PMG scores 94.5%, consistent with the hypothesis that jigsaw shuffling disrupts crossing continuity. McNemar tests cannot separate four of the five general-purpose backbones, so small ranking margins should be interpreted with caution. A Mantel permutation test shows that topological distance significantly correlates with confusion patterns in three of the five models (p < 0.01). We propose TACA regularization, which improves embedding-topology alignment from rho=0.46 to rho=0.65 without improving classification accuracy; a random-distance ablation yields comparable alignment, indicating the benefit is likely driven by generic regularization. A pilot cross-domain test with 100 phone photographs reveals a 58-69 percentage-point accuracy drop, exposing rope appearance bias as the dominant failure mode.
中文标题/摘要
标题:Knot-10:基于拓扑难度分析的现实世界绳结分类紧致基准
物理绳结分类是细粒度视觉分类(FGVC)场景之一,在此场景中,外观线索被故意抑制:不同类别共享相同的绳索材料、颜色和背景,类别身份主要在于交叉结构。我们引入了Knots-10基准,包含1440张图像,并采用面向部署的划分方式,训练集包含松散打结的绳结,测试集包含紧密打结的绳结。Swin-T和TransFG的平均准确率均为97.2%;PMG得分为94.5%,这与假设拼图打乱会破坏交叉连续性是一致的。McNemar检验无法区分五种通用骨干网络中的四种,因此小排名差距应谨慎解释。Mantel排列检验显示,在五种模型中的三种中,拓扑距离与混淆模式显著相关(p < 0.01)。我们提出了TACA正则化,该方法在不提高分类准确率的情况下,将嵌入-拓扑对齐从ρ=0.46提高到ρ=0.65;随机距离消融实验得到相似的对齐效果,表明其益处可能由通用正则化驱动。一项使用100张手机照片的跨域测试显示,准确率下降了58-69个百分点,揭示了绳索外观偏差为主要失败模式。
Summary / 总结
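The research introduces a benchmark for classifying physical knots, where appearance cues are suppressed and class identity resides primarily in crossing structure. The Knots-10 benchmark includes 1,440 images with a split that trains on loosely tied knots and tests on tightly dressed ones. Swin-T and TransFG both average 97.2% accuracy, while PMG scores 94.5%, consistent with the hypothesis that jigsaw shuffling disrupts crossing continuity. TACA regularization improves embedding-topology alignment from rho=0.46 to rho=0.65 without improving accuracy, although a random-distance ablation yields comparable alignment. A pilot cross-domain test with 100 phone photographs shows a 58-69 percentage-point accuracy drop, exposing rope appearance bias as the dominant failure mode.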
研究旨在开发一个基于交叉结构的物理绳结分类基准,解决外观相似性带来的挑战。Knots-10基准包含1,440张图像,按绳结紧度分为训练集和测试集。Swin-T和TransFG等模型达到高准确率,而PMG因交叉连续性被破坏而表现较低。TACA正则化提高了嵌入-拓扑对齐,而跨域测试在手机照片上的表现显示,模型对绳子外观的依赖性,突显了需要解决外观偏差的问题。
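The Mantel permutation test mentioned in the abstract can be sketched as follows. This is a generic version that assumes Pearson correlation over the upper triangles of two symmetric distance matrices and a two-sided test. How the paper converts confusion counts into distances is not stated in the abstract, so the toy matrices below are purely illustrative.

```python
import numpy as np


def mantel_test(dist_a, dist_b, n_perm=999, seed=0):
    """Permutation test for correlation between two symmetric distance matrices."""
    a = np.asarray(dist_a, dtype=float)
    b = np.asarray(dist_b, dtype=float)
    n = a.shape[0]
    iu = np.triu_indices(n, k=1)            # upper triangle, diagonal excluded
    r_obs = np.corrcoef(a[iu], b[iu])[0, 1]

    rng = np.random.default_rng(seed)
    extreme = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        b_perm = b[np.ix_(p, p)]            # permute rows and columns together
        r = np.corrcoef(a[iu], b_perm[iu])[0, 1]
        if abs(r) >= abs(r_obs):
            extreme += 1
    p_value = (extreme + 1) / (n_perm + 1)  # observed statistic counts once
    return r_obs, p_value


# Toy data: ten classes on a line; the second matrix is a rescaled copy of the first.
x = np.arange(10, dtype=float)
topo_dist = np.abs(x[:, None] - x[None, :])
confusion_dist = 2.0 * topo_dist
r_obs, p_value = mantel_test(topo_dist, confusion_dist)
```

Permuting rows and columns jointly, rather than shuffling matrix entries independently, keeps each permuted matrix a valid distance matrix, which is what distinguishes a Mantel test from an ordinary correlation test.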
LensWalk: Agentic Video Understanding by Planning How You See in Videos
Authors: Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan
Venue: CVPR 2026
First: 2026-03-25T17:38:54+00:00 · Latest: 2026-03-25T17:38:54+00:00
Comments: To be published in CVPR 2026
Abstract
The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
中文标题/摘要
标题:LensWalk:通过规划如何观看视频实现自主视频理解
视频的密集和时间特性为自动化分析带来了巨大的挑战。尽管使用了强大的视觉-语言模型,现有的视频理解方法仍然受限于推理与感知之间的固有脱节:它们依赖于静态的、预先处理的信息,而不能在其理解过程中主动寻求视频中的原始证据。为了解决这一问题,我们提出了LensWalk,这是一种灵活的自主框架,使大型语言模型推理器能够主动控制其自身的视觉观察。LensWalk建立了一个紧密的推理-计划-观察循环,在每个步骤中,代理动态地指定其观察的视频的时间范围和采样密度。利用这些规范参数化的各种多功能视觉-语言模型工具,代理可以进行广泛的线索扫描,专注于特定段落进行事实提取,并从多个时刻拼接证据以实现整体验证。这种设计允许代理根据其不断发展的思维链进行渐进的、按需的证据收集。无需对任何模型进行微调,LensWalk在多个模型配方上实现了显著的即插即用性能提升,在具有挑战性的长视频基准测试LVBench和Video-MME上,其准确性提高了超过5%。我们的分析表明,使代理能够控制其如何观看是实现更准确、更稳健和更具可解释性的视频推理的关键。
Summary / 总结
LensWalk is designed to address the challenge of automated video understanding by allowing a Large Language Model to actively control its visual observations. It establishes a reason-plan-observe loop where the model dynamically specifies the temporal scope and sampling density of video observations. This enables progressive, on-demand evidence gathering that directly serves the model's evolving reasoning process. LensWalk improves the accuracy of various model recipes by over 5% on long-video benchmarks like LVBench and Video-MME without requiring any model fine-tuning.
LensWalk旨在通过让大型语言模型主动控制其视觉观察来解决自动视频理解的挑战。它建立了一个推理-计划-观察循环,模型可以在每一步动态指定视频观察的时间范围和采样密度。这使得可以逐步收集直接服务于模型不断演变的推理链的证据。LensWalk在LVBench和Video-MME等基准测试中提高了超过5%的准确性,而无需对模型进行微调。
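The idea of an observation parameterized by temporal scope and sampling density can be sketched as below. This is only an illustration: the function name, the uniform mid-bin sampling rule, and the example ranges are assumptions, not LensWalk's actual tool interface.

```python
def sample_timestamps(start_s, end_s, num_frames):
    """Uniformly spaced timestamps (bin centres) inside [start_s, end_s]."""
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    step = (end_s - start_s) / num_frames
    return [start_s + (i + 0.5) * step for i in range(num_frames)]


# A broad scan: few frames spread over a ten-minute video.
broad_scan = sample_timestamps(0, 600, 6)

# A focused look: more frames per second over a ten-second segment.
focused = sample_timestamps(120, 130, 5)
```

In this sketch the agent would choose `(start_s, end_s, num_frames)` at each step, so the same routine serves both sparse scans and dense inspection of a short segment.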
Two-Time-Scale Learning Dynamics: A Population View of Neural Network Training
Authors: Giacomo Borghi, Hyesung Im, Lorenzo Pareschi
First: 2026-03-20T09:48:45+00:00 · Latest: 2026-03-25T17:38:13+00:00
Abstract
Population-based learning paradigms, including evolutionary strategies, Population-Based Training (PBT), and recent model-merging methods, combine fast within-model optimisation with slower population-level adaptation. Despite their empirical success, a general mathematical description of the resulting collective training dynamics remains incomplete. We introduce a theoretical framework for neural network training based on two-time-scale population dynamics. We model a population of neural networks as an interacting agent system in which network parameters evolve through fast noisy gradient updates of SGD/Langevin type, while hyperparameters evolve through slower selection--mutation dynamics. We prove the large-population limit for the joint distribution of parameters and hyperparameters and, under strong time-scale separation, derive a selection--mutation equation for the hyperparameter density. For each fixed hyperparameter, the fast parameter dynamics relaxes to a Boltzmann--Gibbs measure, inducing an effective fitness for the slow evolution. The averaged dynamics connects population-based learning with bilevel optimisation and classical replicator--mutator models, yields conditions under which the population mean moves toward the fittest hyperparameter, and clarifies the role of noise and diversity in balancing optimisation and exploration. Numerical experiments illustrate both the large-population regime and the reduced two-time-scale dynamics, and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.
中文标题/摘要
标题:双时间尺度学习动力学:基于群体的神经网络训练的群体视角
基于群体的学习范式,包括进化策略、群体基于训练(PBT)以及最近的模型合并方法,结合了快速的模型内部优化与较慢的群体层面适应。尽管它们在经验上取得了成功,但对由此产生的集体训练动力学的通用数学描述仍然不完整。我们引入了一种基于双时间尺度群体动力学的神经网络训练理论框架。我们将神经网络群体建模为一个相互作用的代理系统,在该系统中,网络参数通过快速的SGD/ Langevin类型噪声梯度更新快速演化,而超参数则通过较慢的选择-突变动力学缓慢演化。我们证明了参数和超参数联合分布的大群体极限,并在强时间尺度分离下推导出超参数密度的选择-突变方程。对于每个固定的超参数,快速参数动力学收敛到玻尔兹曼-吉布斯测度,从而诱导出慢演化中的有效适应度。平均动力学将基于群体的学习与二阶优化和经典的复制-突变模型联系起来,并给出了群体均值向最适应超参数移动的条件,以及噪声和多样性在优化和探索之间平衡中的作用。数值实验既说明了大群体的极限,也说明了简化后的双时间尺度动力学,并表明对有效适应度的访问,无论是以闭式形式还是通过群体级估计,都可以改善群体级更新。
Summary / 总结
This paper introduces a theoretical framework for understanding the training dynamics of neural networks using two-time-scale population dynamics. It models a population of neural networks in which parameters evolve through fast noisy gradient updates, while hyperparameters evolve through slower selection-mutation dynamics. The authors prove the large-population limit and, under strong time-scale separation, derive a selection-mutation equation for the hyperparameter density, yielding conditions under which the population mean moves toward the fittest hyperparameter. Numerical experiments illustrate both regimes and indicate that access to the effective fitness, either in closed form or through population-level estimation, can improve population-level updates.
论文提出了一种基于两时间尺度群体动力学的神经网络训练动力学理论框架。该模型将神经网络群体分为参数通过快速噪声梯度更新快速变化和超参数通过较慢的选择-突变动态缓慢变化两个部分。该框架证明了大群体极限,并推导出超参数密度的选择-突变方程,表明在强时间尺度分离下,群体均值会趋向于最优的超参数。数值实验展示了该方法在提高群体级更新效果方面的有效性。
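The fast-slow structure described in the abstract can be written schematically as follows. The notation (parameters \theta, hyperparameter h, loss L, inverse temperature \beta, fitness f) is assumed here for illustration and may differ from the paper's. The first two lines are the standard overdamped Langevin dynamics and its Boltzmann-Gibbs stationary density; the third is one natural way an averaged, effective fitness could be defined.

```latex
\begin{align}
  d\theta_t &= -\nabla_\theta L(\theta_t; h)\,dt + \sqrt{2\beta^{-1}}\,dW_t, \\
  \rho_h(\theta) &= \frac{1}{Z_h}\exp\bigl(-\beta L(\theta; h)\bigr), \qquad
  Z_h = \int \exp\bigl(-\beta L(\theta; h)\bigr)\,d\theta, \\
  F(h) &= \int f(\theta, h)\,\rho_h(\theta)\,d\theta .
\end{align}
```

Under strong time-scale separation, the slow selection-mutation dynamics would then see only F(h) rather than the individual parameter trajectories.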
Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents
Authors: Samuel Taiwo, Mohd Amaluddin Yusoff
Venue: Computer Science & Information Technology (CS & IT), pp. 49-67, 2026
First: 2026-03-25T17:35:24+00:00 · Latest: 2026-03-25T17:35:24+00:00
Comments: Presented at CCSEIT 2026. This version matches the published proceedings
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality. This paper presents an empirical study quantifying performance differences across four chunking strategies: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. We evaluated these methods using a proprietary corpus of oil and gas enterprise documents, including text-heavy manuals, table-heavy specifications, and piping and instrumentation diagrams (P&IDs). Our findings show that structure-aware chunking yields higher overall retrieval effectiveness, particularly in top-K metrics, and incurs significantly lower computational costs than semantic or baseline strategies. Crucially, all four methods demonstrated limited effectiveness on P&IDs, underscoring a core limitation of purely text-based RAG within visually and spatially encoded documents. We conclude that while explicit structure preservation is essential for specialised domains, future work must integrate multimodal models to overcome current limitations.
中文标题/摘要
标题:评估石油和天然气企业文档中检索增强生成的分块策略
检索增强生成(RAG)已成为解决大型语言模型(LLMs)限制的框架。然而,其有效性在很大程度上取决于文档分块——一个经常被忽视的质量决定因素。本文通过实证研究量化了四种分块策略之间的性能差异:固定大小滑动窗口、递归、断点基于语义和结构感知。我们使用了石油和天然气企业文档的专有语料库进行评估,包括文本密集的手册、表格密集的规范以及管道和仪表图(P和IDs)。我们的研究结果表明,结构感知分块在总体检索效果上优于其他方法,特别是在前K项指标上,并且在计算成本上远低于语义或基线策略。至关重要的是,所有四种方法在P和IDs上都表现出有限的效果,突显了纯文本RAG在视觉和空间编码文档中的核心局限性。我们得出结论,虽然明确的结构保留对于专门领域至关重要,但未来的工作必须结合多模态模型以克服当前的局限性。
Summary / 总结
This paper evaluates four chunking strategies for Retrieval-Augmented Generation (RAG) in oil and gas enterprise documents: fixed-size sliding window, recursive, breakpoint-based semantic, and structure-aware. Using a proprietary corpus, the study finds that structure-aware chunking yields higher overall retrieval effectiveness, especially in top-K metrics, at significantly lower computational cost than semantic or baseline strategies. However, all four methods show limited effectiveness on piping and instrumentation diagrams (P&IDs), highlighting the need for multimodal models when documents are visually and spatially encoded.
本文评估了四种用于油和气企业文档的检索增强生成(RAG)的分块策略,包括固定大小滑动窗口、递归、断点基于语义和结构感知方法。使用专有语料库进行研究,发现结构感知分块在检索效果上有所提升,尤其是在top-K指标上表现更好,并且比语义或基线策略更节省计算成本。然而,所有方法在管道和仪表图上的效果有限,这突显了在视觉和空间编码文档中需要集成多模态模型的必要性。
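Of the four strategies compared, the fixed-size sliding window is the simplest to state precisely. The sketch below assumes whitespace-split words as tokens and a fixed overlap between consecutive chunks. The function name and the chunk size are illustrative and do not reflect the settings used in the paper.

```python
def sliding_window_chunks(tokens, size, overlap):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):   # last window already reaches the end
            break
    return chunks


words = "a b c d e f g h i j".split()
chunks = sliding_window_chunks(words, size=4, overlap=2)
```

Because the window ignores headings, tables, and section boundaries, a chunk can cut through a table row or a numbered procedure, which is the behaviour that structure-aware chunking is meant to avoid.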
Quantification and object perception in Multimodal Large Language Models and human linguistic cognition
Authors: Raquel Montero, Natalia Moskvina, Paolo Morosi, Tamara Serrano, Elena Pagliarini, Evelina Leivada
First: 2025-11-11T11:30:21+00:00 · Latest: 2026-03-25T17:29:07+00:00
Abstract
Quantification has been proven to be a particularly difficult linguistic phenomenon for (Multimodal) Large Language Models (MLLMs). However, given that quantification interfaces with the logic, pragmatic, and numerical domains, the exact reasons for the poor performance are still unclear. This paper looks at three key features of human quantification shared cross-linguistically that have remained so far unexplored in the (M)LLM literature: the ordering of quantifiers into scales, the ranges of use and prototypicality, and the biases inherent in the human approximate number system. The aim is to determine how these features are encoded in the models' architecture, how they may differ from humans, and whether the results are affected by the type of model (thinking vs. instruct) and the language under investigation. Results show that although thinking models showed a high accuracy in the numerosity estimation task and in the organization of quantifiers into scales, there are still key differences between humans and LLMs across all model types, particularly in terms of ranges of use and prototypicality values. This work, thus, paves the way for addressing the nature of MLLMs as semantic and pragmatic agents, while the cross-linguistic lens can elucidate whether their abilities are robust and stable across different languages.
中文标题/摘要
标题:多模态大型语言模型和人类语言认知中的量化与物体感知
量化已被证明是(多模态)大型语言模型(MLLMs)处理的一个特别困难的语义现象。然而,由于量化与逻辑、语用和数值领域相关,其表现不佳的确切原因仍然不清楚。本文探讨了跨语言共享的三个关于人类量化的关键特征,这些特征在(M)LLM文献中尚未被探索:量词的排序尺度、使用范围和原型性,以及人类近似数系统中固有的偏见。目标是确定这些特征如何编码在模型的架构中,它们与人类有何不同,以及结果是否受到模型类型(思考型 vs. 指令型)和研究语言的影响。结果表明,尽管思考型模型在数量估计任务和量词排序尺度上表现出高准确性,但在使用范围和原型性值方面,人类和LLM之间仍然存在关键差异。因此,这项工作为探讨MLLMs作为语义和语用代理的本质铺平了道路,而跨语言视角可以阐明它们的能力是否在不同语言中稳健且稳定。
Summary / 总结
This paper investigates why Multimodal Large Language Models (MLLMs) struggle with quantification, a complex linguistic phenomenon. It explores three key human quantification features: quantifier scales, use ranges, and biases from the approximate number system. The study finds that while thinking models perform well in numerosity estimation and quantifier scale organization, they still differ from humans in use ranges and prototypicality values, highlighting the need for MLLMs to better understand semantics and pragmatics.
本文研究了多模态大型语言模型(MLLMs)在处理量化这一复杂语言现象时面临的挑战。研究探讨了人类量化中的三个关键特征——量词的排序、使用范围以及近似数系统中的偏见——这些特征在MLLM文献中尚未得到充分关注。研究发现,虽然思考型模型在数量估计和量词规模组织方面表现良好,但在使用范围和原型值方面,人类和MLLM之间仍存在显著差异。研究强调了MLLM需要更好地模仿人类的语义和语用能力的必要性。
Relationship-Aware Safety Unlearning for Multimodal LLMs
Authors: Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo
First: 2026-03-15T02:22:26+00:00 · Latest: 2026-03-25T17:27:14+00:00
Comments: 9 pages, 4 figures
Abstract
Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations. We propose relationship-aware safety unlearning: a framework that explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations. We include CLIP-based experiments and robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.
中文标题/摘要
标题:关系感知的安全遗忘技术用于多模态LLMs
生成型多模态模型可能会表现出固有的关系性安全故障:两个原本无害的概念在特定动作或关系的连接下变得不安全(例如,孩子-饮酒-葡萄酒)。现有的遗忘和概念擦除方法通常针对孤立的概念或图像-文本对,这可能会对相同对象和关系的良性使用造成附带损害。我们提出了一种关系感知的安全遗忘框架:该框架明确表示不安全的对象-关系-对象(O-R-O)元组,并应用目标化的参数高效编辑(LoRA)来抑制不安全的元组同时保留对象边缘和安全的邻近关系。我们包括基于CLIP的实验以及在同义词、上下文和分布外图像攻击下的鲁棒性评估。
Summary / 总结
The research addresses relational safety failures in generative multimodal models, where two benign concepts can become unsafe when linked by a specific action or relation. The proposed framework, relationship-aware safety unlearning, explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress them while preserving object marginals and safe neighboring relations. The paper includes CLIP-based experiments and a robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.
研究旨在解决由于特定关系情境(如孩子喝酒)导致的多模态生成模型的安全问题。现有去学习方法通常会删除孤立的概念,导致意外的副作用。研究提出了一种关系感知的安全去学习框架,该框架针对不安全的对象-关系-对象三元组使用参数高效编辑(LoRA)来抑制不安全的情境,同时保持相关情境和对象边际的安全性。实验表明,该方法在各种攻击场景下有效缓解了安全问题,且不会造成意外损害。
Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection
Authors: Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin, Vladimir Franzuela Cardenas, Farrokh Alemi, Kevin Lybarger
First: 2026-02-11T21:53:18+00:00 · Latest: 2026-03-25T17:20:58+00:00
Abstract
Objective: This paper introduces a patient simulator for scalable, automated evaluation of healthcare conversational agents, generating realistic, controllable interactions that systematically vary across medical, linguistic, and behavioral dimensions to support risk assessment across populations. Methods: Grounded in the NIST AI Risk Management Framework, the simulator integrates three profile components: (1) medical profiles constructed from All of Us electronic health records using risk-ratio gating; (2) linguistic profiles modeling health literacy and condition-specific communication; and (3) behavioral profiles representing cooperative, distracted, and adversarial engagement. Profiles were evaluated against NIST AI RMF trustworthiness requirements and assessed against an AI Decision Aid for antidepressant selection. Results: Across 500 simulated conversations, the simulator revealed monotonic degradation in AI Decision Aid performance across health literacy levels: Rank-1 concept retrieval ranged from 47.6% (limited) to 81.9% (proficient), with corresponding recommendation degradation. Medical concept fidelity was high (96.6% across 8,210 concepts), validated by human annotators (0.73 kappa) and an LLM judge with comparable agreement (0.78 kappa). Behavioral profiles were reliably distinguished (0.93 kappa), and linguistic profiles showed moderate agreement (0.61 kappa). Conclusions: The simulator exposes measurable performance risks in conversational healthcare AI. Health literacy emerged as a primary risk factor with direct implications for equitable AI deployment.
中文标题/摘要
标题:通过患者模拟提升AI可信度:抗抑郁药物选择对话代理的风险评估
目的:本文介绍了一种患者模拟器,用于大规模、自动评估医疗对话代理,生成具有现实性和可控性的交互,系统地在医学、语言和行为维度上变化,以支持不同人群的风险评估。方法:基于NIST AI风险管理框架,模拟器整合了三个配置文件组件:(1)使用风险比筛选从All of Us电子健康记录构建的医学配置文件;(2)建模健康素养和条件特定沟通的语言配置文件;(3)代表合作、分心和对抗性参与的行为配置文件。配置文件根据NIST AI RMF可信度要求进行了评估,并与抗抑郁药物选择的人工智能决策辅助工具进行了评估。结果:在500次模拟对话中,模拟器揭示了AI决策辅助工具性能随健康素养水平单调下降:排名1概念检索率从有限的47.6%下降到熟练的81.9%,相应的推荐质量也下降。医学概念保真度高(8,210个概念的96.6%),由人类注释者(0.73 kappa)和具有类似一致性的LLM裁判(0.78 kappa)验证。行为配置文件可靠区分(0.93 kappa),语言配置文件显示中等一致(0.61 kappa)。结论:模拟器揭示了对话医疗AI的可测量性能风险。健康素养成为主要风险因素,直接影响公平的AI部署。
Summary / 总结
This paper presents a patient simulator to evaluate healthcare conversational agents, focusing on risk assessment for antidepressant selection. The simulator integrates medical, linguistic, and behavioral profiles to simulate realistic interactions. Across 500 simulated conversations, AI Decision Aid performance degraded monotonically with lower health literacy, with Rank-1 concept retrieval ranging from 47.6% (limited) to 81.9% (proficient). Medical concept fidelity was high (96.6% across 8,210 concepts), behavioral profiles were reliably distinguished (0.93 kappa), and linguistic profiles showed moderate agreement (0.61 kappa). Health literacy emerged as a primary risk factor for equitable AI deployment in healthcare.
本文介绍了一种患者模拟器,用于评估医疗对话代理,在抗抑郁药物选择中集成医学、语言和行为档案以评估AI性能。该模拟器基于NIST AI风险管理框架,结果显示AI决策辅助工具的性能随健康素养降低而下降,从47.6%到81.9%不等。医学概念的高保真度得到了人类和LLM评判者的验证,而行为和语言档案分别表现出较高的可靠区分度和中等程度的一致性。
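The agreement figures in this entry are reported as kappa values. The abstract does not say which variant was used, so the sketch below assumes Cohen's kappa for two raters. The labels and ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy ratings of six simulated conversations by a human and an LLM judge.
human = ["cooperative", "cooperative", "distracted", "adversarial", "adversarial", "distracted"]
judge = ["cooperative", "cooperative", "distracted", "adversarial", "distracted", "distracted"]
kappa = cohens_kappa(human, judge)
```

Here the raters agree on five of six items (observed agreement 5/6) against a chance level of 1/3, giving a kappa of 0.75.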
Analysing the Safety Pitfalls of Steering Vectors
Authors: Yuxiao Li, Alina Fastowski, Efstratios Zaradoukas, Bardh Prenkaj, Gjergji Kasneci
First: 2026-03-25T17:16:11+00:00 · Latest: 2026-03-25T17:16:11+00:00
Abstract
Activation steering has emerged as a powerful tool to shape LLM behavior without the need for weight updates. While its inherent brittleness and unreliability are well-documented, its safety implications remain underexplored. In this work, we present a systematic safety audit of steering vectors obtained with Contrastive Activation Addition (CAA), a widely used steering approach, under a unified evaluation protocol. Using JailbreakBench as the benchmark, we show that steering vectors consistently influence the success rate of jailbreak attacks, with stronger amplification under simple template-based attacks. Across LLM families and sizes, steering the model in specific directions can drastically increase (up to 57%) or decrease (up to 50%) its attack success rate (ASR), depending on the targeted behavior. We attribute this phenomenon to the overlap between the steering vectors and the latent directions of refusal behavior. Thus, we offer a traceable explanation for this discovery. Together, our findings reveal the previously unobserved origin of this safety gap in LLMs, highlighting a trade-off between controllability and safety.
中文标题/摘要
标题:分析引导向量的安全陷阱
对比激活添加(CAA)已成为一种强大的工具,无需权重更新即可塑造LLM的行为。尽管其固有的脆弱性和不可靠性已得到充分记录,但其安全性影响仍被忽视。在本研究中,我们使用统一的评估协议对使用对比激活添加(CAA)广泛使用的引导方法获得的引导向量进行了系统性安全审计。使用JailbreakBench作为基准,我们展示了引导向量一致地影响越狱攻击的成功率,简单模板攻击下的放大效应尤为明显。在不同LLM家族和规模下,将模型引导到特定方向可以显著增加(高达57%)或减少(高达50%)其攻击成功率(ASR),具体取决于目标行为。我们将这一现象归因于引导向量与拒绝行为的潜在方向之间的重叠。因此,我们提供了这一发现的可追溯解释。总之,我们的研究揭示了LLM中此前未被观察到的安全缺口的起源,突显了可控性和安全性之间的权衡。
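Contrastive Activation Addition, as commonly described, builds a steering vector from the mean difference between activations on contrastive prompt sets and adds a scaled copy of it to a hidden state at inference. The sketch below uses random arrays in place of real model activations. The "refusal direction" and the cosine-overlap check are illustrative assumptions about how overlap might be measured, not the paper's procedure.

```python
import numpy as np


def caa_steering_vector(pos_acts, neg_acts):
    """Mean activation difference between positive and negative prompt sets."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)


def steer(hidden, vector, alpha):
    """Add the scaled steering vector to a hidden state."""
    return hidden + alpha * vector


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
dim = 4
# Stand-ins for layer activations on 8 positive and 8 negative prompts.
pos_acts = rng.normal(size=(8, dim)) + np.array([1.0, 0.0, 0.0, 0.0])
neg_acts = rng.normal(size=(8, dim))

vector = caa_steering_vector(pos_acts, neg_acts)
steered = steer(np.zeros(dim), vector, alpha=2.0)

# Hypothetical refusal direction; a large |cosine| would indicate overlap.
refusal_direction = np.array([1.0, 0.0, 0.0, 0.0])
overlap = cosine(vector, refusal_direction)
```

If a steering vector has a sizeable component along a direction that mediates refusal, adding it with a positive or negative `alpha` would shift refusal behaviour as a side effect, which matches the explanation the abstract gives for the change in attack success rate.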
SEGAR: Selective Enhancement for Generative Augmented Reality
Authors: Fanjun Bu, Chenyang Yuan, Hiroshi Yasuda
First: 2026-03-25T17:15:45+00:00 · Latest: 2026-03-25T17:15:45+00:00
Abstract
Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they enable temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames with region-specific edits while preserving others, and the correction stage subsequently aligns safety-critical regions with real-world observations while preserving intended augmentations elsewhere. We demonstrate this pipeline in driving scenarios as a representative setting where semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.
中文标题/摘要
标题:SEGAR:选择性增强生成增强现实
生成的世界模型为增强现实(AR)应用提供了令人信服的基础:通过预测包含故意视觉编辑的未来图像序列,它们能够生成时空连贯的增强未来帧,这些帧可以在实时计算之前预先计算并缓存,从而避免每次帧都需要从头开始实时渲染。在本文中,我们提出了SEGAR,这是一种初步框架,结合了基于扩散的世界模型和一个选择性修正阶段,以支持这一愿景。世界模型生成具有区域特定编辑的增强未来帧,同时保留其他部分,而修正阶段随后将关键安全区域与现实世界观察结果对齐,同时在其他地方保留预期的增强效果。我们通过驾驶场景展示了这一管道,作为具有明确语义区域结构和实时反馈的代表性设置。我们认为这是生成的世界模型作为实用AR基础设施的早期步骤,其中未来帧可以按需生成、缓存和选择性修正。
CliPPER: Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition
Authors: Florian Stilz, Vinkle Srivastav, Nassir Navab, Nicolas Padoy
First: 2026-03-25T17:14:36+00:00 · Latest: 2026-03-25T17:14:36+00:00
Abstract
Video-language foundation models have proven to be highly effective in zero-shot applications across a wide range of tasks. A particularly challenging area is the intraoperative surgical procedure domain, where labeled data is scarce, and precise temporal understanding is often required for complex downstream tasks. To address this challenge, we introduce CliPPER (Contextual Video-Language Pretraining on Long-form Intraoperative Surgical Procedures for Event Recognition), a novel video-language pretraining framework trained on surgical lecture videos. Our method is designed for fine-grained temporal video-text recognition and introduces several novel pretraining strategies to improve multimodal alignment in long-form surgical videos. Specifically, we propose Contextual Video-Text Contrastive Learning (VTC_CTX) and Clip Order Prediction (COP) pretraining objectives, both of which leverage temporal and contextual dependencies to enhance local video understanding. In addition, we incorporate a Cycle-Consistency Alignment over video-text matches within the same surgical video to enforce bidirectional consistency and improve overall representation coherence. Moreover, we introduce a more refined alignment loss, Frame-Text Matching (FTM), to improve the alignment between video frames and text. As a result, our model establishes a new state-of-the-art across multiple public surgical benchmarks, including zero-shot recognition of phases, steps, instruments, and triplets. The source code and pretraining captions can be found at https://github.com/CAMMA-public/CliPPER.
中文标题/摘要
标题:CliPPER:针对长时手术过程的视频-语言预训练框架以识别事件
视频-语言基础模型在广泛的任务中已被证明具有高度有效性。特别具有挑战性的领域是手术过程领域,其中标注数据稀缺,且往往需要精确的时间理解以完成复杂的下游任务。为应对这一挑战,我们引入了CliPPER(针对长时手术过程的视频-语言预训练框架以识别事件),一种新型的视频-语言预训练框架,该框架基于手术讲座视频进行训练。我们的方法旨在进行细粒度的时间视频-文本识别,并引入了多种新的预训练策略以提高长时手术视频中的多模态对齐。具体来说,我们提出了上下文视频-文本对比学习(VTC_CTX)和剪辑顺序预测(COP)预训练目标,两者都利用了时间上下文依赖性以增强局部视频理解。此外,我们引入了视频-文本匹配的循环一致性对齐,以在同一个手术视频内增强双向一致性并提高整体表示的一致性。此外,我们引入了更精细的对齐损失,帧-文本匹配(FTM),以提高视频帧与文本之间的对齐。因此,我们的模型在多个公开的手术基准测试中建立了新的最佳水平,包括零样本识别阶段、步骤、器械和三元组。源代码和预训练字幕可在https://github.com/CAMMA-public/CliPPER获取。
Summary / 总结
The research aims to address the challenge of precise temporal understanding in the intraoperative surgical procedure domain, where labeled data is scarce. The authors introduce CliPPER, a video-language pretraining framework that leverages surgical lecture videos for fine-grained temporal video-text recognition. Key pretraining strategies include Contextual Video-Text Contrastive Learning (VTC_CTX), Clip Order Prediction (COP), and Frame-Text Matching (FTM) to enhance multimodal alignment. The model outperforms existing methods on multiple public surgical benchmarks, achieving state-of-the-art results in zero-shot recognition of phases, steps, instruments, and triplets.
研究旨在解决手术过程领域中精确的时间理解挑战,该领域标注数据稀缺。方法CliPPER引入了多种新颖的预训练策略,包括上下文视频-文本对比学习和剪辑顺序预测,以增强长视频中的多模态对齐。该模型在多个公开的手术基准测试中实现了零样本识别阶段、步骤、器械和三元组的新最佳性能。
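For the CliPPER entry above, the general idea behind video-text contrastive pretraining can be sketched with a generic symmetric contrastive (InfoNCE-style) loss. This is a minimal illustration, not the paper's VTC_CTX objective: the function names, the temperature value, and the random embeddings are all assumptions made for the example.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic video-text contrastive loss; row i of each input is a matched pair.

    The temperature of 0.07 is an assumed, commonly used default.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature            # (N, N) scaled cosine similarities
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Toy check with random embeddings (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
aligned_loss = symmetric_contrastive_loss(emb, emb)         # every pair matches
shuffled_loss = symmetric_contrastive_loss(emb, emb[::-1])  # every pair mismatched
```

Correctly paired embeddings should give a lower loss than mismatched ones, which is the signal a contrastive objective optimizes.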
Robust Multilingual Text-to-Pictogram Mapping for Scalable Reading Rehabilitation
Authors: Soufiane Jhilal, Martina Galletti
First: 2026-03-25T17:12:14+00:00 · Latest: 2026-03-25T17:12:14+00:00
Abstract
Reading comprehension presents a significant challenge for children with Special Educational Needs and Disabilities (SEND), often requiring intensive one-on-one reading support. To assist therapists in scaling this support, we developed a multilingual, AI-powered interface that automatically enhances text with visual scaffolding. This system dynamically identifies key concepts and maps them to contextually relevant pictograms, supporting learners across languages. We evaluated the system across five typologically diverse languages (English, French, Italian, Spanish, and Arabic), through multilingual coverage analysis, expert clinical review by speech therapists and special education professionals, and latency assessment. Evaluation results indicate high pictogram coverage and visual scaffolding density across the five languages. Expert audits suggested that automatically selected pictograms were semantically appropriate, with combined correct and acceptable ratings exceeding 95% for the four European languages and approximately 90% for Arabic despite reduced pictogram repository coverage. System latency remained within interactive thresholds suitable for real-time educational use. These findings support the technical viability, semantic safety, and acceptability of automated multimodal scaffolding to improve accessibility for neurodiverse learners.
中文标题/摘要
标题:稳健的多语言文本到图示映射技术以实现可扩展的阅读康复
阅读理解对有特殊教育需要和障碍(SEND)的儿童来说是一个重大挑战,通常需要密集的一对一阅读支持。为了帮助治疗师扩大这种支持,我们开发了一个多语言、基于人工智能的界面,该界面能够自动为文本添加视觉辅助。该系统动态识别关键概念,并将它们映射到上下文相关的图示,支持跨语言的学习者。我们通过多语言覆盖分析、言语治疗师和特殊教育专业人士的专家临床审查以及延迟评估,在五种类型学上不同的语言(英语、法语、意大利语、西班牙语和阿拉伯语)上评估了该系统。评估结果表明,该系统在五种语言中具有高图示覆盖率和视觉辅助密度。专家审查表明,自动选择的图示在语义上是合适的,对于四种欧洲语言,正确和可接受的评分超过95%,尽管阿拉伯语由于图示资源库覆盖不足,评分约为90%。系统延迟保持在适合实时教育使用的交互阈值内。这些发现支持自动多模态辅助技术的技术可行性、语义安全性和可接受性,以提高神经多样性学习者的可访问性。
Summary / 总结
The research aims to address the challenge of reading comprehension for children with SEND by developing an AI-powered multilingual interface that enhances text with pictograms. The system dynamically maps key concepts to relevant pictograms, supporting learners across five languages. Evaluation through multilingual coverage analysis, clinical review, and latency assessment showed high pictogram coverage and visual scaffolding density, with expert audits indicating semantically appropriate pictograms and system latency suitable for real-time educational use.
研究旨在通过开发一种多语言AI辅助界面来解决特殊教育需求(SEND)儿童的阅读理解挑战,该界面通过视觉辅助增强文本。系统动态地将关键概念映射到上下文相关的图示,覆盖五种语言(英语、法语、意大利语、西班牙语和阿拉伯语)。通过多语言覆盖率分析、专家临床审查和延迟评估,结果显示高图示覆盖率和视觉辅助密度,欧洲语言的正确和可接受评分超过95%,阿拉伯语约为90%,尽管图示资源库覆盖率较低。系统的延迟在互动阈值内,适合实时教育使用,支持其技术可行性和对神经多样性学习者的接受性。
Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding
Authors: Conrad Borchers, Jiayi Zhang, Ashish Gurung
First: 2026-03-25T17:11:56+00:00 · Latest: 2026-03-25T17:11:56+00:00
Comments: Accepted as short paper to the 27th International Conference on Artificial Intelligence in Education (AIED 2026)
Abstract
Adaptive scaffolding enhances learning, yet the field lacks robust methods for measuring it within authentic tutoring dialogue. This gap has become more pressing with the rise of remote human tutoring and large language model-based systems. We introduce an embedding-based approach that analyzes scaffolding dynamics by aligning the semantics of dialogue turns, problem statements, and correct solutions. Specifically, we operationalize alignment by computing cosine similarity between tutor and student contributions and task-relevant content. We apply this framework to 1,576 real-world mathematics tutoring dialogues from the Eedi Question Anchored Tutoring Dialogues dataset. The analysis reveals systematic differences in task alignment and distinct temporal patterns in how participants ground their contributions in problem and solution content. Further, mixed-effects models show that role-specific semantic alignment predicts tutorial progression beyond baseline features such as message order and length. Tutor contributions exhibited stronger grounding in problem content early in interactions. In contrast, student solution alignment was modestly positively associated with progression. These findings support scaffolding as a continuous, role-sensitive process grounded in task semantics. By capturing role-specific alignment over time, this approach provides a principled method for analyzing instructional dialogue and evaluating conversational tutoring systems.
中文标题/摘要
标题:基于表示学习研究辅导支架的时间动态
自适应支架可以增强学习,但领域内缺乏在真实辅导对话中衡量其效果的稳健方法。随着远程人类辅导和基于大型语言模型的系统的兴起,这一差距变得更为紧迫。我们提出了一种基于嵌入的方法,通过对对话轮次、问题陈述和正确解答的语义进行对齐来分析支架动态。具体而言,我们通过计算辅导者和学生贡献与任务相关内容之间的余弦相似度来实现对齐。我们应用此框架分析了来自Eedi问题锚定辅导对话数据集的1,576个真实世界数学辅导对话。分析揭示了任务对齐的系统性差异以及参与者如何在问题和解决方案内容中定位其贡献的明显时间模式。进一步的混合效应模型表明,角色特定的语义对齐可以预测辅导进程,超越了消息顺序和长度等基线特征。辅导者的贡献在互动初期更加强调对问题内容的定位。相比之下,学生解决方案的对齐与进程呈轻微正相关。这些发现支持支架作为连续、角色敏感的过程,其基础在于任务语义。通过捕捉时间上的角色特定对齐,这种方法为分析教学对话和评估对话式辅导系统提供了一种原则性的方法。
Summary / 总结
This study addresses the lack of robust methods for measuring adaptive scaffolding in authentic tutoring dialogue, a gap made more pressing by remote human tutoring and LLM-based systems. It introduces an embedding-based approach that computes cosine similarity between tutor and student turns and task-relevant content (problem statements and correct solutions). Applied to 1,576 real-world mathematics tutoring dialogues, the analysis finds that tutor contributions are more strongly grounded in problem content early in interactions, while student solution alignment is modestly positively associated with tutorial progression. Mixed-effects models show that role-specific semantic alignment predicts progression beyond baseline features such as message order and length.
该研究旨在解决在远程辅导环境中评估适应性支架方法的不足,引入了一种基于嵌入的方法来分析支架动态,通过对对话轮次、问题陈述和正确解答的语义进行对齐。对1,576个实际数学辅导对话的分析显示,辅导者在互动初期更倾向于将对话内容与问题内容对齐,而学生解答的对齐与辅导进程呈正相关。混合效应模型表明,角色特定的语义对齐比消息顺序和长度更能预测辅导进程。
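For the tutoring entry above, the alignment measure (cosine similarity between each dialogue turn and the problem or solution text) can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for whatever sentence encoder the authors used, and the vocabulary and dialogue are invented for the example.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def toy_embed(text, vocab):
    """Hypothetical stand-in for a sentence encoder: bag-of-words counts."""
    tokens = text.lower().split()
    return np.array([tokens.count(w) for w in vocab], dtype=float)

def alignment_by_turn(turns, problem, solution, vocab):
    """Score every (role, text) turn against the problem and the solution."""
    p_vec = toy_embed(problem, vocab)
    s_vec = toy_embed(solution, vocab)
    rows = []
    for index, (role, text) in enumerate(turns):
        t_vec = toy_embed(text, vocab)
        rows.append({
            "turn": index,
            "role": role,
            "problem_alignment": cosine_similarity(t_vec, p_vec),
            "solution_alignment": cosine_similarity(t_vec, s_vec),
        })
    return rows

# Invented example dialogue (not from the Eedi dataset).
vocab = ["area", "rectangle", "length", "width", "multiply", "twelve"]
problem = "find the area of a rectangle with length four and width three"
solution = "multiply length by width to get twelve"
turns = [
    ("tutor", "what is the length and width of the rectangle"),
    ("student", "i multiply them and get twelve"),
]
rows = alignment_by_turn(turns, problem, solution, vocab)
```

Keeping the role and turn index alongside each score is what would let a downstream model relate role-specific alignment to position in the dialogue.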
UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
Authors: Zichuan Lin, Feiyu Liu, Yijun Yang, Jiafei Lyu, Yiming Gao, Yicheng Liu, Zhicong Lu, Yangbin Yu, Mingyu Yang, Junyou Li, Deheng Ye, Jie Jiang
First: 2026-03-25T17:10:29+00:00 · Latest: 2026-03-25T17:10:29+00:00
Comments: Code and models are available at https://github.com/ui-voyager/UI-Voyager
Abstract
Autonomous mobile GUI agents have attracted increasing attention along with the advancement of Multimodal Large Language Models (MLLMs). However, existing methods still suffer from inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards for long-horizon GUI tasks. To that end, we propose UI-Voyager, a novel two-stage self-evolving mobile GUI agent. In the first stage, we employ Rejection Fine-Tuning (RFT), which enables the continuous co-evolution of data and models in a fully autonomous loop. The second stage introduces Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and constructs dense step-level supervision from successful trajectories to correct failed ones. Extensive experiments on AndroidWorld show that our 4B model achieves an 81.0% Pass@1 success rate, outperforming numerous recent baselines and exceeding human-level performance. Ablation and case studies further verify the effectiveness of GRSD. Our method represents a significant leap toward efficient, self-evolving, and high-performance mobile GUI automation without expensive manual data annotation.
中文标题/摘要
标题:UI-Voyager:一种通过失败经验自我进化的GUI代理
随着多模态大型语言模型(MLLM)的发展,自主移动GUI代理引起了越来越多的关注。然而,现有的方法仍然在从失败轨迹中学习和稀疏奖励下长时间GUI任务的模糊信用分配方面效率低下。为此,我们提出了一种新颖的两阶段自我进化的移动GUI代理UI-Voyager。在第一阶段,我们采用拒绝微调(RFT),这使得数据和模型在完全自主的循环中持续共同进化。第二阶段引入了组相对自我蒸馏(GRSD),它识别出组展开中的关键分叉点,并从成功的轨迹中构建密集的步骤级监督,以纠正失败的轨迹。在AndroidWorld上的广泛实验表明,我们的4B模型实现了81.0%的Pass@1成功率,优于众多最近的基线,并超过了人类水平的性能。消融和案例研究进一步验证了GRSD的有效性。我们的方法代表了向高效、自我进化的高性能移动GUI自动化迈出的重要一步,无需昂贵的手动数据注释。
Summary / 总结
UI-Voyager is a novel self-evolving mobile GUI agent that addresses inefficient learning from failed trajectories and ambiguous credit assignment under sparse rewards. It consists of two stages: Rejection Fine-Tuning (RFT) for continuous co-evolution of data and models, and Group Relative Self-Distillation (GRSD), which identifies critical fork points in group rollouts and builds dense step-level supervision from successful trajectories to correct failed ones. Experiments on AndroidWorld show that UI-Voyager's 4B model achieves an 81.0% Pass@1 success rate, surpassing recent baselines and human-level performance.
UI-Voyager 是一种新型的自进化的移动GUI代理,旨在解决从失败轨迹学习效率低下和信用分配模糊的问题。它分为两个阶段:拒绝微调实现数据和模型的连续共进化,以及组相对自我蒸馏识别关键点并提供密集监督以纠正失败。在AndroidWorld上的实验显示,UI-Voyager的4B模型实现了81.0%的Pass@1成功率,超越了最近的基线和人类水平的表现。
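For the UI-Voyager entry above, one way to read "fork points" is the first step at which a failed rollout and a successful rollout share a state but choose different actions. The sketch below illustrates only that reading. It is an assumption, not the authors' GRSD algorithm: the trajectory representation, function names, and example actions are all hypothetical.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, str]  # (state_id, action); hypothetical string labels

def find_fork_point(failed: List[Step], success: List[Step]) -> Optional[int]:
    """First index where both rollouts are in the same state but act differently."""
    for i, ((s_fail, a_fail), (s_ok, a_ok)) in enumerate(zip(failed, success)):
        if s_fail != s_ok:
            return None  # states already differ, so no shared fork exists
        if a_fail != a_ok:
            return i
    return None

def step_level_targets(group: List[Tuple[List[Step], bool]]) -> List[Step]:
    """For each failed rollout, emit (state, action taken by a successful rollout)."""
    successes = [traj for traj, ok in group if ok]
    targets = []
    for traj, ok in group:
        if ok:
            continue
        for good in successes:
            i = find_fork_point(traj, good)
            if i is not None:
                targets.append((traj[i][0], good[i][1]))
                break
    return targets

# Invented rollouts for a "turn on Wi-Fi" task.
success = [("home", "open_settings"), ("settings", "tap_wifi"), ("wifi", "toggle_on")]
failed = [("home", "open_settings"), ("settings", "tap_bluetooth"), ("bluetooth", "toggle_on")]
group = [(success, True), (failed, False)]
targets = step_level_targets(group)
```

Each target pairs the state at the fork with the action the successful rollout took there, which gives a step-level training signal instead of a single sparse end-of-episode reward.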
Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification
Authors: Dipam Goswami, Simone Magistri, Gido M. van de Ven, Bartłomiej Twardowski, Andrew D. Bagdanov, Tinne Tuytelaars, Joost van de Weijer
First: 2026-03-25T17:04:43+00:00 · Latest: 2026-03-25T17:04:43+00:00
Comments: Preprint
Abstract
Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
中文标题/摘要
标题:跨模态原型对齐与混合以实现无需训练的少样本分类
视觉-语言模型(VLMs)如CLIP是通过使文本和图像配对对齐来训练的。为了提高基于CLIP的少样本图像分类性能,最近的研究发现,除了文本嵌入外,训练集中的图像嵌入也是重要信息来源之一。在本文中,我们研究了直接将图像和文本原型混合用于少样本分类的影响,并从偏差-方差角度进行了分析。我们展示了混合原型类似于收缩估计器。尽管混合原型可以提高分类性能,但图像原型仍然会以实例特定的背景或上下文信息的形式添加一些噪声。为了仅捕获与给定分类任务相关的图像空间信息,我们提出将图像原型投影到语义文本嵌入空间的主要方向上,以获得一个文本对齐的语义图像子空间。这些文本对齐的图像原型,当与文本嵌入混合时,进一步提高了分类性能。然而,对于CLIP中跨模态对齐较差的下游数据集,语义对齐可能不是最优的。我们展示了可以通过建模类协方差来利用图像子空间的各向异性。我们证明了结合一个文本对齐的混合原型分类器和一个图像特定的LDA分类器在少样本分类基准上优于现有方法。
Summary / 总结
This work investigates the impact of directly mixing image and text prototypes for few-shot classification using vision-language models like CLIP. It proposes projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace, which further improves classification performance. The method outperforms existing approaches across various few-shot classification benchmarks.
该研究探讨了直接混合图像和文本原型对少量样本分类的影响,结果显示虽然混合原型提高了性能,但引入了来自实例特定背景或上下文信息的噪声。为了解决这一问题,作者提出将图像原型投影到语义文本嵌入空间的主要方向上,以获得对齐的语义图像子空间,这进一步提高了分类性能。对于跨模态对齐较差的数据集,作者使用类协方差建模各向异性,并展示了结合对齐的混合原型分类器和图像特定的LDA分类器优于现有方法。
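For the prototype-mixing entry above, the two main steps (projecting image prototypes onto the principal directions of the text embedding space, then mixing them with text prototypes) can be sketched in NumPy. This is a minimal sketch under assumptions: the convex-combination form with weight `alpha`, the choice of `k`, and the random stand-in embeddings are illustrative and not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def image_prototypes(features, labels, num_classes):
    """Mean of the few-shot image embeddings for each class."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def text_principal_directions(text_protos, k):
    """Top-k right singular vectors of the centred text embeddings, shape (k, d)."""
    centred = text_protos - text_protos.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[:k]

def project(x, directions):
    """Orthogonal projection of the rows of x onto span(directions)."""
    return x @ directions.T @ directions

def mixed_prototypes(text_protos, img_protos, alpha):
    """Assumed convex combination; alpha = 1 recovers text-only prototypes."""
    return l2_normalize(alpha * text_protos + (1 - alpha) * img_protos)

def classify(queries, prototypes):
    """Nearest prototype by cosine similarity."""
    return np.argmax(l2_normalize(queries) @ prototypes.T, axis=1)

# Random stand-ins for CLIP embeddings: 3 classes, 4 shots each, dimension 8.
rng = np.random.default_rng(0)
text_protos = l2_normalize(rng.normal(size=(3, 8)))
labels = np.repeat(np.arange(3), 4)
features = text_protos[labels] + 0.3 * rng.normal(size=(12, 8))

img = image_prototypes(features, labels, num_classes=3)
dirs = text_principal_directions(text_protos, k=2)
aligned = project(img, dirs)                      # text-aligned image prototypes
mixed = mixed_prototypes(text_protos, aligned, alpha=0.5)
preds = classify(features, mixed)
```

Pulling the image prototypes toward the text prototypes is what gives the mixture its shrinkage-like behaviour, and the projection discards image directions that the text space does not span.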
Linguistic Comparison of AI- and Human-Written Responses to Online Mental Health Queries
Authors: Koustuv Saha, Yoshee Jain, Violeta J. Rodriguez, Munmun De Choudhury
Venue: npj Artificial Intelligence, 2026
First: 2025-04-12T16:20:02+00:00 · Latest: 2026-03-25T17:04:34+00:00
Abstract
The ubiquity and widespread use of digital and online technologies have transformed mental health support, with online mental health communities (OMHCs) providing safe spaces for peer support. More recently, generative AI and large language models (LLMs) have introduced new possibilities for scalable, around-the-clock mental health assistance that could potentially augment and supplement the capabilities of OMHCs. Although genAI shows promise in delivering immediate and personalized responses, its effectiveness in replicating the nuanced, experience-based support of human peers remains an open question. In this study, we harnessed 24,114 posts and 138,758 online community (OC) responses from 55 OMHCs on Reddit. We prompted several state-of-the-art LLMs (GPT-4-Turbo, Llama-3, and Mistral-7B) with these posts, and compared their responses to human-written (OC) responses based on a variety of linguistic measures across psycholinguistics and lexico-semantics. Our findings revealed that AI responses are more verbose, readable, and analytically structured, but lack linguistic diversity and personal narratives inherent in human-human interactions. Through a qualitative examination, we found validation as well as complementary insights into the nature of AI responses, such as its neutral stance and the absence of seeking back-and-forth clarifications. We discuss the ethical and practical implications of integrating generative AI into OMHCs, advocating for frameworks that balance AI's scalability and timeliness with the irreplaceable authenticity, social interactiveness, and expertise of human connections that form the ethos of online support communities.
中文标题/摘要
标题:人工智能与人类撰写的在线心理健康查询回复的语言比较
数字和在线技术的普及和广泛应用已经改变了心理健康支持的方式,线上心理健康社区(OMHCs)提供了安全的空间供同伴支持。近年来,生成式AI和大规模语言模型(LLMs)为提供可扩展的、全天候的心理健康援助带来了新的可能性,这些援助可以补充和增强OMHCs的能力。尽管生成式AI在提供即时和个性化的回复方面显示出潜力,但其在复制人类同伴的细腻、基于经验的支持方面的有效性仍是一个开放的问题。在本研究中,我们利用来自Reddit上55个OMHCs的24,114个帖子和138,758个在线社区(OC)回复的数据。我们用这些帖子提示了几种最先进的LLM(GPT-4-Turbo、Llama-3和Mistral-7B),并将它们的回复与人类撰写的(OC)回复进行了基于心理语言学和词汇语义学的各种语言度量的比较。我们的研究发现,AI回复更为冗长、易读且结构化,但在语言多样性和人类互动中固有的个人叙事方面存在不足。通过定性的分析,我们发现了AI回复的验证以及补充性的见解,例如其中立的立场和缺乏寻求进一步澄清的情况。我们讨论了将生成式AI整合到OMHCs中的伦理和实践意义,倡导平衡AI的可扩展性和及时性与人类连接的不可替代的真实、社会互动性和专业知识,这些构成了在线支持社区的核心价值观。
Summary / 总结
This study compares AI- and human-written responses to online mental health queries using 24,114 posts and 138,758 community responses from 55 online mental health communities (OMHCs) on Reddit. AI responses are more verbose, readable, and analytically structured, but lack the linguistic diversity and personal narratives of human responses. A qualitative examination adds that AI responses take a neutral stance and do not seek back-and-forth clarification. The study discusses the ethical and practical implications of integrating generative AI into OMHCs, advocating frameworks that balance AI's scalability and timeliness with the authenticity and social interactiveness of human connections.
本研究比较了来自55个Reddit社区的24,114个帖子和138,758个人类回复中的AI和人类回复。AI回复更为冗长、易于阅读且结构化,但缺乏语言多样性及个人叙述。定性分析显示,AI回复保持中立且不寻求进一步澄清,强调了平衡AI的可扩展性和及时性与人类的真实性和社交互动性的伦理框架的重要性。
No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions
Authors: Emily Schiller, Teodor Chiaburu, Marco Zullich, Luca Longo
First: 2026-03-25T17:02:13+00:00 · Latest: 2026-03-25T17:02:13+00:00
Comments: Accepted at the Fourth World Conference on Explainable Artificial Intelligence, xAI 2026, Fortaleza, Brazil, July 1-3, 2026
Abstract
Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent as studies rely on heterogeneous proxy tasks and metrics, hindering comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for systematic comparison and development of uncertainty attribution methods.
中文标题/摘要
标题:单一指标无法全面描述:不确定性归因的多维度评估框架
可解释人工智能(XAI)的研究主要集中在解释模型预测上。最近,提出了通过将不确定性归因于输入特征来解释预测不确定性的方法。然而,这些方法的评估仍然不一致,因为研究依赖于不同的代理任务和指标,阻碍了可比性。我们通过将不确定性归因与XAI评估中成熟的Co-12框架对齐来解决这一问题。我们提出了正确性、一致性、连续性和紧凑性等具体实现。此外,我们引入了传达性这一特性,专门针对不确定性归因,评估控制增加的本体论不确定性是否可靠地传播到特征级归因中。我们使用表格和图像数据的不确定性量化和特征归因方法组合中的八个指标展示了我们的评估框架。实验表明,基于梯度的方法在一致性和传达性方面始终优于基于扰动的方法,而蒙特卡洛DropConnect在大多数指标中优于蒙特卡洛Dropout。尽管大多数指标在样本间一致地排名方法,但方法间的一致性仍然很低。这表明没有单一的指标能够充分评估不确定性归因的质量。提出的评估框架为系统比较和开发不确定性归因方法奠定了基础。
Summary / 总结
This paper addresses inconsistent evaluation of uncertainty attributions in explainable AI by aligning them with the Co-12 framework for XAI evaluation. It proposes concrete implementations of the correctness, consistency, continuity, and compactness properties and introduces conveyance, which tests whether controlled increases in epistemic uncertainty propagate to feature-level attributions. The framework is demonstrated with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low, suggesting that no single metric sufficiently evaluates uncertainty attribution quality.
本文提出了一种多维度的评估框架,以解决不确定性归因在解释性人工智能中的评估不一致问题,该框架与Co-12框架对齐,并引入了正确性、一致性、连续性、紧凑性和传达性等属性。研究在表格和图像数据上对不同的不确定性量化和特征归因方法进行了八种指标的评估。研究发现,基于梯度的方法在一致性和传达性方面优于基于扰动的方法,而蒙特卡洛下采样连接法在大多数指标上优于蒙特卡洛下采样法。然而,没有单一的指标能够全面评估所有样本中的不确定性归因质量,这表明需要一个综合的评估框架。
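For the uncertainty-attribution entry above, one simple way to check whether two evaluation metrics rank a set of methods alike is the Spearman rank correlation. The sketch below is illustrative only: the paper does not state that it uses this statistic, and the scores are invented toy values, not results from the study.

```python
import numpy as np

def ranks(scores):
    """Rank 1 = highest score. Assumes the scores contain no ties."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    r = np.empty(len(order), dtype=int)
    r[order] = np.arange(1, len(order) + 1)
    return r

def spearman(scores_a, scores_b):
    """Spearman rank correlation for tie-free scores."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = np.sum((ra - rb) ** 2)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Toy scores for four hypothetical attribution methods under three metrics.
toy_scores = {
    "correctness": [0.9, 0.7, 0.5, 0.3],
    "compactness": [0.2, 0.4, 0.6, 0.8],
    "conveyance": [0.8, 0.9, 0.4, 0.3],
}
rho_opposed = spearman(toy_scores["correctness"], toy_scores["compactness"])
rho_similar = spearman(toy_scores["correctness"], toy_scores["conveyance"])
```

A correlation near 1 means two metrics order the methods the same way, while values near 0 or below indicate that they reward different behaviour, which is the situation in which a single metric is not enough.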