arXiv Paper Digest

2026-01-14 03:25
Snapshot: 20260114_0325
SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations
Authors: Mohammed Himayath Ali, Mohammed Aqib Abdullah, Mohammed Mudassir Uddin, Shahnawaz Alam
First: 2026-01-12T18:59:45+00:00 · Latest: 2026-01-12T18:59:45+00:00
Abstract
Large Language Models have emerged as transformative tools for Security Operations Centers, enabling automated log analysis, phishing triage, and malware explanation; however, deployment in adversarial cybersecurity environments exposes critical vulnerabilities to prompt injection attacks where malicious instructions embedded in security artifacts manipulate model behavior. This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails, adaptive constitution evolution, and Direct Preference Optimization for unlearning unsafe response patterns, addressing the unique challenges of high-stakes security contexts where traditional safety mechanisms prove insufficient against sophisticated adversarial manipulation. Experimental evaluation demonstrates that SecureCAI reduces attack success rates by 94.7% compared to baseline models while maintaining 95.1% accuracy on benign security analysis tasks, with the framework incorporating continuous red-teaming feedback loops enabling dynamic adaptation to emerging attack strategies and achieving constitution adherence scores exceeding 0.92 under sustained adversarial pressure, thereby establishing a foundation for trustworthy integration of language model capabilities into operational cybersecurity workflows and addressing a critical gap in current approaches to AI safety within adversarial domains.
中文标题/摘要
标题:SecureCAI:在对抗性网络安全环境中具有注入抗性的LLM辅助工具
大型语言模型已成为安全运营中心的变革性工具,能够实现自动化日志分析、钓鱼处理和恶意软件解释;然而,在对抗性网络安全环境中部署时,模型暴露于提示注入攻击中,恶意指令嵌入安全数据中,操控模型行为。本文介绍了SecureCAI,这是一种新颖的防御框架,结合了安全意识护栏、自适应宪法进化和直接偏好优化以消除不安全的响应模式,解决了传统安全机制在高风险安全环境中对抗复杂对手操纵不足的问题。实验评估表明,与基线模型相比,SecureCAI将攻击成功率降低了94.7%,同时在良性安全分析任务上的准确率保持在95.1%;框架还集成了持续的红队反馈循环,以实现动态适应新兴攻击策略,并在持续的对抗压力下实现超过0.92的宪法合规性得分,从而为将语言模型能力安全地集成到运营网络安全工作流中奠定了基础,并解决了当前对抗性领域中AI安全方法的关键空白。
Summary / 总结
SecureCAI is a defense framework that protects LLM assistants in cybersecurity operations from prompt injection attacks. It extends Constitutional AI principles with security-aware guardrails and adaptive constitution evolution, and uses Direct Preference Optimization to unlearn unsafe response patterns. In the reported evaluation, SecureCAI reduces attack success rates by 94.7% relative to baseline models while maintaining 95.1% accuracy on benign security analysis tasks. Continuous red-teaming feedback loops let it adapt to new attack strategies, and constitution adherence scores exceed 0.92 under sustained adversarial pressure.
SecureCAI 是一种防御框架,旨在保护大型语言模型免受网络安全操作中的提示注入攻击。它使用宪法AI原则结合安全意识护栏、自适应宪法进化和直接偏好优化来消除不安全的响应模式。SecureCAI 将攻击成功率显著降低94.7%,同时在良性任务上保持95.1%的准确性,并通过持续的红队反馈循环动态适应新的攻击策略,从而在持续的 adversarial 压力下实现高宪法一致性得分。
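The abstract names Direct Preference Optimization as the mechanism for unlearning unsafe response patterns. As a rough illustration only, the standard DPO objective for a single preference pair can be sketched as below. The function name, the argument names, and the default `beta=0.1` are assumptions made for this sketch; the abstract does not say how SecureCAI builds its preference pairs or sets its hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All inputs are sequence-level log-probabilities. The names and the
    default beta are illustrative assumptions, not values from the paper.
    """
    # How much more the policy likes each response than the frozen reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The loss is lower when the policy favours the safe (chosen) response.
safe_preferred = dpo_loss(-1.0, -5.0, -2.0, -2.0)
unsafe_preferred = dpo_loss(-5.0, -1.0, -2.0, -2.0)
```

In a security setting the "chosen" response would plausibly be the one that ignores an injected instruction and the "rejected" one the response that follows it, but that pairing is an assumption here.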
Tuning-free Visual Effect Transfer across Videos
Authors: Maxwell Jones, Rameen Abdal, Or Patashnik, Ruslan Salakhutdinov, Sergey Tulyakov, Jun-Yan Zhu, Kuan-Chieh Jackson Wang
First: 2026-01-12T18:59:32+00:00 · Latest: 2026-01-12T18:59:32+00:00
Comments: Project Page: https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/
Abstract
We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects such as dynamic lighting changes or character transformations, which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model must integrate the new temporal dynamics with the input video's existing motion and appearance. To address this, we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input's motion and structure while transforming it based on some fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website: https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/
中文标题/摘要
标题:无需调参的视频视觉效果转移
我们提出了一种名为RefVFX的新框架,该框架能够以端到端的方式将参考视频中的复杂时间效果转移到目标视频或图像上。现有方法在基于提示或关键帧条件的编辑方面表现出色,但在处理动态时间效果(如动态光照变化或角色变形)方面存在困难,这些效果难以通过文本或静态条件描述。将视频效果转移是一项挑战,因为模型必须将新的时间动态与输入视频的现有运动和外观相结合。为此,我们引入了一个大规模的三元组数据集,其中每个三元组包含一个参考效果视频、一个输入图像或视频以及一个显示转移效果的对应输出视频。创建这些数据并不容易,尤其是自然不存在的视频到视频效果三元组。为此,我们提出了一种可扩展的自动化管道,该管道可以生成高质量的配对视频,旨在保留输入的运动和结构,同时基于某些固定且可重复的效果进行转换。然后,我们使用LoRA适配器和代码生成的基于程序组合的时间效果对该数据集进行扩充。基于我们新构建的数据集,我们使用最新的文本到视频骨干网络训练参考条件模型。实验结果表明,RefVFX生成的编辑效果在视觉上一致且时间上连贯,能够跨未见过的效果类别泛化,并在定量指标和人类偏好方面优于仅基于提示的基线。请访问我们的网站:https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/
Optimal Learning Rate Schedule for Balancing Effort and Performance
Authors: Valentina Njaradi, Rodrigo Carrasco-Davis, Peter E. Latham, Andrew Saxe
First: 2026-01-12T18:59:07+00:00 · Latest: 2026-01-12T18:59:07+00:00
Abstract
Learning how to learn efficiently is a fundamental challenge for biological agents and a growing concern for artificial ones. To learn effectively, an agent must regulate its learning speed, balancing the benefits of rapid improvement against the costs of effort, instability, or resource use. We introduce a normative framework that formalizes this problem as an optimal control process in which the agent maximizes cumulative performance while incurring a cost of learning. From this objective, we derive a closed-form solution for the optimal learning rate, which has the form of a closed-loop controller that depends only on the agent's current and expected future performance. Under mild assumptions, this solution generalizes across tasks and architectures and reproduces numerically optimized schedules in simulations. In simple learning models, we can mathematically analyze how agent and task parameters shape learning-rate scheduling as an open-loop control solution. Because the optimal policy depends on expectations of future performance, the framework predicts how overconfidence or underconfidence influence engagement and persistence, linking the control of learning speed to theories of self-regulated learning. We further show how a simple episodic memory mechanism can approximate the required performance expectations by recalling similar past learning experiences, providing a biologically plausible route to near-optimal behaviour. Together, these results provide a normative and biologically plausible account of learning speed control, linking self-regulated learning, effort allocation, and episodic memory estimation within a unified and tractable mathematical framework.
中文标题/摘要
标题:平衡努力与表现的最佳学习率计划
如何高效学习是生物代理面临的基本挑战,也是人工代理日益关注的问题。为了有效学习,代理必须调节其学习速度,平衡快速改进的好处与努力、不稳定或资源使用的成本。我们提出了一种规范性框架,将此问题形式化为一个最优控制过程,在此过程中,代理最大化累积表现的同时承担学习成本。从这一目标出发,我们推导出最优学习率的闭式解,其形式为仅依赖于代理当前和预期未来表现的闭环控制器。在温和的假设下,该解在不同任务和架构下具有普适性,并在模拟中再现了数值优化的学习率计划。在简单的学习模型中,我们可以从数学上分析代理和任务参数如何塑造学习率调度作为开环控制解。由于最优策略依赖于对未来表现的预期,该框架预测了过度自信或欠自信如何影响参与度和坚持度,将学习速度的控制与自我调节学习理论联系起来。我们还展示了如何通过回忆类似过去的学经验来近似所需的性能预期的简单片段记忆机制,提供了一种生物学上可实现的接近最优行为的途径。这些结果共同提供了一种规范性和生物学上可实现的学习速度控制的解释,将自我调节学习、努力分配和片段记忆估计统一在一个可处理的数学框架中。
Summary / 总结
The paper studies how an agent, biological or artificial, should regulate its learning speed, balancing the benefits of rapid improvement against the costs of effort, instability, and resource use. It formalizes this as an optimal control problem and derives a closed-form solution for the optimal learning rate: a closed-loop controller that depends only on the agent's current and expected future performance. The solution reproduces numerically optimized schedules in simulations, and in simple learning models the authors analyze how agent and task parameters shape the schedule. Because the policy depends on expected future performance, the framework links learning speed control to self-regulated learning, and a simple episodic memory mechanism can approximate the required expectations.
论文探讨了生物和人工代理的学习率调度问题,旨在平衡学习速度与性能和努力。它将这一问题形式化为最优控制问题,并推导出学习率的闭式解。该解在模拟和简单的学习模型中进行了测试,展示了它能够重现数值优化的调度,并将学习速度与自我调节学习和情景记忆估计联系起来。框架预测了过度自信或欠自信如何影响学习参与和持续性,提供了接近最优行为的生物可实现解释。
Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests
Authors: Manar Ali, Judith Sieker, Sina Zarrieß, Hendrik Buschmeier
First: 2026-01-12T18:53:09+00:00 · Latest: 2026-01-12T18:53:09+00:00
Abstract
In human conversation, both interlocutors play an active role in maintaining mutual understanding. When addressees are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar addressee role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a good testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.
中文标题/摘要
标题:将参考游戏作为模型不确定性与澄清请求对齐的测试平台
在人类对话中,双方都积极参与维持相互理解。当受话人对说话人意思不确定时,他们可以请求澄清。语言模型是否能承担类似受话人的角色,通过请求澄清来识别和表达自己的不确定性,仍是一个开放性问题。我们认为,参考游戏是一个很好的测试平台,因为它们是可控的、自包含的,并且使澄清需求明确且可测量。为了测试这一点,我们评估了三种视觉-语言模型,将基准参考解析任务与模型在不确定时被指示请求澄清的任务进行了比较。结果表明,即使在如此简单的任务中,模型也常常难以识别内部不确定性并将其转化为适当的澄清行为。这表明参考游戏作为测试语言模型(视觉和语言)交互质量的平台的价值。
Summary / 总结
The research aims to explore whether language models can recognize and express their own uncertainty by requesting clarification, similar to human conversation. The study uses reference games as a controlled testbed to evaluate three vision-language models under two conditions: a baseline reference resolution task and an experiment where models are instructed to request clarification when uncertain. The findings indicate that models often fail to recognize their internal uncertainty and translate it into appropriate clarification behavior, highlighting the need for better alignment of model uncertainty and clarification requests in language models.
研究旨在探索语言模型是否能够通过请求澄清来识别和表达自己的不确定性,类似于人类对话。研究使用参考游戏作为受控测试床,评估三种视觉-语言模型在两种条件下的表现:基线参考解析任务和一个实验条件,在此条件下模型在不确定时请求澄清。研究结果表明,模型往往无法识别其内部的不确定性并将其转化为适当的澄清行为,这突显了在语言模型中更好地对齐模型不确定性与澄清请求的重要性。
Learning the Value of Value Learning
Authors: Alex John London, Aydin Mohseni
First: 2025-11-21T19:06:30+00:00 · Latest: 2026-01-12T18:50:10+00:00
Comments: 19 pages, 6 figures, mathematical appendix
Abstract
Standard decision frameworks address uncertainty about facts but assume fixed options and values. We extend the Jeffrey-Bolker framework to model refinements in values and prove a value-of-information theorem for axiological refinement. In multi-agent settings, we establish that mutual refinement will characteristically transform zero-sum games into positive-sum interactions and yield Pareto-improvements in Nash bargaining. These results show that a framework of rational choice can be extended to model value refinement. By unifying epistemic and axiological refinement under a single formalism, we broaden the conceptual foundations of rational choice and illuminate the normative status of ethical deliberation.
中文标题/摘要
标题:学习价值学习的价值
标准决策框架处理关于事实的不确定性,但假设选项和价值固定不变。我们扩展了Jeffrey-Bolker框架以建模价值的细化,并证明了关于价值细化的信息价值定理。在多智能体环境中,我们证明相互细化通常会将零和博弈转变为正和互动,并产生纳什讨价还价中的帕累托改进。这些结果表明,一个理性的选择框架可以扩展以建模价值细化。通过将知识细化和价值细化统一到单一的形式化框架中,我们拓宽了理性选择的概念基础,并阐明了伦理讨论的规范地位。
CLAPS: Posterior-Aware Conformal Intervals via Last-Layer Laplace
Authors: Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
First: 2025-12-01T07:58:21+00:00 · Latest: 2026-01-12T18:49:06+00:00
Comments: Revised for clarity and correctness; improved exposition and fixed minor issues
Abstract
We present CLAPS, a posterior-aware conformal regression method that pairs a Last-Layer Laplace Approximation with split-conformal calibration. From the resulting Gaussian posterior, CLAPS defines a simple two-sided posterior CDF score that aligns the conformity metric with the full predictive shape, not just a point estimate. This alignment can yield substantially narrower prediction intervals at a fixed target coverage, particularly on small to medium tabular datasets where data are scarce and uncertainty modeling is informative. We also provide a lightweight diagnostic suite that separates aleatoric and epistemic components and visualizes posterior behavior, helping practitioners assess when and why intervals shrink. Across multiple benchmarks using the same MLP backbone, CLAPS achieves nominal coverage and offers the most efficient intervals on small to medium datasets with mild heterogeneity, while remaining competitive and diagnostically transparent on large-scale heterogeneous data where Normalized-CP and CQR attain the tightest intervals.
中文标题/摘要
标题:CLAPS:基于后验的最后层拉普拉斯逼近与分割校准的区间方法
我们提出了CLAPS,一种基于后验的回归方法,结合了最后层拉普拉斯逼近和分割校准。从得到的高斯后验中,CLAPS 定义了一个简单的双侧后验CDF分数,使一致性度量与完整的预测形状对齐,而不仅仅是点估计。这种对齐可以在固定目标覆盖率下显著减小预测区间,特别是在小到中型表格数据集中,数据稀缺且不确定性建模是有信息性的。我们还提供了一个轻量级的诊断套件,将 aleatoric 和 epistemic 组件分离并可视化后验行为,帮助实践者评估区间缩小的原因和时机。在使用相同MLP骨干网络的多个基准测试中,CLAPS 达到名义覆盖率,并在小到中型具有轻微异方性的数据集上提供最高效的区间,同时在大规模异方性数据集上保持竞争力和诊断透明度,而 Normalized-CP 和 CQR 获得了最紧的区间。
Summary / 总结
CLAPS is a posterior-aware conformal regression method that uses Last-Layer Laplace Approximation with split-conformal calibration to generate prediction intervals. It defines a two-sided posterior CDF score that aligns the conformity metric with the full predictive distribution, leading to narrower intervals at a fixed coverage on small to medium datasets. CLAPS also includes a diagnostic suite to separate and visualize aleatoric and epistemic uncertainties, aiding in the assessment of interval shrinkage. On benchmarks, CLAPS achieves nominal coverage and offers the most efficient intervals on small to medium datasets, while remaining competitive on large-scale heterogeneous data.
CLAPS 是一种后验感知的回归方法,通过使用 Last-Layer Laplace 近似与分劈校准来生成预测区间。通过将一致性度量与完整的预测分布对齐,CLAPS 可以在固定覆盖率水平下生成更窄的区间,尤其是在小型到中型的表格数据集上。它还包含一个诊断套件,用于分离和可视化 aleatoric 和 epistemic 不确定性,帮助从业者理解区间缩小的原因。在使用相同 MLP 基础模型的基准测试中,CLAPS 达到名义覆盖率,并在小型到中型数据集上提供最高效的区间,同时在大规模异质数据集上仍然具有竞争力。
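To make the pairing of a Gaussian predictive distribution with split-conformal calibration more concrete, here is a minimal sketch. It assumes the two-sided posterior CDF score takes the form |2F(y) - 1|, where F is the Gaussian predictive CDF; the abstract does not give the exact formula, so this form, the function names, and the fallback for tiny calibration sets are all assumptions.

```python
import math
from statistics import NormalDist


def cdf_score(y, mu, sigma):
    """Assumed two-sided CDF score in [0, 1]; 0 at the predictive median."""
    return abs(2.0 * NormalDist(mu, sigma).cdf(y) - 1.0)


def calibrate(scores, alpha):
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return 1.0  # too few calibration points: fall back to the trivial bound
    return sorted(scores)[k - 1]


def interval(mu, sigma, q_hat):
    """All y whose score is at most q_hat, for a Gaussian predictive N(mu, sigma^2)."""
    if q_hat >= 1.0:
        return (-math.inf, math.inf)
    z = NormalDist().inv_cdf((1.0 + q_hat) / 2.0)
    return (mu - z * sigma, mu + z * sigma)


# Hypothetical calibration scores, standing in for held-out (y, mu, sigma) triples.
q_hat = calibrate([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], alpha=0.5)
lo, hi = interval(mu=2.0, sigma=1.0, q_hat=q_hat)
```

Because the score is defined on the CDF scale, the interval width scales with each test point's own sigma, which is one way a score can follow the full predictive shape rather than only a point estimate.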
ORACLE: Explaining Feature Interactions in Neural Networks with ANOVA
Authors: Dongseok Kim, Hyoungsun Choi, Mohamed Jismy Aashik Rasool, Gisung Oh
First: 2025-09-13T14:44:45+00:00 · Latest: 2026-01-12T18:46:44+00:00
Comments: v4: Revised for clarity and correctness; improved exposition and fixed minor issues
Abstract
We introduce ORACLE, a framework for explaining neural networks on tabular data and scientific factorial designs. ORACLE summarizes a trained network's prediction surface with main effects and pairwise interactions by treating the network as a black-box response, discretizing the inputs onto a grid, and fitting an orthogonal factorial (ANOVA-style) surrogate -- the $L^2$ orthogonal projection of the model response onto a finite-dimensional factorial subspace. A simple centering and $μ$-rebalancing step then expresses this surrogate as main- and interaction-effect tables that remain faithful to the original model in the $L^2$ sense. The resulting grid-based interaction maps are easy to visualize, comparable across backbones, and directly aligned with classical design-of-experiments practice. On synthetic factorial benchmarks and low- to medium-dimensional tabular regression tasks, ORACLE more accurately recovers ground-truth interaction structure and hotspots than Monte Carlo SHAP-family interaction methods, as measured by ranking, localization, and cross-backbone stability. We also discuss its scope in latent image and text settings: grid-based factorial surrogates are most effective when features admit an interpretable factorial structure, making ORACLE particularly well-suited to scientific and engineering workflows that require stable DoE-style interaction summaries.
中文标题/摘要
标题:ORACLE:使用ANOVA解释神经网络中的特征交互
我们引入了ORACLE框架,用于在表格数据和科学因子设计上解释神经网络。ORACLE通过将网络视为黑盒响应,将输入离散化到网格上,并拟合正交因子(ANOVA风格)的替代模型——模型响应在有限维因子子空间上的$L^2$正交投影,来总结训练网络的预测表面,包括主效应和两两交互。通过简单的中心化和$μ$重新平衡步骤,将此替代模型表示为主效应和交互效应表,这些表在$L^2$意义上忠实于原始模型。基于网格的交互图易于可视化,可以在不同骨干网络之间进行比较,并直接与经典的设计实验实践对齐。在合成因子基准和低至中维表格回归任务上,ORACLE比蒙特卡洛SHAP族交互方法更准确地恢复了真实交互结构和热点,这通过排名、定位和跨骨干网络稳定性来衡量。我们还讨论了其在潜在图像和文本设置中的适用范围:基于网格的因子替代模型最有效的情况是特征允许可解释的因子结构,这使ORACLE特别适合需要稳定的设计实验风格交互总结的科学和工程工作流程。
Summary / 总结
ORACLE is a framework for explaining feature interactions in neural networks by treating them as black-box responses and fitting an orthogonal factorial surrogate. It discretizes inputs onto a grid and projects the model response onto a factorial subspace, resulting in main and interaction-effect tables that are faithful to the original model. ORACLE outperforms Monte Carlo SHAP-family methods in recovering ground-truth interaction structures and hotspots on synthetic benchmarks and tabular regression tasks, as measured by ranking, localization, and cross-backbone stability.
ORACLE 是一个框架,用于通过将神经网络视为黑盒响应并拟合正交因子近似来解释特征交互。它将输入离散化到网格上,并将模型响应投影到因子子空间中,从而生成与原始模型忠实的主效应和交互效应表。ORACLE 在合成基准和表格回归任务上的排名、定位和跨骨干稳定性方面优于蒙特卡洛 SHAP 家族方法,能够更准确地恢复真实的交互结构和热点。
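The grid-based factorial surrogate can be illustrated on the smallest possible case: two inputs, each discretized to two levels. The sketch below performs a classical two-way ANOVA decomposition with uniform weights over grid cells. The `black_box` function is a hypothetical stand-in for a trained network, and the uniform weighting is an assumption; the paper's centering and $μ$-rebalancing step may differ.

```python
def anova_tables(grid):
    """Split a 2-factor response grid into grand mean, main effects, interaction.

    Uses uniform weights over grid cells (an assumption for this sketch).
    """
    rows, cols = len(grid), len(grid[0])
    mu = sum(sum(r) for r in grid) / (rows * cols)
    # Main effect of factor 1: row mean minus the grand mean.
    a = [sum(r) / cols - mu for r in grid]
    # Main effect of factor 2: column mean minus the grand mean.
    b = [sum(grid[i][j] for i in range(rows)) / rows - mu for j in range(cols)]
    # Interaction: whatever the additive part cannot explain.
    ab = [[grid[i][j] - mu - a[i] - b[j] for j in range(cols)]
          for i in range(rows)]
    return mu, a, b, ab


def black_box(x1, x2):
    """Hypothetical stand-in for a trained network's prediction."""
    return 1.0 + 2.0 * x1 - x2 + 3.0 * x1 * x2


levels = [0.0, 1.0]
grid = [[black_box(u, v) for v in levels] for u in levels]
mu, a, b, ab = anova_tables(grid)
```

Each effect table is centred (its rows and columns sum to zero), and adding the grand mean, the two main effects, and the interaction reproduces every grid cell exactly.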
More Images, More Problems? A Controlled Analysis of VLM Failure Modes
Authors: Anurag Das, Adrian Bulat, Alberto Baldrati, Ioannis Maniadis Metaxas, Bernt Schiele, Georgios Tzimiropoulos, Brais Martinez
First: 2026-01-12T18:45:13+00:00 · Latest: 2026-01-12T18:45:13+00:00
Comments: 19 pages, 16 figures
Abstract
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored. While existing benchmarks have initiated the evaluation of multi-image models, a comprehensive analysis of their core weaknesses and their causes is still lacking. In this work, we introduce MIMIC (Multi-Image Model Insights and Challenges), a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs. Using MIMIC, we conduct a series of diagnostic experiments that reveal pervasive issues: LVLMs often fail to aggregate information across images and struggle to track or attend to multiple concepts simultaneously. To address these failures, we propose two novel complementary remedies. On the data side, we present a procedural data-generation strategy that composes single-image annotations into rich, targeted multi-image training examples. On the optimization side, we analyze layer-wise attention patterns and derive an attention-masking scheme tailored for multi-image inputs. Experiments substantially improved cross-image aggregation, while also enhancing performance on existing multi-image benchmarks, outperforming prior state of the art across tasks. Data and code will be made available at https://github.com/anurag-198/MIMIC.
中文标题/摘要
标题:更多图像,更多问题?对VLM失败模式的控制分析
大型视觉语言模型(LVLMs)展示了卓越的能力,但它们在理解和推理多个图像方面的熟练程度仍鲜有探索。尽管现有基准已经启动了多图像模型的评估,但对其核心弱点及其原因的全面分析仍然缺乏。在本文中,我们引入了MIMIC(多图像模型见解与挑战),这是一个新的基准,旨在严格评估LVLMs的多图像能力。使用MIMIC,我们进行了一系列诊断实验,揭示了普遍存在的问题:LVLMs经常无法在图像间汇总信息,并且难以同时跟踪或关注多个概念。为解决这些失败,我们提出了两种新的互补补救措施。在数据方面,我们提出了一种过程化的数据生成策略,将单图像注释组合成丰富的、有针对性的多图像训练示例。在优化方面,我们分析了逐层注意力模式,并推导出一种针对多图像输入的注意力掩蔽方案。实验显著提高了跨图像聚合能力,同时也在现有多图像基准测试中提高了性能,超越了先前的最先进水平。数据和代码将在https://github.com/anurag-198/MIMIC上提供。
Summary / 总结
This study addresses the limitations of Large Vision Language Models (LVLMs) in handling multiple images by introducing MIMIC, a new benchmark. The research reveals that LVLMs struggle to aggregate information across images and have difficulty tracking multiple concepts simultaneously. To improve these capabilities, the authors propose a procedural data-generation strategy and an attention-masking scheme, which significantly enhance cross-image aggregation and outperform previous state-of-the-art models on multi-image benchmarks.
该研究通过引入MIMIC新基准,解决了大型视觉语言模型(LVLM)在处理多张图片时的局限性。通过诊断实验,研究发现LVLM难以在图片间聚合信息和同时跟踪多个概念。为改善这些能力,作者提出了一种程序化数据生成策略和注意力掩码方案,从而提高了跨图片聚合能力和在现有多图片基准上的表现,超越了之前的最先进模型在多种任务中的表现。
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Authors: Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev
First: 2025-09-29T21:40:10+00:00 · Latest: 2026-01-12T18:44:30+00:00
Comments: Code: https://github.com/ontocord/mixturevitae
Abstract
We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong downstream performance. MixtureVitae follows a permissive-first, risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources). MixtureVitae adopts a simple, single-stage pretraining recipe that integrates a large proportion of permissive synthetic instruction and reasoning data, signals typically introduced during post-training and generally scarce in permissive web corpora. We categorize all sources into a three-tier scheme that reflects varying risk levels and provide shard-level provenance metadata to enable risk-aware usage. In controlled experiments using the open-sci-ref training protocol (fixed architectures and hyperparameters; 50B and 300B token budgets across 130M-1.7B parameters), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B-parameters/300B-tokens setting, they surpass FineWeb-Edu and approach DCLM late in training. Performance is particularly strong on MMLU and on math and code benchmarks: a 1.7B model pretrained on 300B MixtureVitae tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP, despite using over 36 times fewer tokens (300B vs. ~11T). Supported by a thorough decontamination analysis, these results show that permissive-first data with high instruction and reasoning density, tiered by licensing and provenance-related risk, can provide a practical and risk-mitigated foundation for training capable LLMs, reducing reliance on broad web scrapes without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
中文标题/摘要
标题:MixtureVitae:开放的网络规模预训练数据集,基于宽松许可的文本源构建,包含高质量的指令和推理数据
我们介绍了MixtureVitae,这是一个开放访问的预训练语料库,旨在最小化法律风险同时提供强大的下游性能。MixtureVitae 采用了一种宽松许可优先、风险缓解的采样策略,结合了公共领域和宽松许可的文本(例如CC-BY/Apache)以及经过仔细验证的低风险添加(例如政府作品和欧盟TDM合格来源)。MixtureVitae 采用了一种简单的单阶段预训练配方,整合了大量的宽松许可合成指令和推理数据信号,这些信号通常在后训练阶段引入,而在宽松许可的网络语料库中通常较为稀缺。我们将所有来源分为三个层级,反映不同的风险级别,并提供分片级别的来源元数据,以支持风险意识使用。在使用开放科学参考训练协议(固定架构和超参数;130M-1.7B参数,50B和300B令牌预算)的受控实验中,使用MixtureVitae 训练的模型在一系列标准基准测试中始终优于其他宽松许可数据集,在1.7B参数/300B令牌设置下,它们超过了FineWeb-Edu,并接近DCLM的后期训练表现。特别是在MMLU和数学、代码基准测试中,表现尤为突出:一个使用300B MixtureVitae令牌预训练的1.7B模型在GSM8K、HumanEval和MBPP基准测试中达到了或超过了强大的1.7B指令调优基线,尽管使用了超过36倍少的令牌(300B vs. ~11T)。通过彻底的去污分析支持,这些结果表明,基于许可优先的数据,按许可证和来源相关风险分级,可以提供一种实用且风险缓解的基础,用于训练强大的语言模型,减少对广泛网络抓取的依赖,而不牺牲竞争力。
Summary / 总结
MixtureVitae is an open-access pretraining corpus built to minimize legal risk while maintaining strong downstream performance. It combines public-domain and permissively licensed text with carefully justified low-risk additions, and integrates a large proportion of permissive synthetic instruction and reasoning data in a single-stage pretraining recipe. Models trained on MixtureVitae outperform other permissive datasets across standard benchmarks, with particularly strong results on MMLU, math, and code: a 1.7B model pretrained on 300B tokens matches or exceeds a strong 1.7B instruction-tuned baseline on GSM8K, HumanEval, and MBPP despite using over 36 times fewer tokens (300B vs. ~11T). The results suggest that permissive-first data with high instruction and reasoning density can support training capable language models.
MixtureVitae 是一个开源预训练数据集,结合了公共领域和许可许可的文本,并选择了低风险的添加内容以最小化法律风险,同时提供强大的下游性能。它使用简单的单阶段预训练配方,整合了大量的合成指令和推理数据。使用 MixtureVitae 训练的模型在各种基准测试中表现优于其他许可数据集,特别是在 MMLU、数学和代码基准测试中,一个 1.7B 模型在 300B 令牌上预训练的表现与一个强大的 1.7B 指令调优基线相当,尽管使用了显著较少的令牌。
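The "over 36 times fewer tokens" figure in the abstract follows directly from the two token budgets it quotes. A quick check, treating the baseline's "~11T" as exactly 11 trillion (an approximation, since the abstract gives only a rounded value):

```python
def token_ratio(baseline_tokens, pretraining_tokens):
    """How many times more tokens the baseline consumed."""
    return baseline_tokens / pretraining_tokens


# 300B MixtureVitae tokens vs. roughly 11T for the instruction-tuned baseline.
ratio = token_ratio(11e12, 300e9)
print(f"{ratio:.1f}x")  # prints 36.7x
```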
The Confidence Trap: Gender Bias and Predictive Certainty in LLMs
Authors: Ahmed Sabir, Markus Kängsepp, Rajesh Sharma
Venue: AAAI 2026 Oral
First: 2026-01-12T18:38:05+00:00 · Latest: 2026-01-12T18:38:05+00:00
Comments: AAAI 2026 (AISI Track), Oral. Project page: https://bit.ly/4p8OKQD
Abstract
The increased use of Large Language Models (LLMs) in sensitive domains leads to growing interest in how their confidence scores correspond to fairness and bias. This study examines the alignment between LLM-predicted confidence and human-annotated bias judgments. Focusing on gender bias, the research investigates probability confidence calibration in contexts involving gendered pronoun resolution. The goal is to evaluate if calibration metrics based on predicted confidence scores effectively capture fairness-related disparities in LLMs. The results show that, among the six state-of-the-art models, Gemma-2 demonstrates the worst calibration according to the gender bias benchmark. The primary contribution of this work is a fairness-aware evaluation of LLMs' confidence calibration, offering guidance for ethical deployment. In addition, we introduce a new calibration metric, Gender-ECE, designed to measure gender disparities in resolution tasks.
中文标题/摘要
标题:自信陷阱:性别偏见和LLM预测确定性
在敏感领域中大型语言模型(LLMs)使用量的增加引发了对其自信评分与公平性及偏见之间关系的兴趣。本研究探讨了LLM预测自信与人类标注偏见判断之间的契合度。研究重点放在性别偏见上,调查了涉及性别代词解析的自信概率校准情况。目标是评估基于预测自信评分的校准指标是否能有效捕捉LLMs中的公平性相关差异。结果显示,在六个最先进的模型中,Gemma-2在性别偏见基准测试中的校准最差。本工作的主要贡献是对LLMs的自信校准进行了公平性意识评估,为伦理部署提供了指导。此外,我们还引入了一个新的校准指标,性别ECE,用于衡量性别差异在解析任务中的表现。
Summary / 总结
This study investigates how the confidence scores of Large Language Models (LLMs) align with gender bias, focusing on probability confidence calibration in gendered pronoun resolution. Among six state-of-the-art models, Gemma-2 shows the worst calibration according to the gender bias benchmark. The research introduces a new metric, Gender-ECE, to measure gender disparities in resolution tasks, contributing to a fairness-aware evaluation of LLMs' confidence calibration for ethical deployment.
该研究探讨了大型语言模型(LLMs)的置信分数与性别偏见之间的关联,重点关注性别代词解析中的概率置信校准。在六个最先进的模型中,Gemma-2在性别偏见基准测试中的校准效果最差。研究引入了新的校准度量Gender-ECE,用于衡量解析任务中的性别差异,为LLMs的置信校准提供公平性评估,以指导伦理部署。
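For readers unfamiliar with calibration metrics, the sketch below computes the standard Expected Calibration Error (ECE) with equal-width bins, then evaluates it separately per group. The per-group breakdown is only an assumed illustration of how a gender-aware calibration measure might be built; the abstract does not define Gender-ECE, and the group labels and data here are hypothetical.

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, ok))
    total = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            # Weight each bin's |accuracy - confidence| gap by its share of samples.
            total += len(b) / n * abs(acc - avg_conf)
    return total


def ece_by_group(confidences, correct, groups, n_bins=10):
    """ECE computed separately for each group label (an assumed breakdown)."""
    out = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        out[g] = ece([confidences[i] for i in idx],
                     [correct[i] for i in idx], n_bins)
    return out


# Hypothetical pronoun-resolution predictions.
conf = [0.9, 0.9, 0.6, 0.6]
hits = [True, True, True, False]
pronoun = ["she", "she", "he", "he"]
per_group = ece_by_group(conf, hits, pronoun)
```

Comparing the per-group values (for example, their absolute difference) is one simple way to surface a calibration disparity, though the paper's metric may be defined differently.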
Discovering Coordinated Joint Options via Inter-Agent Relative Dynamics
Authors: Raul D. Steleac, Mohan Sridharan, David Abel
First: 2025-12-31T12:39:22+00:00 · Latest: 2026-01-12T18:29:50+00:00
Abstract
Temporally extended actions improve the ability to explore and plan in single-agent settings. In multi-agent settings, the exponential growth of the joint state space with the number of agents makes coordinated behaviours even more valuable. Yet, this same exponential growth renders the design of multi-agent options particularly challenging. Existing multi-agent option discovery methods often sacrifice coordination by producing loosely coupled or fully independent behaviours. Toward addressing these limitations, we describe a novel approach for multi-agent option discovery. Specifically, we propose a joint-state abstraction that compresses the state space while preserving the information necessary to discover strongly coordinated behaviours. Our approach builds on the inductive bias that synchronisation over agent states provides a natural foundation for coordination in the absence of explicit objectives. We first approximate a fictitious state of maximal alignment with the team, the \textit{Fermat} state, and use it to define a measure of \textit{spreadness}, capturing team-level misalignment on each individual state dimension. Building on this representation, we then employ a neural graph Laplacian estimator to derive options that capture state synchronisation patterns between agents. We evaluate the resulting options across multiple scenarios in two multi-agent domains, showing that they yield stronger downstream coordination capabilities compared to alternative option discovery methods.
中文标题/摘要
标题:通过代理相对动力学发现协调联合选项
时间扩展的动作在单代理环境中提高了探索和规划的能力。在多代理环境中,随着代理数量的增加,联合状态空间的指数增长使得协调行为更加有价值。然而,这种指数增长使得多代理选项的设计变得尤为具有挑战性。现有的多代理选项发现方法往往通过产生松散耦合或完全独立的行为来牺牲协调性。为了解决这些限制,我们描述了一种新的多代理选项发现方法。具体来说,我们提出了一种联合状态抽象,该抽象压缩了状态空间,同时保留了发现强烈协调行为所需的信息。我们的方法基于归纳偏见,即代理状态的同步为在没有明确目标的情况下协调提供了自然的基础。我们首先近似一个最大对齐的虚构状态,即“费马”状态,并使用它来定义一个“分散度”的度量,捕捉每个个体状态维度上的团队级不对齐。在此表示的基础上,我们然后使用神经图拉普拉斯估计器来推导出捕捉代理间状态同步模式的选项。我们在两个多代理领域中的多个场景中评估了这些选项,结果显示它们在下游协调能力方面优于其他选项发现方法。
Summary / 总结
The paper addresses the challenge of discovering coordinated joint options in multi-agent systems, where the exponential growth of the joint state space complicates the design of coordinated behaviors. It introduces a novel approach that uses a joint-state abstraction to compress the state space while preserving the necessary information for discovering strongly coordinated behaviors. The method leverages the inductive bias of state synchronization to define a measure of spreadness and employs a neural graph Laplacian estimator to derive options that capture state synchronization patterns between agents. Experimental results demonstrate that the proposed options yield stronger coordination capabilities compared to existing methods in various scenarios.
论文旨在解决在多智能体系统中发现协调联合选项的挑战,由于联合状态空间的指数增长使得设计协调行为变得复杂。作者提出了一种新颖的方法,通过联合状态抽象来压缩状态空间,同时保留发现强协调行为所需的必要信息。通过近似最大对齐的虚构状态并定义一个分散度度量,该方法识别出能够捕捉智能体之间状态同步模式的选项。实验结果表明,这些选项在各种场景中提供了更强的协调能力,优于其他方法。
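One way to picture the Fermat state and the spreadness measure is through the geometric median, which generalizes the classical Fermat point to many points. The sketch below approximates it with Weiszfeld's iteration and then reports a per-dimension deviation. Reading the Fermat state as a geometric median, and spreadness as the mean absolute deviation from it, are both assumptions made for illustration; the abstract does not give the paper's definitions.

```python
import math


def fermat_state(states, iters=100, eps=1e-9):
    """Approximate the geometric median of agent states (Weiszfeld iteration)."""
    dim = len(states[0])
    # Start from the centroid of the agents' states.
    point = [sum(s[d] for s in states) / len(states) for d in range(dim)]
    for _ in range(iters):
        # Inverse-distance weights; eps guards against division by zero.
        weights = [1.0 / max(math.dist(point, s), eps) for s in states]
        total = sum(weights)
        point = [sum(w * s[d] for w, s in zip(weights, states)) / total
                 for d in range(dim)]
    return point


def spreadness(states, centre):
    """Hypothetical per-dimension misalignment: mean |state - centre|."""
    return [sum(abs(s[d] - centre[d]) for s in states) / len(states)
            for d in range(len(centre))]


# Three agents on a line: the middle agent's state minimizes total distance.
agents = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0)]
centre = fermat_state(agents)
spread = spreadness(agents, centre)
```

In this toy case the agents disagree only along the first dimension, so the second entry of `spread` is zero, which matches the idea of measuring team-level misalignment on each individual state dimension.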
StarFlow: Generating Structured Workflow Outputs From Sketch Images
Authors: Patrice Bechard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian
First: 2025-03-27T18:04:05+00:00 · Latest: 2026-01-12T18:27:42+00:00
Comments: To be presented at EACL2026
Abstract
Workflows are a fundamental component of automation in enterprise platforms, enabling the orchestration of tasks, data processing, and system integrations. Despite being widely used, building workflows can be complex, often requiring manual configuration through low-code platforms or visual programming tools. To simplify this process, we explore the use of generative foundation models, particularly vision-language models (VLMs), to automatically generate structured workflows from visual inputs. Translating hand-drawn sketches or computer-generated diagrams into executable workflows is challenging due to the ambiguity of free-form drawings, variations in diagram styles, and the difficulty of inferring execution logic from visual elements. To address this, we introduce StarFlow, a framework for generating structured workflow outputs from sketches using vision-language models. We curate a diverse dataset of workflow diagrams -- including synthetic, manually annotated, and real-world samples -- to enable robust training and evaluation. We finetune and benchmark multiple vision-language models, conducting a series of ablation studies to analyze the strengths and limitations of our approach. Our results show that finetuning significantly enhances structured workflow generation, outperforming large vision-language models on this task.
中文标题/摘要
标题:StarFlow:从草图图像生成结构化工作流输出
工作流是企业平台自动化的基本组成部分,能够实现任务编排、数据处理和系统集成。尽管被广泛使用,但构建工作流可能很复杂,通常需要通过低代码平台或可视化编程工具进行手动配置。为了简化这一过程,我们探索了使用生成基础模型,特别是视觉语言模型(VLMs),从视觉输入自动生成结构化工作流的方法。将手绘草图或计算机生成的图表转换为可执行的工作流具有挑战性,因为自由形式的绘制具有歧义性,图表风格存在差异,从视觉元素中推断执行逻辑也具有难度。为此,我们引入了StarFlow框架,用于使用视觉语言模型从草图生成结构化工作流输出。我们收集了多样化的流程图数据集,包括合成、手动标注和实际世界样本,以实现稳健的训练和评估。我们对多个视觉语言模型进行了微调和基准测试,并进行了一系列消融研究,以分析我们方法的优势和局限性。我们的结果显示,微调显著提高了结构化工作流生成的效果,在此任务上优于大型视觉语言模型。
Summary / 总结
The paper aims to simplify workflow creation by using generative foundation models, specifically vision-language models (VLMs), to automatically generate structured workflows from sketch images. The authors introduce StarFlow, a framework for generating structured workflow outputs from sketches, and curate a diverse dataset of synthetic, manually annotated, and real-world workflow diagrams to address the ambiguity and stylistic variation of free-form drawings. Experimental results show that finetuning VLMs significantly improves structured workflow generation, outperforming large VLMs on this task.
该论文旨在通过使用生成基础模型,特别是视觉语言模型(VLMs),从草图图像自动生成结构化工作流,以简化工作流的构建过程。文中介绍的StarFlow框架解决了将模糊的手绘草图转换为可执行工作流的挑战。作者构建了一个多样化的数据集,并对多个VLM进行微调,结果显示微调显著提高了结构化工作流的生成效果,在该任务上优于大型VLM。
AgentCompress: Task-Aware Compression for Affordable Large Language Model Agents
Authors: Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam
First: 2026-01-08T18:13:46+00:00 · Latest: 2026-01-12T18:25:18+00:00
Abstract
Large language models hold considerable promise for various applications, but their computational requirements create a barrier that many institutions cannot overcome. A single session using a 70-billion-parameter model can cost around $127 in cloud computing fees, which puts these tools out of reach for organizations operating on limited budgets. We present AgentCompress, a framework that tackles this problem through task-aware dynamic compression. The idea comes from a simple observation: not all tasks require the same computational effort. Complex reasoning, for example, is far more demanding than text reformatting, yet conventional compression applies the same reduction to both. Our approach uses a lightweight neural controller that looks at the first few tokens of each request, estimates how complex the task will be, and sends it to an appropriately quantized version of the model. This routing step adds only about 12 milliseconds of overhead. We tested the framework on 290 multi-stage workflows from domains including computer science, physics, chemistry, and biology. The results show a 68.3% reduction in computational costs while preserving 96.2% of the original success rate. These findings suggest that routing queries intelligently can make powerful language models substantially more affordable without sacrificing output quality.
中文标题/摘要
标题:AgentCompress:面向任务的压缩技术以实现负担得起的大规模语言模型代理
大规模语言模型在各种应用中具有巨大的潜力,但其计算需求构成了许多机构无法逾越的障碍。使用一个包含700亿参数的模型进行一次会话的成本大约为127美元的云计算费用,这使得这些工具对于预算有限的组织来说遥不可及。我们提出了AgentCompress框架,通过任务感知动态压缩来解决这一问题。这一想法源于一个简单的观察:并非所有任务都需要相同的计算量。例如,复杂的推理远比文本重排的计算开销更大,而传统的压缩方法对两者都应用相同的压缩幅度。我们的方法使用一个轻量级的神经控制器,它查看每个请求的前几个标记,估计任务的复杂程度,并将其发送到适当量化版本的模型。这一路由步骤仅增加了大约12毫秒的开销。我们在包括计算机科学、物理学、化学和生物学在内的领域中的290个多阶段工作流上测试了该框架。结果显示,在保持原始成功率96.2%的情况下,计算成本降低了68.3%。这些发现表明,智能路由查询可以显著降低强大语言模型的成本,而不会牺牲输出质量。
Summary / 总结
AgentCompress is a framework that reduces the computational cost of large language models through task-aware dynamic compression. A lightweight neural controller estimates the complexity of each task from the first few tokens of the request and routes it to an appropriately quantized version of the model, adding only about 12 milliseconds of overhead. On 290 multi-stage workflows, the framework achieved a 68.3% reduction in computational costs while preserving 96.2% of the original success rate.
AgentCompress 是一个框架,通过任务感知的动态压缩来解决大规模语言模型的高计算成本问题。它使用一个轻量级的神经控制器根据每个请求的前几个标记来估计任务的复杂性,并将其路由到一个适当量化版本的模型,仅增加约12毫秒的额外延迟。该框架在来自计算机科学、物理学、化学和生物学等多个领域的290个多阶段工作流上进行了测试,实现了68.3%的计算成本降低,同时保持了96.2%的原始成功率。
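The routing idea described above can be illustrated with a small sketch. Everything below is a hypothetical stand-in rather than the authors' implementation: the tier names and relative costs are invented for illustration, and a keyword heuristic takes the place of the learned neural controller, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str             # hypothetical label for a quantized model variant
    relative_cost: float  # illustrative cost relative to the full model


# Assumed tiers, ordered from cheapest to most expensive.
TIERS = [Tier("int4", 0.25), Tier("int8", 0.50), Tier("fp16", 1.00)]

# Keyword heuristic standing in for the learned controller (an assumption).
HARD_HINTS = {"prove", "derive", "debug", "optimize", "why"}
EASY_HINTS = {"reformat", "rename", "capitalize", "convert", "list"}


def estimate_complexity(request: str, prefix_tokens: int = 8) -> float:
    """Score a request in [0, 1] using only its first few whitespace tokens."""
    prefix = [t.strip(".,:;!?").lower() for t in request.split()[:prefix_tokens]]
    if any(t in HARD_HINTS for t in prefix):
        return 0.9
    if any(t in EASY_HINTS for t in prefix):
        return 0.1
    return 0.5


def route(request: str) -> Tier:
    """Send easy requests to heavier quantization, hard ones to the full model."""
    score = estimate_complexity(request)
    if score < 0.33:
        return TIERS[0]
    if score < 0.66:
        return TIERS[1]
    return TIERS[2]


def cost_after_reduction(full_cost: float, reduction: float) -> float:
    """Plain arithmetic: cost remaining after a fractional reduction."""
    return full_cost * (1.0 - reduction)


if __name__ == "__main__":
    for req in ("Reformat this table as CSV", "Prove that the algorithm terminates"):
        print(req, "->", route(req).name)
    # Applying the reported 68.3% reduction to the quoted $127 session,
    # assuming (for illustration only) that it applies uniformly.
    print(round(cost_after_reduction(127.0, 0.683), 2))
```

Reading only a short prefix keeps the routing decision cheap, which is consistent with the small overhead the abstract reports; how the actual controller is trained is not described in the text above.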
Learning Through Dialogue: Unpacking the Dynamics of Human-LLM Conversations on Political Issues
Authors: Shaz Furniturewala, Gerard Christopher Yeo, Kokil Jaidka
First: 2026-01-12T18:10:21+00:00 · Latest: 2026-01-12T18:10:21+00:00
Abstract
Large language models (LLMs) are increasingly used as conversational partners for learning, yet the interactional dynamics supporting users' learning and engagement are understudied. We analyze the linguistic and interactional features from both LLM and participant chats across 397 human-LLM conversations about socio-political issues to identify the mechanisms and conditions under which LLM explanations shape changes in political knowledge and confidence. Mediation analyses reveal that LLM explanatory richness partially supports confidence by fostering users' reflective insight, whereas its effect on knowledge gain operates entirely through users' cognitive engagement. Moderation analyses show that these effects are highly conditional and vary by political efficacy. Confidence gains depend on how high-efficacy users experience and resolve uncertainty. Knowledge gains depend on high-efficacy users' ability to leverage extended interaction, with longer conversations benefiting primarily reflective users. In summary, we find that learning from LLMs is an interactional achievement, not a uniform outcome of better explanations. The findings underscore the importance of aligning LLM explanatory behavior with users' engagement states to support effective learning in designing Human-AI interactive systems.
中文标题/摘要
标题:通过对话学习:政治议题上人-LLM对话动态的剖析
大型语言模型(LLMs)越来越多地被用作学习的对话伙伴,然而支持用户学习和参与的互动动态却研究不足。我们分析了397个人-LLM关于社会政治议题的对话中的语言和互动特征,以识别LLM解释如何影响政治知识和信心的变化机制和条件。中介分析表明,LLM解释的丰富性部分通过促进用户的反思洞察来支持信心,而其对知识获取的影响则完全通过用户的认知参与。调节分析显示,这些影响高度依赖条件并因政治效能的不同而异。信心的提升取决于高效能用户如何体验和解决不确定性。知识的提升取决于高效能用户利用扩展互动的能力,长时间的对话主要有利于反思型用户。总之,我们发现从LLM学习是一种互动成就,而不仅仅是更好的解释的统一结果。研究结果强调了将LLM解释行为与用户参与状态对齐的重要性,以支持设计人机交互系统中的有效学习。
Summary / 总结
The study investigates how large language models (LLMs) facilitate learning through dialogue about socio-political issues, analyzing 397 human-LLM conversations. Mediation analyses show that LLM explanatory richness partially supports users' confidence by fostering reflective insight, while its effect on knowledge gain operates entirely through users' cognitive engagement. These effects are conditional on political efficacy: confidence gains depend on how high-efficacy users experience and resolve uncertainty, and knowledge gains depend on their ability to leverage extended interaction, with longer conversations benefiting primarily reflective users. Overall, the research highlights the importance of aligning LLM explanatory behavior with users' engagement states to support effective learning in human-AI interaction.
研究分析了397场人类与大型语言模型(LLM)关于社会政治问题的对话,探讨了LLM如何通过对话促进学习。研究发现,LLM的解释丰富性通过促进用户的反思洞察来增强其信心,而知识的获得则依赖于用户的认知参与。这些效果具有条件性,高效能用户更能在延长的对话中通过解决不确定性并利用更长的互动来受益,特别是对于反思性用户而言。总体而言,研究强调了在设计人类-人工智能交互系统时,需要根据用户参与状态调整LLM的解释行为,以支持有效的学习。
Vision-Language Model for Accurate Crater Detection
Authors: Patrick Bauer, Marius Schwinning, Florian Renk, Andreas Weinmann, Hichem Snoussi
First: 2026-01-12T18:08:17+00:00 · Latest: 2026-01-12T18:08:17+00:00
Abstract
The European Space Agency (ESA), driven by its ambitions on planned lunar missions with the Argonaut lander, has a profound interest in reliable crater detection, since craters pose a risk to safe lunar landings. This task is usually addressed with automated crater detection algorithms (CDA) based on deep learning techniques. It is non-trivial due to the vast amount of craters of various sizes and shapes, as well as challenging conditions such as varying illumination and rugged terrain. Therefore, we propose a deep-learning CDA based on the OWLv2 model, which is built on a Vision Transformer, that has proven highly effective in various computer vision tasks. For fine-tuning, we utilize a manually labeled dataset from the IMPACT project, which provides crater annotations on high-resolution Lunar Reconnaissance Orbiter Camera Calibrated Data Record images. We insert trainable parameters using a parameter-efficient fine-tuning strategy with Low-Rank Adaptation, and optimize a combined loss function consisting of Complete Intersection over Union (CIoU) for localization and a contrastive loss for classification. We achieve satisfactory visual results, along with a maximum recall of 94.0% and a maximum precision of 73.1% on a test dataset from IMPACT. Our method achieves reliable crater detection across challenging lunar imaging conditions, paving the way for robust crater analysis in future lunar exploration.
中文标题/摘要
标题:用于精确撞击坑检测的视觉-语言模型
欧洲航天局(ESA),因其计划中的阿戈纳特着陆器月球任务而雄心勃勃,对可靠的撞击坑检测有着深刻的兴趣,因为撞击坑对安全的月球着陆构成风险。通常使用基于深度学习技术的自动撞击坑检测算法(CDA)来解决这一任务。由于存在各种大小和形状的撞击坑,以及光照条件和崎岖地形等挑战性条件,这是一项非平凡的任务。因此,我们提出了一种基于OWLv2模型的深度学习CDA,该模型基于视觉变换器,在各种计算机视觉任务中已被证明非常有效。为了微调,我们使用IMPACT项目提供的手动标注数据集,该数据集提供了高分辨率月球轨道器摄像机校准数据记录图像上的撞击坑注释。我们使用参数高效微调策略Low-Rank Adaptation插入可训练参数,并优化了一个由完整交并比(CIoU)用于定位和对比损失用于分类组成的联合损失函数。我们在IMPACT提供的测试数据集上实现了令人满意的视觉效果,最大召回率为94.0%,最大精度为73.1%。我们的方法在具有挑战性的月球成像条件下实现了可靠的撞击坑检测,为未来月球探索中的稳健撞击坑分析铺平了道路。
Summary / 总结
The research aims to develop an accurate crater detection system for lunar missions, addressing the challenges posed by varying crater sizes, shapes, and imaging conditions. The method employs a fine-tuned OWLv2 model, a Vision Transformer-based deep learning algorithm, using a parameter-efficient fine-tuning strategy and a combined loss function. The results show a maximum recall of 94.0% and a maximum precision of 73.1%, demonstrating reliable crater detection under challenging lunar imaging conditions.
论文提出了一种基于OWLv2模型的深度学习撞击坑检测算法,该模型基于Vision Transformer,并采用参数高效策略进行微调。模型在IMPACT项目的手动标注数据集上训练,并在测试数据集上实现了94.0%的最大召回率和73.1%的最大精确率,展示了在具有挑战性的月球成像条件下可靠的撞击坑检测能力。
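For readers unfamiliar with the localization term named in the abstract, here is a minimal sketch of the standard Complete IoU (CIoU) loss for two axis-aligned boxes. It follows the usual published definition (IoU, normalized center distance, and an aspect-ratio consistency term); how the paper weights it against the contrastive classification loss is not stated above, so that combination is left out. The box format `(x1, y1, x2, y2)` is an assumption.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """Standard CIoU loss between a predicted box and a ground-truth box.

    Both boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1 (assumed format).
    """
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between box centers.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))  # near zero for identical boxes
    print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))      # above 1 for disjoint boxes
```

Unlike plain IoU, the center-distance term still gives a useful signal when the predicted and true boxes do not overlap, which matters for small or densely packed targets such as craters.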
Metaphors are a Source of Cross-Domain Misalignment of Large Reasoning Models
Authors: Zhibo Hu, Chen Wang, Yanfeng Shu, Hye-young Paik, Liming Zhu
First: 2026-01-06T19:50:58+00:00 · Latest: 2026-01-12T18:08:06+00:00
Comments: 17 pages, 7 figures
Abstract
Earlier research has shown that metaphors influence human decision making, which raises the question of whether metaphors also influence the reasoning pathways of large language models (LLMs), considering that their training data contain a large number of metaphors. In this work, we investigate the problem within the scope of the emergent misalignment problem, where LLMs can generalize patterns learned from misaligned content in one domain to another domain. We discover a strong causal relationship between metaphors in training data and the misalignment degree of LLMs' reasoning contents. With interventions using metaphors in pre-training, fine-tuning and re-alignment phases, models' cross-domain misalignment degrees change significantly. As we delve deeper into the causes behind this phenomenon, we observe that there is a connection between metaphors and the activation of global and local latent features of large reasoning models. By monitoring these latent features, we design a detector that predicts misaligned content with high accuracy.
中文标题/摘要
标题:隐喻是大型推理模型跨域失准的来源
早期研究显示,隐喻影响人类的决策过程,这引发了关于隐喻是否也影响大型语言模型(LLMs)的推理路径的问题,因为它们的训练数据包含大量隐喻。在本研究中,我们探讨了这一问题在新兴的失准问题范围内的影响,即LLMs可以将一个领域中失准内容学到的模式泛化到另一个领域。我们发现,训练数据中的隐喻与LLMs推理内容的失准程度之间存在强烈的因果关系。通过在预训练、微调和重新对齐阶段使用隐喻进行干预,模型的跨域失准程度发生了显著变化。随着我们深入探究这一现象的原因,我们观察到隐喻与大型推理模型的全局和局部潜在特征的激活之间存在联系。通过监测这些潜在特征,我们设计了一个检测器,能够以高精度预测失准内容。
Summary / 总结
This study explores how metaphors in training data affect the reasoning pathways of large language models (LLMs), contributing to cross-domain misalignment. Working within the emergent misalignment problem, the researchers find a strong causal relationship between metaphors in training data and the degree of misalignment in LLMs' reasoning. Interventions using metaphors during the pre-training, fine-tuning, and re-alignment phases significantly change models' cross-domain misalignment. The study also identifies a link between metaphors and the activation of global and local latent features, and uses these features to build a detector that predicts misaligned content with high accuracy.
研究探讨了训练数据中的隐喻如何影响大型语言模型(LLMs)的推理路径,导致跨域不一致。通过在预训练、微调和重新对齐阶段干预隐喻,研究人员发现模型的不一致程度发生了显著变化。他们还发现隐喻与LLMs中潜在特征的激活有关,并设计了一个检测器来准确预测不一致的内容。
Kinship Data Benchmark for Multi-hop Reasoning
Authors: Tianda Sun, Dimitar Kazakov
First: 2026-01-12T18:07:41+00:00 · Latest: 2026-01-12T18:07:41+00:00
Comments: 11 pages, 2 figures, 9 tables
Abstract
Large language models (LLMs) are increasingly evaluated on their ability to perform multi-hop reasoning, i.e., to combine multiple pieces of information into a coherent inference. We introduce KinshipQA, a benchmark designed to probe this capability through reasoning over kinship relations. The central contribution of our work is a generative pipeline that produces, on demand, large-scale, realistic, and culture-specific genealogical data: collections of interconnected family trees that satisfy explicit marriage constraints associated with different kinship systems. This allows task difficulty, cultural assumptions, and relational depth to be systematically controlled and varied. From these genealogies, we derive textual inference tasks that require reasoning over implicit relational chains. We evaluate the resulting benchmark using six state-of-the-art LLMs, spanning both open-source and closed-source models, under a uniform zero-shot protocol with deterministic decoding. Performance is measured using exact-match and set-based metrics. Our results demonstrate that KinshipQA yields a wide spread of outcomes and exposes systematic differences in multi-hop reasoning across models and cultural settings.
中文标题/摘要
标题:亲属关系数据基准测试用于多跳推理
大型语言模型(LLMs)越来越多地被评估其进行多跳推理的能力,即结合多个信息进行连贯的推理。我们引入了KinshipQA,一个旨在通过处理亲属关系来测试这种能力的基准测试。我们工作的主要贡献是一个生成管道,可以根据需求生成大规模、现实且文化特定的家谱数据:一系列相互连接的家族树集合,满足与不同亲属制度相关的明确婚姻约束。这使得任务难度、文化假设和关系深度可以系统地控制和变化。从这些家谱中,我们推导出需要处理隐含关系链的文本推理任务。我们使用六种最先进的LLM进行评估,这些模型涵盖了开源和闭源模型,采用统一的零样本协议和确定性解码。性能通过精确匹配和基于集合的度量进行评估。我们的结果表明,KinshipQA产生了分布广泛的结果,并揭示了不同模型和文化背景下多跳推理的系统性差异。
Summary / 总结
The research introduces KinshipQA, a benchmark for evaluating multi-hop reasoning in large language models by probing their ability to reason over kinship relations. The method involves generating large-scale, culture-specific genealogical data with explicit marriage constraints. The evaluation shows a wide range of performance across six state-of-the-art models, highlighting differences in their multi-hop reasoning capabilities across various cultural settings.
研究引入了KinshipQA基准,用于评估大型语言模型在处理亲缘关系推理方面的多跳推理能力。方法包括生成大规模、文化特定的族谱数据,并带有明确的婚姻约束。评估结果显示,六种最先进的模型在不同文化背景下的多跳推理能力存在显著差异。
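To make the notion of multi-hop reasoning over kinship relations concrete, here is a toy sketch. The genealogy, the relation helpers, and the choice of set-level F1 as the "set-based" metric are all assumptions made for illustration; the abstract does not specify the benchmark's data format or which set-based metrics it uses.

```python
# Hypothetical toy genealogy: child -> set of parents.
PARENTS = {
    "Cara": {"Ann", "Bob"},
    "Dan": {"Ann", "Bob"},
    "Eve": {"Cara", "Finn"},
}


def parents(person):
    return PARENTS.get(person, set())


def grandparents(person):
    """Two hops: parents of parents."""
    return {gp for p in parents(person) for gp in parents(p)}


def siblings(person):
    """People sharing at least one known parent with `person`."""
    own = parents(person)
    return {c for c, ps in PARENTS.items() if c != person and ps & own}


def aunts_and_uncles(person):
    """Three hops: up to a parent, up again, then down to the parent's siblings."""
    return {s for p in parents(person) for s in siblings(p)}


def exact_match(pred, gold):
    """1.0 only if the predicted set equals the gold set."""
    return float(set(pred) == set(gold))


def set_f1(pred, gold):
    """Set-level F1, one possible set-based metric (an assumption)."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = grandparents("Eve")
    print(sorted(gold), sorted(aunts_and_uncles("Eve")))
    print(exact_match({"Ann"}, gold), set_f1({"Ann"}, gold))
```

A partially correct answer scores zero under exact match but receives partial credit under a set-based metric, which is one reason a benchmark might report both.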
Benchmarking Small Language Models and Small Reasoning Language Models on System Log Severity Classification
Authors: Yahya Masri, Emily Ma, Zifu Wang, Joseph Rogers, Chaowei Yang
First: 2026-01-12T18:02:33+00:00 · Latest: 2026-01-12T18:02:33+00:00
Comments: 28 pages, 5 figures, 7 tables
Abstract
System logs are crucial for monitoring and diagnosing modern computing infrastructure, but their scale and complexity require reliable and efficient automated interpretation. Since severity levels are predefined metadata in system log messages, having a model merely classify them offers limited standalone practical value, revealing little about its underlying ability to interpret system logs. We argue that severity classification is more informative when treated as a benchmark for probing runtime log comprehension rather than as an end task. Using real-world journalctl data from Linux production servers, we evaluate nine small language models (SLMs) and small reasoning language models (SRLMs) under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. The results reveal strong stratification. Qwen3-4B achieves the highest accuracy at 95.64% with RAG, while Gemma3-1B improves from 20.25% under few-shot prompting to 85.28% with RAG. Notably, the tiny Qwen3-0.6B reaches 88.12% accuracy despite weak performance without retrieval. In contrast, several SRLMs, including Qwen3-1.7B and DeepSeek-R1-Distill-Qwen-1.5B, degrade substantially when paired with RAG. Efficiency measurements further separate models: most Gemma and Llama variants complete inference in under 1.2 seconds per log, whereas Phi-4-Mini-Reasoning exceeds 228 seconds per log while achieving <10% accuracy. These findings suggest that (1) architectural design, (2) training objectives, and (3) the ability to integrate retrieved context under strict output constraints jointly determine performance. By emphasizing small, deployable models, this benchmark aligns with real-time requirements of digital twin (DT) systems and shows that severity classification serves as a lens for evaluating model competence and real-time deployability, with implications for root cause analysis (RCA) and broader DT integration.
中文标题/摘要
标题:小型语言模型和小型推理语言模型在系统日志严重性分类中的基准测试
系统日志对于监控和诊断现代计算基础设施至关重要,但其规模和复杂性需要可靠且高效的自动化解释。由于严重性级别是系统日志消息中的预定义元数据,因此仅让模型对其进行分类提供的独立实用价值有限,无法揭示其对系统日志解释能力的真正水平。我们认为,严重性分类更适合作为测试运行时日志理解能力的基准,而不是作为最终任务。使用来自Linux生产服务器的真实journalctl数据,我们评估了九种小型语言模型(SLMs)和小型推理语言模型(SRLMs)在零样本、少样本和检索增强生成(RAG)提示下的表现。结果表明存在明显的分层。Qwen3-4B在使用RAG时达到95.64%的最高准确率,而Gemma3-1B在少样本提示下从20.25%提高到使用RAG时的85.28%。值得注意的是,小型的Qwen3-0.6B尽管在无检索时表现较弱,仍达到了88.12%的准确率。相比之下,包括Qwen3-1.7B和DeepSeek-R1-Distill-Qwen-1.5B在内的几种SRLMs在与RAG配对时表现大幅下降。效率测量进一步区分了模型:大多数Gemma和Llama变体每条日志的推理时间少于1.2秒,而Phi-4-Mini-Reasoning每条日志超过228秒,准确率低于10%。这些发现表明,(1)架构设计,(2)训练目标,以及(3)在严格输出约束下整合检索上下文的能力共同决定了性能。通过强调小型、可部署的模型,该基准与数字孪生(DT)系统的实时要求相一致,并表明严重性分类可以作为评估模型能力和实时可部署性的视角,对根本原因分析(RCA)和更广泛的DT集成具有启示意义。
Summary / 总结
This study evaluates nine small language models (SLMs) and small reasoning language models (SRLMs) on system log severity classification, using real-world journalctl data from Linux production servers under zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting. Qwen3-4B achieves the highest accuracy, 95.64%, with RAG, and even the tiny Qwen3-0.6B reaches 88.12% despite weak performance without retrieval. However, several SRLMs, including Qwen3-1.7B and DeepSeek-R1-Distill-Qwen-1.5B, degrade substantially when paired with RAG, which suggests that the ability to integrate retrieved context under strict output constraints, together with architectural design and training objectives, determines performance. Efficiency also varies widely: most Gemma and Llama variants complete inference in under 1.2 seconds per log, whereas Phi-4-Mini-Reasoning exceeds 228 seconds per log with under 10% accuracy, a gap that matters for the real-time requirements of digital twin systems.
研究使用Linux服务器的真实journalctl数据评估了小型语言模型(SLMs)和小型推理语言模型(SRLMs)在系统日志严重性分类上的表现。研究发现,在零样本、少量样本和检索增强生成(RAG)提示下,模型表现出显著的性能差异,其中Qwen3-4B在RAG下达到最高准确率95.64%。研究还强调了架构设计、训练目标以及整合检索上下文的能力对于性能的重要性,表明严重性分类可以作为评估模型能力和实时部署性的基准,在数字孪生系统中具有重要意义。
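The evaluation setup described above can be sketched as follows. The label set is the standard eight-level syslog severity scale (0 = emerg through 7 = debug), which journald stores in its PRIORITY field; whether the benchmark uses exactly these label strings, and how it parses model output, is an assumption here. The strict parser is meant to illustrate what "strict output constraints" can mean in practice: a verbose answer counts as wrong even when it contains the right word.

```python
# Standard syslog severity names, indexed by numeric priority 0..7.
SYSLOG_SEVERITIES = [
    "emerg", "alert", "crit", "err", "warning", "notice", "info", "debug",
]


def parse_label(model_output: str):
    """Accept only a bare label (ignoring case and surrounding whitespace).

    Returns None for anything else; this strictness is an assumed policy.
    """
    text = model_output.strip().lower()
    return text if text in SYSLOG_SEVERITIES else None


def accuracy(outputs, gold_priorities):
    """Fraction of outputs whose parsed label matches the gold priority."""
    correct = sum(
        parse_label(out) == SYSLOG_SEVERITIES[prio]
        for out, prio in zip(outputs, gold_priorities)
    )
    return correct / len(gold_priorities)


if __name__ == "__main__":
    outputs = ["err", "Info ", "The severity is warning"]
    gold = [3, 6, 4]  # numeric priorities as stored in log metadata
    print(accuracy(outputs, gold))
```

Under a parser like this, a reasoning model that wraps its answer in explanation is penalized, which is one plausible way the degradation reported for some SRLMs could show up in accuracy numbers.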
Beyond Single-Shot: Multi-step Tool Retrieval via Query Planning
Authors: Wei Fang, James Glass
First: 2026-01-12T17:58:39+00:00 · Latest: 2026-01-12T17:58:39+00:00
Abstract
LLM agents operating over massive, dynamic tool libraries rely on effective retrieval, yet standard single-shot dense retrievers struggle with complex requests. These failures primarily stem from the disconnect between abstract user goals and technical documentation, and the limited capacity of fixed-size embeddings to model combinatorial tool compositions. To address these challenges, we propose TOOLQP, a lightweight framework that models retrieval as iterative query planning. Instead of single-shot matching, TOOLQP decomposes instructions into sub-tasks and dynamically generates queries to interact with the retriever, effectively bridging the semantic gap by targeting the specific sub-tasks required for composition. We train TOOLQP using synthetic query trajectories followed by optimization via Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate that TOOLQP achieves state-of-the-art performance, exhibiting superior zero-shot generalization, robustness across diverse retrievers, and significant improvements in downstream agentic execution.
中文标题/摘要
标题:超越单轮:通过查询规划实现多步工具检索
在操作庞大且动态的工具库时,LLM代理依赖有效的检索,但标准的一轮密集检索器在处理复杂请求时存在困难。这些失败主要源于抽象用户目标与技术文档之间的脱节,以及固定大小嵌入的有限能力来建模工具组合。为了解决这些挑战,我们提出了TOOLQP,这是一种轻量级框架,将检索建模为迭代的查询规划。TOOLQP 不是进行一轮匹配,而是将指令分解为子任务,并动态生成查询与检索器交互,通过针对组合所需的特定子任务来有效弥合语义差距。我们使用合成查询轨迹训练TOOLQP,然后通过可验证奖励的强化学习(RLVR)进行优化。实验表明,TOOLQP 达到了最先进的性能,展示了出色的零样本泛化能力、在不同检索器上的鲁棒性以及下游代理执行中的显著改进。
Summary / 总结
The research addresses the limitations of single-shot dense retrievers in handling complex requests for LLM agents operating over large, dynamic tool libraries. It introduces TOOLQP, a framework that models retrieval as iterative query planning, decomposing instructions into sub-tasks and dynamically generating queries. Experiments show that TOOLQP outperforms existing methods, demonstrating better zero-shot generalization and robustness across different retrievers, and enhancing downstream agentic execution.
论文针对大规模动态工具库中LLM代理处理复杂请求时单次检索方法的局限性,提出了TOOLQP框架,将检索建模为迭代查询规划,将指令分解为子任务并动态生成查询。实验表明,TOOLQP在零样本泛化和不同检索器的鲁棒性方面优于现有方法,并提高了下游代理执行的效果。
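The contrast between single-shot retrieval and retrieval through decomposed sub-queries can be shown with a toy sketch. Nothing here reflects TOOLQP's actual components: the tool library is invented, a token-overlap scorer stands in for a dense retriever, and a rule-based splitter stands in for the learned query planner trained with RLVR.

```python
import re

# Hypothetical tool library: tool name -> one-line documentation.
TOOLS = {
    "geo_lookup": "resolve a city name to latitude and longitude coordinates",
    "weather_api": "get weather forecast at a location",
    "send_email": "send an email message to a recipient",
}


def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, k=1):
    """Rank tools by token overlap with the query (stand-in for a dense retriever)."""
    q = tokenize(query)
    ranked = sorted(TOOLS, key=lambda name: (-len(q & tokenize(TOOLS[name])), name))
    return ranked[:k]


def plan_subqueries(instruction):
    """Split an instruction into sub-tasks (rule-based stand-in for a planner)."""
    parts = re.split(r"\s*(?:,|\band then\b|\band\b|\bthen\b)\s*", instruction)
    return [p for p in parts if p]


def multi_step_retrieve(instruction, k=1):
    """Issue one query per sub-task and merge results, preserving order."""
    selected = []
    for sub in plan_subqueries(instruction):
        for tool in retrieve(sub, k):
            if tool not in selected:
                selected.append(tool)
    return selected


if __name__ == "__main__":
    request = "find coordinates for Paris, then get the weather forecast and send an email to Bob"
    print("single-shot:", retrieve(request, k=1))
    print("multi-step: ", multi_step_retrieve(request))
```

In this toy setting, the single query surfaces only the tool whose documentation happens to overlap most with the whole request, while per-sub-task queries recover each tool needed for the composition.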
Enhancing Self-Correction in Large Language Models through Multi-Perspective Reflection
Authors: Mariana Costa, Alberlucia Rafael Soarez, Daniel Kim, Camila Ferreira
First: 2026-01-12T17:57:05+00:00 · Latest: 2026-01-12T17:57:05+00:00
Abstract
While Chain-of-Thought (CoT) prompting advances LLM reasoning, challenges persist in consistency, accuracy, and self-correction, especially for complex or ethically sensitive tasks. Existing single-dimensional reflection methods offer insufficient improvements. We propose MyGO Poly-Reflective Chain-of-Thought (PR-CoT), a novel methodology employing structured multi-perspective reflection. After initial CoT, PR-CoT guides the LLM to self-assess its reasoning across multiple predefined angles: logical consistency, information completeness, biases/ethics, and alternative solutions. Implemented purely via prompt engineering, this process refines the initial CoT into a more robust and accurate final answer without model retraining. Experiments across arithmetic, commonsense, ethical decision-making, and logical puzzles, using GPT-3.5 and GPT-4 models, demonstrate PR-CoT's superior performance. It significantly outperforms traditional CoT and existing reflection methods in logical consistency and error correction, with notable gains in nuanced domains like ethical decision-making. Ablation studies, human evaluations, and qualitative analyses further validate the contribution of each reflection perspective and the overall efficacy of our poly-reflective paradigm in fostering more reliable LLM reasoning.
中文标题/摘要
标题:通过多视角反思增强大型语言模型的自我纠正能力
虽然链式思考(CoT)提示提高了LLM的推理能力,但在一致性、准确性和自我纠正方面仍存在挑战,尤其是在复杂或伦理敏感的任务中。现有的单一维度反思方法提供的改进不足。我们提出了MyGO多视角链式思考(PR-CoT)的新方法,采用结构化的多视角反思。在初始CoT之后,PR-CoT引导LLM从多个预定义的角度自我评估其推理:逻辑一致性、信息完整性、偏见/伦理以及替代解决方案。通过纯粹的提示工程实现,这一过程将初始CoT精炼为更稳健和准确的最终答案,无需对模型进行重新训练。使用GPT-3.5和GPT-4模型在算术、常识、伦理决策和逻辑谜题等领域的实验表明,PR-CoT表现出更优的性能。它在逻辑一致性与错误纠正方面显著优于传统CoT和现有反思方法,在伦理决策等细微领域也取得了显著进步。消融研究、人工评估和定性分析进一步验证了每个反思视角的贡献及其多视角反思范式在促进更可靠LLM推理方面的整体有效性。
Summary / 总结
This study addresses the limitations of Chain-of-Thought (CoT) prompting in large language models (LLMs) by proposing MyGO Poly-Reflective Chain-of-Thought (PR-CoT), which uses structured multi-perspective reflection to enhance reasoning consistency, accuracy, and self-correction. PR-CoT guides the LLM to self-assess its reasoning across logical consistency, information completeness, biases/ethics, and alternative solutions after initial CoT. Experiments with GPT-3.5 and GPT-4 on various tasks show that PR-CoT significantly improves logical consistency and error correction, particularly in ethical decision-making. Ablation studies and human evaluations support the effectiveness of this poly-reflective approach.
研究针对链式思考(CoT)提示在大型语言模型(LLMs)中的局限性,特别是在一致性、准确性和自我纠正方面的问题,尤其是在复杂或伦理敏感任务中的问题。研究引入了MyGO多视角反思链式思考(PR-CoT)方法,该方法引导LLMs从逻辑一致性、信息完整性、偏见/伦理和替代方案等多个角度自我评估其推理。实验表明,PR-CoT在逻辑一致性、错误纠正方面优于传统CoT和现有反思方法,并在伦理决策等复杂领域取得了显著改进。
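Because PR-CoT is described as pure prompt engineering, its control flow can be sketched without any particular model. The four perspectives come from the abstract; the prompt wording, the `llm` callable, and the single-round structure (one draft, one critique per perspective, one revision) are assumptions, since the actual templates are not given above.

```python
from typing import Callable, List

# The four reflection angles named in the abstract.
PERSPECTIVES: List[str] = [
    "logical consistency",
    "information completeness",
    "biases/ethics",
    "alternative solutions",
]


def pr_cot(question: str, llm: Callable[[str], str]) -> str:
    """Draft with CoT, critique from each perspective, then revise.

    `llm` is any function mapping a prompt string to a completion string
    (a hypothetical interface; plug in a real client as needed).
    """
    draft = llm(f"Question: {question}\nLet's think step by step.")

    critiques = []
    for perspective in PERSPECTIVES:
        review = llm(
            f"Question: {question}\nDraft reasoning:\n{draft}\n"
            f"Review the draft strictly for {perspective}. "
            "List concrete problems, or say 'no issues'."
        )
        critiques.append(f"[{perspective}] {review}")

    notes = "\n".join(critiques)
    return llm(
        f"Question: {question}\nDraft reasoning:\n{draft}\n"
        f"Reflection notes:\n{notes}\n"
        "Write a revised final answer that addresses the notes."
    )


if __name__ == "__main__":
    # Echo-style fake model so the sketch runs without network access.
    log = []

    def fake_llm(prompt: str) -> str:
        log.append(prompt)
        return f"reply {len(log)}"

    print(pr_cot("Is 17 prime?", fake_llm), "after", len(log), "calls")
```

In this sketch each question costs six model calls instead of one, a trade-off worth weighing against the reliability gains the abstract reports.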
OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
Authors: Bowen Yang, Kaiming Jin, Zhenyu Wu, Zhaoyang Liu, Qiushi Sun, Zehao Li, JingJing Xie, Zhoumianze Liu, Fangzhi Xu, Kanzhi Cheng, Qingyun Li, Yian Wang, Yu Qiao, Zun Wang, Zichen Ding
First: 2026-01-12T17:55:51+00:00 · Latest: 2026-01-12T17:55:51+00:00
Comments: 31 pages, 11 figures, 12 tables
Abstract
While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.
中文标题/摘要
标题:OS-Symphony:一种全面的鲁棒且通用的计算机使用代理框架
尽管视觉语言模型(VLMs)显著推进了计算机使用代理(CUAs)的发展,但当前框架在长时序工作流程中的鲁棒性和新领域中的泛化能力方面存在局限。这些局限源于对历史视觉上下文编纂缺乏细粒度控制以及缺乏视觉感知的教程检索。为解决这些问题,我们提出了OS-Symphony,一种全面的框架,该框架包含一个协调两个关键创新的协调器,以实现鲁棒自动化:(1)一个反思记忆代理,利用里程碑驱动的长期记忆来实现轨迹级自我纠正,有效缓解长时序任务中的视觉上下文丢失问题;(2)多功能工具代理,配备多模态搜索器,采用“看做-行动”(SeeAct)范式在基于浏览器的沙盒中导航以合成实时、视觉对齐的教程,从而解决未见过场景中的保真度问题。实验结果表明,OS-Symphony在不同模型规模下实现了显著的性能提升,在三个在线基准测试中建立了新的最先进结果,特别是在OSWorld上达到65.84%。
Summary / 总结
The research aims to enhance the robustness and generalization of Computer-Using Agents (CUAs) by addressing limitations of current frameworks built on Vision-Language Models (VLMs). OS-Symphony, a holistic framework, introduces an Orchestrator that coordinates a Reflection-Memory Agent and Versatile Tool Agents. The Reflection-Memory Agent uses milestone-driven long-term memory for trajectory-level self-correction, while the Versatile Tool Agents include a Multimodal Searcher that follows a SeeAct paradigm to synthesize live, visually aligned tutorials in a browser-based sandbox. Experiments show that OS-Symphony sets new state-of-the-art results on three online benchmarks, including 65.84% on OSWorld.
论文通过引入OS-Symphony整体框架来解决当前视觉-语言模型(VLMs)在计算机使用代理(CUAs)中的局限性。该框架包括一个协调反射记忆代理和多功能工具代理的协调器。反射记忆代理使用长期记忆进行长时任务中的自我纠正,而多功能工具代理通过SeeAct范式生成实时、视觉对齐的教程。实验表明,OS-Symphony在多个基准测试中表现出色,特别是在OSWorld基准测试中达到65.84%的性能。
DT-ICU: Towards Explainable Digital Twins for ICU Patient Monitoring via Multi-Modal and Multi-Task Iterative Inference
Authors: Wen Guo
First: 2026-01-12T17:54:19+00:00 · Latest: 2026-01-12T17:54:19+00:00
Abstract
We introduce DT-ICU, a multimodal digital twin framework for continuous risk estimation in intensive care. DT-ICU integrates variable-length clinical time series with static patient information in a unified multitask architecture, enabling predictions to be updated as new observations accumulate over the ICU stay. We evaluate DT-ICU on the large, publicly available MIMIC-IV dataset, where it consistently outperforms established baseline models under different evaluation settings. Our test-length analysis shows that meaningful discrimination is achieved shortly after admission, while longer observation windows further improve the ranking of high-risk patients in highly imbalanced cohorts. To examine how the model leverages heterogeneous data sources, we perform systematic modality ablations, revealing that the model learnt a reasonable structured reliance on interventions, physiological response observations, and contextual information. These analyses provide interpretable insights into how multimodal signals are combined and how trade-offs between sensitivity and precision emerge. Together, these results demonstrate that DT-ICU delivers accurate, temporally robust, and interpretable predictions, supporting its potential as a practical digital twin framework for continuous patient monitoring in critical care. The source code and trained model weights for DT-ICU are publicly available at https://github.com/GUO-W/DT-ICU-release.
中文标题/摘要
标题:DT-ICU:通过多模态和多任务迭代推理实现可解释的ICU患者监测数字孪生
我们介绍了DT-ICU,这是一种多模态数字孪生框架,用于重症监护中的连续风险估计。DT-ICU将临床时间序列变量长度与静态患者信息统一在一个多任务架构中,使预测能够随着ICU住院期间新观察数据的积累而更新。我们在大型公开可用的MIMIC-IV数据集上评估了DT-ICU,结果显示在不同的评估设置下,它始终优于现有的基准模型。我们的测试长度分析表明,在入院后不久就能实现有意义的区分,而更长的观察窗口则进一步提高了在高度不平衡群体中高风险患者的排名。为了检查模型如何利用异构数据源,我们进行了系统的模态消融分析,揭示了模型在干预、生理反应观察和上下文信息方面学习到了合理的结构依赖性。这些分析提供了关于多模态信号如何结合以及灵敏度和精确度之间权衡的可解释见解。综上所述,这些结果表明,DT-ICU能够提供准确、时序稳健和可解释的预测,支持其作为重症监护中连续患者监测实用数字孪生框架的潜力。DT-ICU的源代码和训练模型权重可在https://github.com/GUO-W/DT-ICU-release/ 获取。
Summary / 总结
DT-ICU is a multimodal digital twin framework designed for continuous risk estimation in intensive care units. It integrates clinical time series and static patient information in a multitask architecture, allowing predictions to be updated as new data accumulates. DT-ICU outperforms existing models on the MIMIC-IV dataset, showing meaningful discrimination shortly after admission and improved ranking of high-risk patients with longer observation windows. Modality ablations reveal that the model effectively combines heterogeneous data sources, providing interpretable insights into the integration of multimodal signals and trade-offs between sensitivity and precision.
DT-ICU 是一个多模态数字孪生框架,结合临床时间序列和静态患者数据,用于重症监护中的连续风险评估。它随着新观察数据的积累更新预测,并在 MIMIC-IV 数据集上优于现有模型。该模型在入院后不久就能实现有意义的区分,并且随着观察窗口的延长,进一步提高了高风险患者的排名。模态消融研究表明,模型依赖于干预措施、生理反应和上下文信息,提供了对其功能的可解释洞察。DT-ICU 提供准确、时序稳健且可解释的预测,适用于重症监护中的实际患者监测。
Beyond External Guidance: Unleashing the Semantic Richness Inside Diffusion Transformers for Improved Training
Authors: Lingchen Sun, Rongyuan Wu, Zhengqiang Zhang, Ruibin Li, Yujing Sun, Shuaizheng Liu, Lei Zhang
First: 2026-01-12T17:52:11+00:00 · Latest: 2026-01-12T17:52:11+00:00
Abstract
Recent works such as REPA have shown that guiding diffusion models with external semantic features (e.g., DINO) can significantly accelerate the training of diffusion transformers (DiTs). However, this requires the use of pretrained external networks, introducing additional dependencies and reducing flexibility. In this work, we argue that DiTs actually have the power to guide the training of themselves, and propose Self-Transcendence, a simple yet effective method that achieves fast convergence using internal feature supervision only. It is found that the slow convergence in DiT training primarily stems from the difficulty of representation learning in shallow layers. To address this, we initially train the DiT model by aligning its shallow features with the latent representations from the pretrained VAE for a short phase (e.g., 40 epochs), then apply classifier-free guidance to the intermediate features, enhancing their discriminative capability and semantic expressiveness. These enriched internal features, learned entirely within the model, are used as supervision signals to guide a new DiT training. Compared to existing self-contained methods, our approach brings a significant performance boost. It can even surpass REPA in terms of generation quality and convergence speed, but without the need for any external pretrained models. Our method is not only more flexible for different backbones but also has the potential to be adopted for a wider range of diffusion-based generative tasks. The source code of our method can be found at https://github.com/csslc/Self-Transcendence.
中文标题/摘要
标题:超越外部指导:利用扩散变换器内部的语义丰富性以提高训练效果
近期研究(如REPA)表明,使用外部语义特征(例如DINO)来指导扩散模型,可以显著加速扩散变换器(DiTs)的训练。然而,这种方法需要使用预训练的外部网络,增加了依赖性并降低了灵活性。本文认为DiTs实际上具备自我指导训练的能力,并提出了一种名为自我超越(Self-Transcendence)的简单而有效的方法,仅通过内部特征监督即可实现快速收敛。研究发现,DiT训练中的缓慢收敛主要源于浅层表示学习的困难。为了解决这一问题,我们首先通过将DiT模型的浅层特征与预训练VAE的潜在表示对齐进行短暂训练(例如40个周期),然后对中间特征应用无分类器指导,增强其判别能力和语义表达能力。这些在模型内部学习到的丰富内部特征被用作监督信号,指导新的DiT训练。与现有的自包含方法相比,我们的方法带来了显著的性能提升。在生成质量和收敛速度方面,甚至可以超越REPA,但无需任何外部预训练模型。我们的方法不仅对不同的基础架构更具灵活性,还有潜力应用于更广泛的基于扩散的生成任务。我们的方法的源代码可以在https://github.com/csslc/Self-Transcendence 找到。
Summary / 总结
This work addresses the challenge of slow convergence in training diffusion transformers (DiTs) by proposing Self-Transcendence, a method that uses internal feature supervision to enhance discriminative capabilities. The approach initially aligns shallow features with pretrained VAE representations and then applies classifier-free guidance to intermediate features, leading to faster and more effective training. Compared to existing self-contained methods, this method significantly improves generation quality and convergence speed without relying on external pretrained models.
该研究通过提出Self-Transcendence方法,利用内部特征监督来增强辨别能力,解决了扩散变换器(DiTs)训练收敛慢的问题。该方法首先将浅层特征与预训练的VAE表示对齐,然后对中间特征应用无分类器引导,这些特征被用作进一步训练的监督信号。这种方法在提高训练速度和质量方面显著优于现有的自包含方法,且无需任何外部预训练模型。
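The guidance step described above can be illustrated with the standard classifier-free guidance extrapolation, here applied to a feature vector rather than to a noise prediction. This is a minimal sketch under stated assumptions: the function name `guide_features`, the flat-list feature representation, and the guidance scale are illustrative and are not taken from the paper or its repository.

```python
from typing import List


def guide_features(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Classifier-free guidance extrapolation on features (hypothetical helper).

    Computes uncond + scale * (cond - uncond) element-wise.
    scale = 1.0 returns the conditional features unchanged;
    scale > 1.0 pushes the features further along the conditional direction.
    """
    if len(cond) != len(uncond):
        raise ValueError("conditional and unconditional features must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

For example, `guide_features([1.0, 2.0], [0.0, 1.0], 2.0)` moves each feature twice as far from the unconditional value as the conditional one does. How the paper selects layers or the guidance scale is not stated in the abstract.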
Towards Mitigating Excessive Forgetting in LLM Unlearning via Entanglement-Guidance with Proxy Constraint
Authors: Zhihao Liu, Jian Lou, Yuke Hu, Xiaochen Li, Yitian Chen, Tailun Chen, Zhizhen Qin, Kui Ren, Zhan Qin
First: 2025-08-28T05:45:40+00:00 · Latest: 2026-01-12T17:50:04+00:00
Abstract
Large language models (LLMs) are trained on massive datasets that may include private or copyrighted content. Due to growing privacy and ownership concerns, data owners may request the removal of their data from trained models. Machine unlearning provides a practical solution by removing the influence of specific data without full retraining. However, most existing methods still suffer from over-unlearning due to the lack of a principled mechanism to regulate the forgetting boundary, leading to unnecessary utility degradation and heightened privacy and robustness risks. In this work, we propose EGUP (Entanglement-Guided Unlearning with Proxy Constraint), a novel framework that leverages entanglement and proxy constraint to guide the unlearning process while mitigating over-unlearning. Within each iteration, EGUP employs inter-sample entanglement to adaptively reweight the unlearning strength, assigning greater unlearning efforts to forget samples that are semantically closer to retained knowledge. Across iterations, EGUP leverages intra-sample entanglement to track the representation shift of each forget sample and dynamically adjust its unlearning effort. In addition, we incorporate a proxy constraint that approximates the model's expected outputs after unlearning, forming a reference boundary that softly regularizes the unlearning process. EGUP is compatible with existing gradient-based objectives and serves as a plug-and-play enhancement. We evaluate EGUP on the TOFU and MUSE benchmarks, demonstrating consistent improvements in the unlearning-utility trade-off across multiple LLMs. Moreover, EGUP achieves performance close to the retrained model while remaining scalable and robust.
中文标题/摘要
标题:通过纠缠引导与代理约束缓解大语言模型遗忘学习中的过度遗忘
大型语言模型(LLMs)在大规模数据集上进行训练,这些数据集可能包含私人或受版权保护的内容。由于隐私和所有权问题日益受到关注,数据所有者可能会要求从已训练的模型中删除其数据。机器遗忘(machine unlearning)提供了一种实用的解决方案,无需完全重新训练即可移除特定数据的影响。然而,大多数现有方法由于缺乏调节遗忘边界的原则性机制,仍然存在过度遗忘的问题,导致不必要的效用下降以及更高的隐私和鲁棒性风险。在本工作中,我们提出了EGUP(Entanglement-Guided Unlearning with Proxy Constraint,带代理约束的纠缠引导遗忘),这是一种新颖的框架,利用纠缠和代理约束来引导遗忘过程并减轻过度遗忘。在每次迭代内,EGUP 利用样本间纠缠自适应地重新加权遗忘强度,对在语义上更接近保留知识的遗忘样本分配更大的遗忘力度。在跨迭代过程中,EGUP 利用样本内纠缠跟踪每个遗忘样本的表示偏移,并动态调整其遗忘力度。此外,我们引入了一个代理约束,它近似模型在遗忘后的预期输出,形成一个参考边界,对遗忘过程进行软性正则化。EGUP 与现有的基于梯度的目标兼容,可作为即插即用的增强模块。我们在TOFU和MUSE基准上评估了EGUP,结果表明其在多个LLM上一致地改善了遗忘与效用之间的权衡。此外,EGUP 的性能接近重新训练的模型,同时保持可扩展性和鲁棒性。
Summary / 总结
This paper addresses the issue of excessive forgetting in machine unlearning of large language models (LLMs) by proposing EGUP (Entanglement-Guided Unlearning with Proxy Constraint). EGUP uses entanglement and proxy constraint to guide the unlearning process, adaptively reweighting unlearning efforts based on semantic similarity and dynamically adjusting unlearning efforts over iterations. The method improves the unlearning-utility trade-off and achieves performance close to retraining while maintaining scalability and robustness, as shown in evaluations on TOFU and MUSE benchmarks across multiple LLMs.
本文通过提出EGUP(Entanglement-Guided Unlearning with Proxy Constraint)来解决大型语言模型(LLMs)机器遗忘过程中的过度遗忘问题。EGUP 利用纠缠和代理约束来引导遗忘过程,减轻过度遗忘。它根据语义相似性自适应调整遗忘强度,并根据表示变化动态调整遗忘力度。EGUP 在多个LLM上一致地改善了遗忘与效用之间的权衡,并且性能接近重新训练的模型,同时保持可扩展性和鲁棒性。
Are LLM Decisions Faithful to Verbal Confidence?
Authors: Jiawei Wang, Yanfei Zhou, Siddartha Devic, Deqing Fu
First: 2026-01-12T17:49:51+00:00 · Latest: 2026-01-12T17:49:51+00:00
Abstract
Large Language Models (LLMs) can produce surprisingly sophisticated estimates of their own uncertainty. However, it remains unclear to what extent this expressed confidence is tied to the reasoning, knowledge, or decision making of the model. To test this, we introduce RiskEval: a framework designed to evaluate whether models adjust their abstention policies in response to varying error penalties. Our evaluation of several frontier models reveals a critical dissociation: models are neither cost-aware when articulating their verbal confidence, nor strategically responsive when deciding whether to engage or abstain under high-penalty conditions. Even when extreme penalties render frequent abstention the mathematically optimal strategy, models almost never abstain, resulting in utility collapse. This indicates that calibrated verbal confidence scores may not be sufficient to create trustworthy and interpretable AI systems, as current models lack the strategic agency to convert uncertainty signals into optimal and risk-sensitive decisions.
中文标题/摘要
标题:大规模语言模型的决策是否忠实于口头信心?
大规模语言模型(LLMs)能够对自身的不确定性给出出人意料地精细的估计。然而,尚不清楚这种表达出来的信心与模型的推理、知识或决策有多大关联。为了检验这一点,我们引入了RiskEval:一个用于评估模型是否会根据不同的错误惩罚调整其弃权策略的框架。我们对若干前沿模型的评估揭示了一个关键的脱节:模型在表达口头信心时不具备成本意识,在高惩罚条件下决定作答还是弃权时也缺乏策略性响应。即使极端惩罚使频繁弃权成为数学上的最优策略,模型也几乎从不弃权,导致效用崩溃。这表明,校准的口头信心评分可能不足以创建可信赖且可解释的人工智能系统,因为当前模型缺乏将不确定性信号转化为最优且风险敏感的决策的战略自主性。
Summary / 总结
The study investigates whether the decisions of Large Language Models (LLMs) are consistent with their verbally expressed confidence. It introduces the RiskEval framework to test whether models adjust their abstention policies as error penalties vary. The evaluation finds that models neither adjust their verbal confidence to costs nor abstain strategically under high-penalty conditions, even when frequent abstention is the mathematically optimal strategy, which leads to utility collapse. This suggests that calibrated verbal confidence scores may not be sufficient for trustworthy and interpretable AI systems, as current models lack the strategic agency to turn uncertainty signals into risk-sensitive decisions.
研究旨在探讨大型语言模型(LLMs)表达的口头信心与其实际推理和决策过程之间的关系。研究人员开发了一个名为RiskEval的框架,以评估模型在面对不同错误惩罚时如何调整其决策策略。主要发现表明,LLMs的口头信心与其推理或战略决策并不一致,尤其是在高惩罚条件下。即使频繁避免参与是数学上最优的选择,模型也很少选择避免,导致效用下降。这表明,当前LLMs的校准口头信心评分可能不足以创建值得信赖和可解释的AI系统。
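The statement that extreme penalties make frequent abstention mathematically optimal can be made concrete with a small expected-utility calculation. The sketch below assumes a hypothetical scoring rule (+1 for a correct answer, minus the penalty for a wrong one, 0 for abstaining); the abstract does not state RiskEval's actual scoring, so the rule and all function names are illustrative only.

```python
def expected_utility_of_answering(confidence: float, penalty: float) -> float:
    """Expected score when answering: +1 if right, -penalty if wrong (assumed rule)."""
    return confidence * 1.0 - (1.0 - confidence) * penalty


def should_answer(confidence: float, penalty: float) -> bool:
    """Answer only if it beats abstaining, which scores 0 under the assumed rule."""
    return expected_utility_of_answering(confidence, penalty) > 0.0


def confidence_threshold(penalty: float) -> float:
    """Break-even confidence p solving p - (1 - p) * penalty = 0."""
    return penalty / (1.0 + penalty)
```

Under this rule, a penalty of 9 puts the break-even confidence at 0.9, so a model that is 80% confident should abstain, while the same model should answer when the penalty is 1. A risk-sensitive model would therefore abstain more often as the penalty grows, which is the behaviour the paper reports as missing.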
Contrastive Learning with Narrative Twins for Modeling Story Salience
Authors: Igor Sterner, Alex Lascarides, Frank Keller
First: 2026-01-12T17:48:46+00:00 · Latest: 2026-01-12T17:48:46+00:00
Comments: EACL 2026
Abstract
Understanding narratives requires identifying which events are most salient for a story's progression. We present a contrastive learning framework for modeling narrative salience that learns story embeddings from narrative twins: stories that share the same plot but differ in surface form. Our model is trained to distinguish a story from both its narrative twin and a distractor with similar surface features but different plot. Using the resulting embeddings, we evaluate four narratologically motivated operations for inferring salience (deletion, shifting, disruption, and summarization). Experiments on short narratives from the ROCStories corpus and longer Wikipedia plot summaries show that contrastively learned story embeddings outperform a masked-language-model baseline, and that summarization is the most reliable operation for identifying salient sentences. If narrative twins are not available, random dropout can be used to generate the twins from a single story. Effective distractors can be obtained either by prompting LLMs or, in long-form narratives, by using different parts of the same story.
中文标题/摘要
标题:基于叙事双胞胎的对比学习以建模故事显著性
理解叙事需要识别哪些事件对故事的发展最具显著性。我们提出了一种对比学习框架,用于建模叙事显著性,该框架从叙事双胞胎中学习故事嵌入:这些故事具有相同的情节但表面形式不同。我们的模型被训练以区分一个故事与其叙事双胞胎以及一个具有相似表面特征但不同情节的干扰物。利用生成的嵌入,我们评估了四种基于叙事学动机的操作以推断显著性(删除、移位、中断和总结)。在来自ROCStories语料库的短叙事和来自维基百科的长篇情节摘要上的实验表明,对比学习得到的故事嵌入优于掩码语言模型基线,并且总结是识别显著句子最可靠的操作。如果不存在叙事双胞胎,可以从单个故事中使用随机丢弃生成双胞胎。有效的干扰物可以通过提示大语言模型获得,或者在长篇叙事中,通过使用同一个故事的不同部分获得。
Summary / 总结
The paper aims to develop a method for identifying salient events in narratives by using contrastive learning with narrative twins. Narrative twins are stories with the same plot but different surface forms. The model is trained to distinguish a story from its twin and a distractor. Experiments show that contrastively learned story embeddings outperform a masked-language-model baseline, and summarization is the most effective operation for identifying salient sentences. If twins are not available, random dropout can generate twins, and LLMs or different parts of the same story can provide effective distractors.
研究旨在通过使用一种对比学习框架来识别叙述中最关键的事件,该框架基于具有相同情节但表面形式不同的叙述双胞胎进行训练。模型能够区分一个故事与其双胞胎以及具有相似表面特征但不同情节的干扰物。实验表明,对比学习得到的故事嵌入优于掩码语言模型基线,而摘要化是识别关键句子最有效的方法。如果不存在叙述双胞胎,可以通过随机丢弃生成它们,有效的干扰物可以通过提示大语言模型或在长篇叙述中使用同一故事的不同部分来获得。
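The training signal described above (a story should sit closer to its narrative twin than to a surface-similar distractor) can be sketched with an InfoNCE-style loss over one positive and one negative. This is only an assumed formulation: the abstract does not specify the exact loss, similarity function, or temperature, and the names `cosine` and `contrastive_loss` are illustrative.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor: Sequence[float], twin: Sequence[float],
                     distractor: Sequence[float], temperature: float = 0.1) -> float:
    """InfoNCE-style loss with the twin as positive and the distractor as negative.

    The loss is low when the anchor is more similar to its twin than to the distractor.
    """
    pos = math.exp(cosine(anchor, twin) / temperature)
    neg = math.exp(cosine(anchor, distractor) / temperature)
    return -math.log(pos / (pos + neg))
```

With toy two-dimensional embeddings, an anchor that points in nearly the same direction as its twin and is orthogonal to the distractor yields a loss close to zero, while swapping the twin and distractor yields a large loss.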
LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
Authors: Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Venue: NeurIPS 2025
First: 2025-10-29T08:21:59+00:00 · Latest: 2026-01-12T17:46:52+00:00
Comments: 10 pages, 5 figures, 14 tables, NeurIPS 2025
Abstract
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and link multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIOU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM driven part query refinement strategy. The code will be released here.
中文标题/摘要
标题:LangHOPS:基于语言的层次开放词汇部件分割
我们提出了LangHOPS,这是第一个基于多模态大型语言模型(MLLM)的开放词汇对象部件实例分割框架。给定一张图像,LangHOPS 可以从开放词汇候选类别中联合检测和分割层次化对象和部件实例。与依赖启发式或可学习视觉分组的先前方法不同,我们的方法将对象部件层次结构扎根于语言空间。它将 MLLM 集成到对象部件解析管道中,利用其丰富的知识和推理能力,并在层次结构内链接多粒度概念。我们在多个具有挑战性的场景中评估了LangHOPS,包括领域内和跨数据集对象部件实例分割以及零样本语义分割。LangHOPS 达到了最先进的结果,在 PartImageNet 数据集上的平均精度(AP)分别超越先前方法 5.5%(领域内)和 4.8%(跨数据集),在 ADE20K 中未见过的对象部件上的 mIOU 超越先前方法 2.5%(零样本)。消融研究进一步验证了语言扎根层次结构和 MLLM 驱动部件查询精炼策略的有效性。代码将在此发布。
Summary / 总结
LangHOPS is a framework that uses a Multimodal Large Language Model (MLLM) to perform open-vocabulary object-part instance segmentation. It jointly detects and segments hierarchical object and part instances from open-vocabulary candidate categories. Unlike previous methods, LangHOPS grounds object-part hierarchies in language space and integrates an MLLM to leverage its knowledge and reasoning capabilities. LangHOPS outperforms previous methods by 5.5% AP in-domain and 4.8% AP cross-dataset on PartImageNet, and by 2.5% mIOU on unseen object parts in ADE20K for zero-shot segmentation. Ablation studies confirm the effectiveness of the language-grounded hierarchy and the MLLM-driven part query refinement strategy.
LangHOPS 是一种使用多模态大型语言模型进行开放词汇对象部件实例分割的框架。它可以检测和分割图像中来自多种类别的层次化对象和部件实例。不同于以往依赖视觉分组的方法,LangHOPS 将对象部件层次结构置于语言空间中,并将 MLLM 集成到解析管道中,利用其知识和推理能力。LangHOPS 在 PartImageNet 数据集上的领域内和跨数据集 AP 分别优于先前方法 5.5% 和 4.8%,在 ADE20K 的零样本分割中对未见过的对象部件的 mIOU 提高了 2.5%。消融研究进一步验证了语言导向的层次结构和 MLLM 驱动的部件查询精炼策略的有效性。
Video Evidence to Reasoning: Efficient Video Understanding via Explicit Evidence Grounding
Authors: Yanxiang Huang, Guohua Gao, Zhaoyang Wei, Jianyuan Ni
Venue: ICME 2026
First: 2026-01-12T17:46:10+00:00 · Latest: 2026-01-12T17:46:10+00:00
Comments: 6 pages
Abstract
Large Vision-Language Models (LVLMs) face a fundamental dilemma in video reasoning: they are caught between the prohibitive computational costs of verbose reasoning and the hallucination risks of efficient, ungrounded approaches. To resolve this, we introduce the Chain of Evidence (CoE), a novel framework that architecturally decouples and co-optimizes perceptual grounding and reasoning efficiency. CoE incorporates two core innovations: (1) A lightweight Evidence Grounding Module (EGM) that acts as a query-guided filter, dynamically identifying and extracting a compact set of high-fidelity visual evidence; and (2) An Evidence-Anchoring Protocol optimized via Reinforcement Learning. Crucially, we design a composite reward mechanism that enforces process alignment, compelling the model to strictly reference identified temporal anchors during deduction, thereby mitigating hallucinations. To enable this, we construct CoE-Instruct, a large-scale dataset (164k samples) featuring a novel dual-annotation schema for separate perception and reasoning supervision. Extensive experiments on five benchmarks, including Video-MME, MVBench, and VSI-Bench, demonstrate that CoE-enhanced models establish a new state-of-the-art. They significantly outperform existing methods in accuracy, proving CoE to be a powerful and practical paradigm for reliable video understanding.
中文标题/摘要
标题:视频证据到推理:通过明确的证据关联实现高效视频理解
大型视觉-语言模型(LVLMs)在视频推理中面临一个根本性的困境:它们在冗长推理的高昂计算成本和高效但未关联方法的幻觉风险之间徘徊。为了解决这一问题,我们引入了证据链(CoE),这是一种新颖的框架,通过架构解耦和联合优化感知关联和推理效率。CoE 包含两个核心创新:(1)一种轻量级的证据关联模块(EGM),作为查询引导的过滤器,动态识别并提取一组高保真视觉证据;(2)一种通过强化学习优化的证据锚定协议。关键的是,我们设计了一种复合奖励机制,强制模型在推理过程中严格参考已识别的时间锚点,从而减轻幻觉。为了实现这一点,我们构建了CoE-Instruct,这是一个大规模数据集(164,000个样本),包含一种新的双注释方案,用于分别监督感知和推理。在包括Video-MME、MVBench和VSI-Bench在内的五个基准上的广泛实验表明,增强后的CoE模型建立了新的最先进的水平。它们在准确性上显著优于现有方法,证明CoE是一种强大且实用的可靠视频理解范式。
Summary / 总结
This paper addresses the challenge of efficient video understanding by introducing the Chain of Evidence (CoE) framework, which decouples perceptual grounding and reasoning efficiency. CoE includes a lightweight Evidence Grounding Module (EGM) that dynamically selects relevant visual evidence and an Evidence-Anchoring Protocol optimized via Reinforcement Learning. The composite reward mechanism ensures the model references temporal anchors, reducing hallucinations. Experiments on five benchmarks show that CoE-enhanced models outperform existing methods in accuracy, establishing a new state-of-the-art.
研究提出了Chain of Evidence (CoE)框架,以解决高效视频理解的挑战,该框架将感知接地和推理效率分离。CoE 包含一个轻量级的证据接地模块(EGM),用于过滤和提取相关视觉证据,以及通过强化学习优化的证据锚定协议。研究引入了CoE-Instruct大数据集,用于独立的感知和推理监督。实验表明,CoE增强的模型在多个基准测试中表现出色,显著优于现有方法,证明了该框架在减少幻觉和提高视频理解效率方面的有效性。
Free-RBF-KAN: Kolmogorov-Arnold Networks with Adaptive Radial Basis Functions for Efficient Function Learning
Authors: Shao-Ting Chiu, Siu Wun Cheung, Ulisses Braga-Neto, Chak Shing Lee, Rui Peng Li
First: 2026-01-12T17:45:31+00:00 · Latest: 2026-01-12T17:45:31+00:00
Abstract
Kolmogorov-Arnold Networks (KANs) have shown strong potential for efficiently approximating complex nonlinear functions. However, the original KAN formulation relies on B-spline basis functions, which incur substantial computational overhead due to De Boor's algorithm. To address this limitation, recent work has explored alternative basis functions such as radial basis functions (RBFs) that can improve computational efficiency and flexibility. Yet, standard RBF-KANs often sacrifice accuracy relative to the original KAN design. In this work, we propose Free-RBF-KAN, an RBF-based KAN architecture that incorporates adaptive learning grids and trainable smoothness to close this performance gap. Our method employs freely learnable RBF shapes that dynamically align grid representations with activation patterns, enabling expressive and adaptive function approximation. Additionally, we treat smoothness as a kernel parameter optimized jointly with network weights, without increasing computational complexity. We provide a general universality proof for RBF-KANs, which encompasses our Free-RBF-KAN formulation. Through a broad set of experiments, including multiscale function approximation, physics-informed machine learning, and PDE solution operator learning, Free-RBF-KAN achieves accuracy comparable to the original B-spline-based KAN while delivering faster training and inference. These results highlight Free-RBF-KAN as a compelling balance between computational efficiency and adaptive resolution, particularly for high-dimensional structured modeling tasks.
中文标题/摘要
标题:Free-RBF-KAN:具有自适应径向基函数的柯尔莫哥洛夫-阿诺尔德网络,用于高效函数学习
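柯尔莫哥洛夫-阿诺尔德网络(KANs)在高效逼近复杂非线性函数方面展现出强大潜力。然而,原始KAN依赖B样条基函数,由于De Boor算法而带来较大的计算开销。为了解决这一局限,近期工作探索了径向基函数(RBFs)等替代基函数,以提高计算效率和灵活性。但标准的RBF-KAN相对于原始KAN设计往往会牺牲精度。在本工作中,我们提出了Free-RBF-KAN,这是一种基于RBF的KAN架构,引入自适应学习网格和可训练平滑度来弥补这一性能差距。我们的方法采用可自由学习的RBF形状,使网格表示与激活模式动态对齐,从而实现富有表现力的自适应函数逼近。此外,我们将平滑度视为与网络权重联合优化的核参数,且不增加计算复杂度。我们给出了RBF-KAN的一般通用逼近性证明,该证明涵盖了我们的Free-RBF-KAN。通过一系列广泛的实验,包括多尺度函数逼近、物理信息机器学习和PDE解算子学习,Free-RBF-KAN在准确度上与基于B样条的原始KAN相当,同时提供更快的训练和推理。这些结果突显了Free-RBF-KAN在计算效率和自适应分辨率之间的平衡,特别是在高维结构化建模任务中。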
Summary / 总结
Free-RBF-KAN is a novel RBF-based Kolmogorov-Arnold Network that incorporates adaptive learning grids and trainable smoothness to improve computational efficiency and accuracy. Through various experiments, it achieves comparable accuracy to the original B-spline-based KAN while offering faster training and inference, making it suitable for high-dimensional structured modeling tasks.
Free-RBF-KAN 是一种基于 RBF 的 Kolmogorov-Arnold 网络,通过自适应学习网格和可训练平滑度来提高计算效率和准确性。它动态调整 RBF 形状以与激活模式对齐,相比传统的 B-spline 基础 KAN,训练和推理速度更快。实验表明,Free-RBF-KAN 在保持高精度的同时更为高效,特别适用于高维结构化建模任务。
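The idea of replacing B-spline edge functions with radial basis functions whose centers and widths are free parameters can be sketched as follows. This is a minimal illustration under stated assumptions: the Gaussian kernel, the per-basis width standing in for "trainable smoothness", and the names `rbf_edge` and `kan_layer` are not taken from the paper, and no training loop is shown.

```python
import math
from typing import List, Sequence, Tuple

# One edge function is described by (centers, widths, weights); in a trained
# model all three would be learnable parameters.
EdgeParams = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def rbf_edge(x: float, centers: Sequence[float], widths: Sequence[float],
             weights: Sequence[float]) -> float:
    """Univariate edge function: phi(x) = sum_i w_i * exp(-((x - c_i) / h_i)^2)."""
    return sum(w * math.exp(-((x - c) / h) ** 2)
               for c, h, w in zip(centers, widths, weights))


def kan_layer(inputs: Sequence[float], edges: List[List[EdgeParams]]) -> List[float]:
    """KAN-style layer: output j is the sum over inputs i of edge function phi_{j,i}(x_i).

    edges[j][i] holds the parameters of the edge from input i to output j.
    """
    return [sum(rbf_edge(x, *edges[j][i]) for i, x in enumerate(inputs))
            for j in range(len(edges))]
```

Because each Gaussian is evaluated in closed form, no recursive spline evaluation is needed, which is the kind of cost saving the abstract attributes to RBF bases.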
Structure First, Reason Next: Enhancing a Large Language Model using Knowledge Graph for Numerical Reasoning in Financial Documents
Authors: Aryan Mishra, Akash Anil
First: 2026-01-12T17:39:08+00:00 · Latest: 2026-01-12T17:39:08+00:00
Abstract
Numerical reasoning is an important task in the analysis of financial documents. It helps in understanding and performing numerical predictions with logical conclusions for the given query seeking answers from financial texts. Recently, Large Language Models (LLMs) have shown promising results in multiple Question-Answering (Q-A) systems with the capability of logical reasoning. As documents related to finance often consist of long and complex financial contexts, LLMs appear well-suited for building high-quality automated financial question-answering systems. However, LLMs often face challenges in accurately processing the various numbers within financial reports. Extracting numerical data from unstructured text and semi-structured tables, and reliably performing accurate calculations, remains a significant bottleneck for numerical reasoning in most state-of-the-art LLMs. Recent studies have shown that structured data augmentations, such as Knowledge Graphs (KGs), have notably improved the predictions of LLMs along with logical explanations. Thus, it is an important requirement to consider inherent structured information in financial reports while using LLMs for various financial analytics. This paper proposes a framework to incorporate structured information using KGs along with LLM predictions for numerical reasoning tasks. The KGs are extracted using a proposed schema inherently from the document under processing. We evaluated our proposed framework over the benchmark data FinQA, using an open-source LLM, namely Llama 3.1 8B Instruct. We observed that the proposed framework improved execution accuracy by approximately 12% relative to the vanilla LLM.
中文标题/摘要
标题:结构优先,推理随后:使用知识图谱增强大型语言模型在财务文档中的数值推理能力
数值推理是财务文档分析中的一个重要任务。它有助于理解和进行基于给定查询的数值预测和逻辑结论。近年来,大型语言模型(LLMs)在多个问答(Q-A)系统中展示了令人鼓舞的结果,具备逻辑推理能力。由于与财务相关的文档通常包含长且复杂的财务背景,LLMs 很适合构建高质量的自动化财务问答系统。然而,LLMs 在准确处理财务报告中的各种数字方面经常面临挑战。从非结构化文本和半结构化表格中提取数值数据并可靠地进行准确计算,仍然是大多数先进LLMs在数值推理中的一个重大瓶颈。最近的研究表明,结构化数据增强,如知识图谱(KGs),显著提高了LLMs的预测能力及其逻辑解释。因此,在使用LLMs进行各种财务分析时,考虑财务报告中的固有结构信息是重要要求。本文提出了一种框架,结合KGs和LLM预测来执行数值推理任务。KGs是根据处理文档中提出的一种内在模式进行提取的。我们使用开源LLM Llama 3.1 8B Instruct在基准数据集FinQA上评估了我们提出的方法。我们观察到,与原始LLM相比,所提出的方法在执行准确性上提高了约12%。
Summary / 总结
This paper aims to enhance large language models (LLMs) for numerical reasoning in financial documents by integrating Knowledge Graphs (KGs). The method extracts a KG from each financial document using a proposed schema and combines this structured information with LLM predictions. The proposed framework was evaluated on the FinQA benchmark using Llama 3.1 8B Instruct, improving execution accuracy by approximately 12% relative to the vanilla LLM.
本文提出了一种框架,将知识图谱(KGs)与大型语言模型(LLMs)结合,以解决财务文件中的数值推理问题。动机是提高财务文本中数值预测的准确性。方法是从财务文档中提取结构化信息来增强LLM的预测。关键实验结果表明,与仅使用LLM相比,所提出的方法将执行准确性提高了约12%。
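The benefit of structuring numbers before reasoning over them can be illustrated with a toy triple store. Everything here is hypothetical: the (entity, metric, year) key, the helper names `build_kg` and `percent_change`, and the example values are assumptions for illustration and do not reflect the schema proposed in the paper or any FinQA record.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, int]


def build_kg(rows: Iterable[Tuple[str, str, int, float]]) -> Dict[Key, float]:
    """Index (entity, metric, year, value) rows by (entity, metric, year)."""
    return {(entity, metric, year): value for entity, metric, year, value in rows}


def percent_change(kg: Dict[Key, float], entity: str, metric: str,
                   start_year: int, end_year: int) -> float:
    """Percentage change of a metric between two years, looked up from the store."""
    old = kg[(entity, metric, start_year)]
    new = kg[(entity, metric, end_year)]
    return (new - old) / old * 100.0
```

Once the values are keyed explicitly, a question such as "by what percentage did revenue change from 2019 to 2020?" reduces to two lookups and one arithmetic step, rather than relying on the model to locate and combine numbers from raw text.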