Operationalising the Superficial Alignment Hypothesis via Task Complexity
Authors: Tomás Vergara-Browne, Darshan Patil, Ivan Titov, Siva Reddy, Tiago Pimentel, Marius Mosbach
First: 2026-02-17T18:59:39+00:00 · Latest: 2026-02-17T18:59:39+00:00
Abstract
The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge. The SAH, however, lacks a precise definition, which has led to (i) different and seemingly orthogonal arguments supporting it, and (ii) important critiques to it. We propose a new metric called task complexity: the length of the shortest program that achieves a target performance on a task. In this framework, the SAH simply claims that pre-trained models drastically reduce the complexity of achieving high performance on many tasks. Our definition unifies prior arguments supporting the SAH, interpreting them as different strategies to find such short programs. Experimentally, we estimate the task complexity of mathematical reasoning, machine translation, and instruction following; we then show that these complexities can be remarkably low when conditioned on a pre-trained model. Further, we find that pre-training enables access to strong performances on our tasks, but it can require programs of gigabytes of length to access them. Post-training, on the other hand, collapses the complexity of reaching this same performance by several orders of magnitude. Overall, our results highlight that task adaptation often requires surprisingly little information -- often just a few kilobytes.
中文标题/摘要
标题:通过任务复杂度实现表层对齐假说的操作化
表层对齐假说(SAH)认为,大型语言模型在其预训练过程中学习了大部分知识,而后训练只是将这些知识呈现出来。然而,SAH 缺乏精确的定义,这导致了(i)支持它的不同且看似互不相关的论据,以及(ii)对其的重要批评。我们提出了一种新的度量标准,即任务复杂度:在某项任务上达到目标性能的最短程序的长度。在这种框架下,SAH 仅仅是声称,预训练模型极大地降低了在许多任务上实现高性能的复杂度。我们的定义统一了之前支持 SAH 的论据,将它们解释为寻找此类短程序的不同策略。实验上,我们估计了数学推理、机器翻译和指令遵循的任务复杂度;然后我们展示了在以预训练模型为条件时,这些复杂度可以出奇地低。此外,我们发现预训练使得在我们的任务上获得强性能成为可能,但可能需要长达数吉字节(GB)的程序才能获得。另一方面,后训练将达到相同性能的复杂度降低了几个数量级。总体而言,我们的结果突显了任务适应往往只需要极少的信息,通常只需几千字节。
Summary / 总结
The study operationalizes the superficial alignment hypothesis (SAH) by defining task complexity as the length of the shortest program that achieves a target performance on a task. Estimating this quantity for mathematical reasoning, machine translation, and instruction following, the authors find that pre-training makes strong performance accessible, although the programs that reach it can be gigabytes long, whereas post-training collapses the complexity of reaching the same performance by several orders of magnitude. Conditioned on a pre-trained model, task adaptation often requires only a few kilobytes of information.
研究通过定义任务复杂性为实现目标性能所需最短程序的长度来操作化浅表对齐假设(SAH)。结果显示,预训练显著降低了在数学推理、机器翻译和指令遵循等任务上达到高性能的复杂性,而后续训练进一步将这一复杂性降低了几个数量级。这表明预训练模型能够以极少量的额外信息访问强大的性能。
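The definition in the abstract can be written compactly. The following is a minimal formalization under assumed notation: the symbols $T$ (task), $\delta$ (target performance), $p$ (program), $\theta$ (pre-trained model), and $\mathrm{perf}_T$ are introduced here for illustration and are not taken from the source.

```latex
% Assumed notation: T is a task, \delta a target performance,
% p a program of length |p|, \theta a pre-trained model.
C_\delta(T) = \min_{p \,:\, \mathrm{perf}_T(p) \ge \delta} |p|,
\qquad
C_\delta(T \mid \theta) = \min_{p \,:\, \mathrm{perf}_T(p(\theta)) \ge \delta} |p|
```

Read this way, the SAH is the claim that $C_\delta(T \mid \theta) \ll C_\delta(T)$ for many tasks $T$.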
Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation
Authors: Yuxuan Kuang, Sungjae Park, Katerina Fragkiadaki, Shubham Tulsiani
First: 2026-02-17T18:59:31+00:00 · Latest: 2026-02-17T18:59:31+00:00
Comments: Project page: https://dex4d.github.io/
Abstract
Learning generalist policies capable of accomplishing a plethora of everyday tasks remains an open challenge in dexterous manipulation. In particular, collecting large-scale manipulation data via real-world teleoperation is expensive and difficult to scale. While learning in simulation provides a feasible alternative, designing multiple task-specific environments and rewards for training is similarly challenging. We propose Dex4D, a framework that instead leverages simulation for learning task-agnostic dexterous skills that can be flexibly recomposed to perform diverse real-world manipulation tasks. Specifically, Dex4D learns a domain-agnostic 3D point track conditioned policy capable of manipulating any object to any desired pose. We train this 'Anypose-to-Anypose' policy in simulation across thousands of objects with diverse pose configurations, covering a broad space of robot-object interactions that can be composed at test time. At deployment, this policy can be zero-shot transferred to real-world tasks without finetuning, simply by prompting it with desired object-centric point tracks extracted from generated videos. During execution, Dex4D uses online point tracking for closed-loop perception and control. Extensive experiments in simulation and on real robots show that our method enables zero-shot deployment for diverse dexterous manipulation tasks and yields consistent improvements over prior baselines. Furthermore, we demonstrate strong generalization to novel objects, scene layouts, backgrounds, and trajectories, highlighting the robustness and scalability of the proposed framework.
中文标题/摘要
标题:Dex4D:通用点轨迹策略框架实现模拟到现实的灵巧操作
学习能够完成多种日常任务的一般性策略仍然是灵巧操作领域的开放挑战。特别是,通过现实世界的远程操作收集大规模操作数据既昂贵又难以扩展。虽然在模拟中学习提供了一种可行的替代方案,但设计多个特定任务的环境和奖励进行训练同样具有挑战性。我们提出了Dex4D框架,该框架利用模拟来学习任务无关的灵巧技能,这些技能可以在测试时灵活重组以执行各种现实世界的操作任务。具体而言,Dex4D学习了一种领域无关的3D点轨迹条件策略,该策略能够操作任何物体到任何期望的姿态。我们在数千种具有不同姿态配置的物体上对这种“任意姿态到任意姿态”的策略进行了模拟训练,涵盖了可以在测试时组合的广泛机器人-物体交互空间。在部署时,该策略可以通过仅提示其期望的物体中心点轨迹(从生成的视频中提取)来零样本转移至现实世界的任务,无需微调。在执行过程中,Dex4D使用在线点跟踪进行闭环感知和控制。在模拟和真实机器人上的大量实验表明,我们的方法能够实现多种灵巧操作任务的零样本部署,并且在先前基线方法上取得了持续改进。此外,我们展示了其在新型物体、场景布局、背景和轨迹上的强大泛化能力,突显了所提出框架的鲁棒性和可扩展性。
Summary / 总结
Dex4D is a framework designed to learn task-agnostic dexterous manipulation skills in simulation, which can be flexibly applied to various real-world tasks. It trains a 3D point track policy to manipulate any object to any desired pose across thousands of objects with diverse configurations. During deployment, the policy can be zero-shot transferred to real-world tasks by prompting it with desired object-centric point tracks. Experiments show that Dex4D outperforms previous methods and demonstrates strong generalization to novel objects and scenes.
Dex4D 是一个框架,旨在通过模拟学习通用的灵巧操作技能,这些技能可以灵活应用于各种实际任务。它训练了一个3D点轨迹策略,可以在数千个具有不同配置的对象上操纵任何物体到任何期望的姿态。该策略可以通过提示它所需的物体中心点轨迹在实际任务中零样本转移。实验表明,Dex4D 在性能上优于先前的方法,并且在新物体和场景中表现出强大的泛化能力。
Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence
Authors: Alisa Vinogradova, Vlad Vinogradov, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Roman Doronin, Andrey Doronichev
First: 2026-02-16T18:57:49+00:00 · Latest: 2026-02-17T18:58:56+00:00
Abstract
Bio-pharmaceutical innovation has shifted: many new drug assets now originate outside the United States and are disclosed primarily via regional, non-English channels. Recent data suggests that over 85% of patent filings originate outside the U.S., with China accounting for nearly half of the global total. A growing share of scholarly output is also non-U.S. Industry estimates put China at 30% of global drug development, spanning 1,200+ novel candidates. In this high-stakes environment, failing to surface "under-the-radar" assets creates multi-billion-dollar risk for investors and business development teams, making asset scouting a coverage-critical competition where speed and completeness drive value. Yet today's Deep Research AI agents still lag human experts in achieving high recall discovery across heterogeneous, multilingual sources without hallucination. We propose a benchmarking methodology for drug asset scouting and a tuned, tree-based self-learning Bioptic Agent aimed at complete, non-hallucinated scouting. We construct a challenging completeness benchmark using a multilingual multi-agent pipeline: complex user queries paired with ground-truth assets that are largely outside U.S.-centric radar. To reflect real-deal complexity, we collected screening queries from expert investors, BD, and VC professionals and used them as priors to conditionally generate benchmark queries. For grading, we use LLM-as-judge evaluation calibrated to expert opinions. On this benchmark, our Bioptic Agent achieves 79.7% F1 score, outperforming Claude Opus 4.6 (56.2%), Gemini 3 Pro + Deep Research (50.6%), OpenAI GPT-5.2 Pro (46.6%), Perplexity Deep Research (44.2%), and Exa Websets (26.9%). Performance improves steeply with additional compute, supporting the view that more compute yields better results.
中文标题/摘要
标题:全球搜寻:广泛搜索AI代理在投资、商务发展和竞争情报中的药物资产勘探
生物医药创新已转变:许多新的药物资产现在起源于美国之外,并主要通过区域性的非英语渠道披露。最新数据显示,超过85%的专利申请来自美国之外,其中中国占全球总量的近一半。非美国的学术产出比例也在增加。行业估计显示,中国在全球药物研发中占30%,涵盖1200多种新型候选药物。在这种高风险环境中,未能发现“未被关注”的资产会给投资者和商务发展团队带来数十亿美元的风险,使资产勘探成为一场以覆盖面为关键的竞争,速度和完整性决定价值。然而,在跨异构、多语言来源实现高召回率且无幻觉的发现方面,当前的深度研究AI代理仍落后于人类专家。我们提出了一种药物资产勘探的基准测试方法,并开发了一种调优的树状自学习Bioptic代理,旨在实现完整的非幻觉勘探。我们构建了一个具有挑战性的完整性基准,使用多语言多代理管道:复杂用户查询配以主要在美国中心雷达之外的真实资产。为了反映实际交易的复杂性,我们收集了专家投资者、商务发展和风险投资专业人士的筛查查询,并将其作为先验来有条件地生成基准查询。在评分方面,我们使用按专家意见校准的LLM作为裁判进行评估。在该基准上,我们的Bioptic代理取得了79.7%的F1分数,优于Claude Opus 4.6(56.2%)、Gemini 3 Pro + 深度研究(50.6%)、OpenAI GPT-5.2 Pro(46.6%)、Perplexity深度研究(44.2%)和Exa Websets(26.9%)。随着计算资源的增加,性能显著提升,支持了更多计算资源会带来更好结果的观点。
Summary / 总结
The research addresses the challenge of identifying under-the-radar drug assets that originate outside the U.S. and are disclosed primarily through regional, non-English channels. It introduces the Bioptic Agent, a tuned, tree-based self-learning AI system, to scout drug assets completely and without hallucination. On a benchmark whose queries were generated conditionally on screening queries collected from industry experts, the agent achieved an F1 score of 79.7%, outperforming Claude Opus 4.6 (56.2%) and Gemini 3 Pro + Deep Research (50.6%).
论文针对识别在美国以外地区,尤其是中国等地产生的新药物资产的挑战,这些创新多发生在非英语地区。文中提出了一种名为Bioptic Agent的自学习AI系统,以提高这些资产的发现能力。该系统通过多语言查询管道进行基准测试,并取得了79.7%的F1分数,显著优于Claude Opus等其他系统。这表明AI在提高制药行业资产筛选的速度和完整性方面具有潜力。
stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation
Authors: Lucas Maes, Quentin Le Lidec, Dan Haramati, Nassim Massaudi, Damien Scieur, Yann LeCun, Randall Balestriero
First: 2026-02-09T18:04:22+00:00 · Latest: 2026-02-17T18:58:08+00:00
Abstract
World Models have emerged as a powerful paradigm for learning compact, predictive representations of environment dynamics, enabling agents to reason, plan, and generalize beyond direct experience. Despite recent interest in World Models, most available implementations remain publication-specific, severely limiting their reusability, increasing the risk of bugs, and reducing evaluation standardization. To mitigate these issues, we introduce stable-worldmodel (SWM), a modular, tested, and documented world-model research ecosystem that provides efficient data-collection tools, standardized environments, planning algorithms, and baseline implementations. In addition, each environment in SWM enables controllable factors of variation, including visual and physical properties, to support robustness and continual learning research. Finally, we demonstrate the utility of SWM by using it to study zero-shot robustness in DINO-WM.
中文标题/摘要
标题:stable-worldmodel-v1: 可再现的世界建模研究与评估
世界模型已成为一种强大的范式,用于学习环境动力学的紧凑、预测性表示,使智能体能够推理、规划并超越直接经验进行泛化。尽管最近对世界模型的兴趣增加,但大多数可用实现仍具有出版物特定性,严重限制了其可重用性,增加了错误风险,并降低了评估标准化。为缓解这些问题,我们引入了stable-worldmodel (SWM),这是一个模块化、经过测试和文档化的世界模型研究生态系统,提供了高效的数据收集工具、标准化环境、规划算法和基线实现。此外,SWM 中的每个环境都支持鲁棒性和持续学习研究,允许控制变化因素,包括视觉和物理属性。最后,我们通过使用SWM 研究 DINO-WM 的零样本鲁棒性来展示SWM 的实用性。
CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing
Authors: Zarif Ikram, Arad Firouzkouhi, Stephen Tu, Mahdi Soltanolkotabi, Paria Rashidinejad
First: 2026-02-17T18:58:04+00:00 · Latest: 2026-02-17T18:58:04+00:00
Abstract
A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing the capability constraint via a Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly, even when the base model is not trained to convergence. We make this second-order procedure efficient at the LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.
中文标题/摘要
标题:CrispEdit:低曲率投影的可扩展非破坏性大语言模型编辑
大规模语言模型(LLM)编辑中的一个主要挑战是能力保留:能够成功改变目标行为的方法可能会悄悄地钻编辑代理指标的空子并破坏一般能力,产生类似于代理/奖励黑客行为的退化行为。我们提出了CrispEdit,这是一种可扩展且基于原理的二阶编辑算法,将能力保留视为显式约束,统一并泛化了多种现有的编辑方法。CrispEdit将编辑表述为约束优化,并通过将编辑更新投影到能力损失景观的低曲率子空间中来强制执行约束。CrispEdit的核心在于通过Bregman散度表达能力约束,其二次形式精确地给出了Gauss-Newton海森矩阵,即使基础模型未训练至收敛也是如此。我们使用Kronecker因子近似曲率(K-FAC)和一种新颖的无矩阵(matrix-free)投影器,利用Kronecker结构避免构建大规模投影矩阵,使这种二阶过程在LLM规模下高效运行。在标准模型编辑基准测试中,CrispEdit在各数据集上平均保持能力退化低于1%的同时实现了高编辑成功率,显著优于先前的编辑器。
Summary / 总结
CrispEdit addresses the challenge of preserving capabilities in large language model editing by formulating editing as constrained optimization and projecting edit updates onto the low-curvature subspace of the capability-loss landscape. It uses Bregman divergence to express capability constraints and employs Kronecker-factored approximate curvature and a novel matrix-free projector to maintain efficiency. Experiments show that CrispEdit successfully edits models with minimal capability degradation, averaging less than 1% degradation across datasets, outperforming previous methods.
CrispEdit通过将编辑问题表述为约束优化问题,并将编辑更新投影到能力损失景观的低曲率子空间中来解决大规模语言模型编辑时保持能力的挑战。它使用Bregman散度来表达能力约束,并采用Kronecker因子近似曲率和一种新型的矩阵自由投影器来保持效率。实验结果显示,CrispEdit在保持能力方面表现出色,平均能力退化率低于1%,显著优于先前的方法。
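The projection step described in the abstract can be illustrated on a toy problem. This is a minimal sketch, not the authors' implementation: it uses a small dense matrix and a full eigendecomposition as a stand-in for the capability-loss curvature, whereas the abstract describes K-FAC and a matrix-free projector. The function name `project_to_low_curvature` and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np

def project_to_low_curvature(update, curvature, keep_fraction=0.5):
    """Project `update` onto the span of the lowest-curvature eigenvectors.

    `curvature` is a symmetric PSD matrix standing in for the capability-loss
    Hessian (a toy dense matrix; an assumption made for illustration only).
    """
    eigvals, eigvecs = np.linalg.eigh(curvature)   # eigenvalues in ascending order
    k = max(1, int(len(eigvals) * keep_fraction))  # number of directions kept
    basis = eigvecs[:, :k]                         # columns span the low-curvature subspace
    return basis @ (basis.T @ update)

# Toy example: curvature is large along axis 0 and near zero along axis 1.
H = np.diag([100.0, 0.01])
g = np.array([1.0, 1.0])
projected = project_to_low_curvature(g, H, keep_fraction=0.5)
```

In this toy case the component of the update along the high-curvature axis is removed, so the second-order change in the stand-in capability loss, `projected @ H @ projected`, is much smaller than `g @ H @ g`.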
Stabilizing Test-Time Adaptation of High-Dimensional Simulation Surrogates via D-Optimal Statistics
Authors: Anna Zimmel, Paul Setinek, Gianluca Galletti, Johannes Brandstetter, Werner Zellinger
First: 2026-02-17T18:55:18+00:00 · Latest: 2026-02-17T18:55:18+00:00
Abstract
Machine learning surrogates are increasingly used in engineering to accelerate costly simulations, yet distribution shifts between training and deployment often cause severe performance degradation (e.g., unseen geometries or configurations). Test-Time Adaptation (TTA) can mitigate such shifts, but existing methods are largely developed for lower-dimensional classification with structured outputs and visually aligned input-output relationships, making them unstable for the high-dimensional, unstructured and regression problems common in simulation. We address this challenge by proposing a TTA framework based on storing maximally informative (D-optimal) statistics, which jointly enables stable adaptation and principled parameter selection at test time. When applied to pretrained simulation surrogates, our method yields up to 7% out-of-distribution improvements at negligible computational cost. To the best of our knowledge, this is the first systematic demonstration of effective TTA for high-dimensional simulation regression and generative design optimization, validated on the SIMSHIFT and EngiBench benchmarks.
中文标题/摘要
标题:通过D-最优统计稳定高维模拟代理的测试时适应
机器学习代理在工程中越来越多地用于加速昂贵的模拟,但在训练和部署之间出现的分布变化往往会导致严重的性能下降(例如,未见过的几何形状或配置)。测试时适应(TTA)可以缓解这种变化,但现有的方法主要针对低维分类且具有结构化输出和视觉对齐输入输出关系的问题进行开发,使得它们在模拟中常见的高维、无结构和回归问题上不稳定。我们通过提出基于存储最大信息量(D-最优)统计的TTA框架来应对这一挑战,该框架在测试时同时实现了稳定的适应和原理参数选择。当应用于预训练的模拟代理时,我们的方法在几乎不增加计算成本的情况下,可以提高多达7%的未见过分布性能。据我们所知,这是首次系统地证明有效的高维模拟回归和生成设计优化的测试时适应,并在SIMSHIFT和EngiBench基准上进行了验证。
Summary / 总结
The research aims to improve the performance of machine learning surrogates in engineering simulations by addressing distribution shifts between training and testing. The proposed method uses D-optimal statistics to stabilize test-time adaptation, particularly for high-dimensional regression problems. Experiments show up to 7% out-of-distribution performance improvements with minimal computational overhead.
本文解决了由于训练和部署之间的分布变化导致的工程模拟中机器学习代理性能下降的问题。它提出了一种基于D-最优统计的Test-Time Adaptation (TTA)框架,以稳定高维、非结构化回归问题的适应性。该方法通过高达7%的出域性能提升和几乎无计算成本来实现,并且是首次系统地展示了有效的TTA在高维模拟回归和生成设计优化中的应用,已在SIMSHIFT和EngiBench基准上得到验证。
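The abstract does not say how the D-optimal statistics are chosen or stored. As an illustration of the D-optimality criterion itself (maximizing the determinant of an information matrix), here is a minimal sketch of greedy subset selection over feature vectors. The function name `greedy_d_optimal`, the `ridge` regularizer, and the toy feature matrix are assumptions, not details from the paper.

```python
import numpy as np

def greedy_d_optimal(features, k, ridge=1e-6):
    """Greedily pick k rows of `features` that maximize det(X_S^T X_S + ridge*I)."""
    n, d = features.shape
    info = ridge * np.eye(d)               # regularized information matrix
    selected = []
    for _ in range(k):
        inv = np.linalg.inv(info)
        # Matrix determinant lemma: det(A + x x^T) = det(A) * (1 + x^T A^{-1} x),
        # so the best next row is the one with the largest x^T A^{-1} x.
        gains = np.einsum("ij,jk,ik->i", features, inv, features)
        if selected:
            gains[selected] = -np.inf      # never pick the same row twice
        best = int(np.argmax(gains))
        selected.append(best)
        info += np.outer(features[best], features[best])
    return selected

# Hypothetical candidates: two nearly parallel rows, one orthogonal row, one mixed row.
X = np.array([[1.0, 0.0], [0.9, 0.0], [0.0, 1.0], [0.5, 0.5]])
chosen = greedy_d_optimal(X, k=2)
```

On these candidates the selection keeps the two orthogonal rows, which is the behaviour one would expect from a criterion that rewards covering distinct directions rather than repeating one.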
VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation
Authors: Hui Ren, Yuval Alaluf, Omer Bar Tal, Alexander Schwing, Antonio Torralba, Yael Vinker
First: 2026-02-17T18:55:03+00:00 · Latest: 2026-02-17T18:55:03+00:00
Abstract
Sketching is inherently a sequential process, in which strokes are drawn in a meaningful order to explore and refine ideas. However, most generative models treat sketches as static images, overlooking the temporal structure that underlies creative drawing. We present a data-efficient approach for sequential sketch generation that adapts pretrained text-to-video diffusion models to generate sketching processes. Our key insight is that large language models and video diffusion models offer complementary strengths for this task: LLMs provide semantic planning and stroke ordering, while video diffusion models serve as strong renderers that produce high-quality, temporally coherent visuals. We leverage this by representing sketches as short videos in which strokes are progressively drawn on a blank canvas, guided by text-specified ordering instructions. We introduce a two-stage fine-tuning strategy that decouples the learning of stroke ordering from the learning of sketch appearance. Stroke ordering is learned using synthetic shape compositions with controlled temporal structure, while visual appearance is distilled from as few as seven manually authored sketching processes that capture both global drawing order and the continuous formation of individual strokes. Despite the extremely limited amount of human-drawn sketch data, our method generates high-quality sequential sketches that closely follow text-specified orderings while exhibiting rich visual detail. We further demonstrate the flexibility of our approach through extensions such as brush style conditioning and autoregressive sketch generation, enabling additional controllability and interactive, collaborative drawing.
中文标题/摘要
标题:VideoSketcher:视频模型先验使顺序素描生成多样化
素描本质上是一个顺序过程,在这个过程中,按照有意义的顺序绘制线条以探索和细化想法。然而,大多数生成模型将素描视为静态图像,忽视了创造性绘画所依赖的时间结构。我们提出了一种数据高效的方法,将预训练的文本到视频扩散模型适应以生成素描过程。我们的关键见解是,大型语言模型和视频扩散模型在这一任务中提供了互补的优势:LLMs 提供语义规划和线条顺序,而视频扩散模型作为强大的渲染器,产生高质量、时间上连贯的视觉效果。我们通过将素描表示为短视频来利用这一点,在这些视频中,线条在空白画布上逐步绘制,受文本指定的顺序指令引导。我们引入了一种两阶段微调策略,将线条顺序的学习与素描外观的学习分离。线条顺序使用具有受控时间结构的合成形状组成来学习,而视觉外观则从七个手动撰写的素描过程中提取,这些过程捕捉了全局绘画顺序和单个线条的连续形成。尽管人类绘制的素描数据极其有限,但我们的方法生成了高质量的顺序素描,这些素描紧密遵循文本指定的顺序,同时表现出丰富的视觉细节。我们还通过扩展如笔触风格条件和自回归素描生成,进一步展示了我们方法的灵活性,从而实现更多的可控性和交互式、协作式绘画。
Developing AI Agents with Simulated Data: Why, what, and how?
Authors: Xiaoran Liu, Istvan David
First: 2026-02-17T18:53:27+00:00 · Latest: 2026-02-17T18:53:27+00:00
Abstract
As insufficient data volume and quality remain the key impediments to the adoption of modern subsymbolic AI, techniques of synthetic data generation are in high demand. Simulation offers an apt, systematic approach to generating diverse synthetic data. This chapter introduces the reader to the key concepts, benefits, and challenges of simulation-based synthetic data generation for AI training purposes, and to a reference framework to describe, design, and analyze digital twin-based AI simulation solutions.
中文标题/摘要
标题:使用模拟数据开发AI代理:为什么、什么和怎么做?
由于数据量和质量不足仍然是现代亚符号AI采用的关键障碍,合成数据生成技术需求很高。模拟提供了一种合适的、系统的方法来生成多样化的合成数据。本章向读者介绍了用于AI训练的基于模拟的合成数据生成的关键概念、优势和挑战,以及一个用于描述、设计和分析基于数字孪生的AI模拟解决方案的参考框架。
Avey-B
Authors: Devang Acharya, Mohammad Hammoud
First: 2026-02-17T18:50:40+00:00 · Latest: 2026-02-17T18:50:40+00:00
Abstract
Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
中文标题/摘要
标题:Avey-B
紧凑的预训练双向编码器在计算和内存预算紧张的工业NLP中仍然是核心。它们的有效性源于自注意力机制能够通过序列级并行性提供高质量的双向上下文化。最近,Avey 作为一种自回归、无注意力的替代方案被引入,自然地适用于仅编码器的适应。在本文中,我们重新构想了Avey的仅编码器范式,并对其架构提出了多项创新,包括解耦静态和动态参数化、稳定性导向的规范化和神经压缩。结果显示,这种重新构想的架构在标准的标记分类和信息检索基准测试中优于四种广泛使用的基于Transformer的编码器,同时在长上下文方面更高效地扩展。
Summary / 总结
This paper aims to enhance the Avey model, an autoregressive, attention-free architecture, for use as an encoder-only model in NLP tasks. The authors introduce several architectural innovations such as decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. The reformulated Avey model outperforms four commonly used Transformer-based encoders on standard benchmarks for token-classification and information-retrieval, while also scaling more efficiently to longer sequences.
本文旨在通过将Avey重新构造成编码器-only架构来提高其有效性。作者引入了多个架构创新,包括静态和动态参数的解耦、稳定性导向的规范化和神经压缩。重新构造成的Avey在标准的标记分类和信息检索基准测试中优于四种常用的Transformer编码器,同时在长上下文上更具扩展性。
Decision Quality Evaluation Framework at Pinterest
Authors: Yuqi Tian, Robert Paine, Attila Dobi, Kevin O'Sullivan, Aravindh Manickavasagam, Faisal Farooq
First: 2026-02-17T18:45:55+00:00 · Latest: 2026-02-17T18:45:55+00:00
Abstract
Online platforms require robust systems to enforce content safety policies at scale. A critical component of these systems is the ability to evaluate the quality of moderation decisions made by both human agents and Large Language Models (LLMs). However, this evaluation is challenging due to the inherent trade-offs between cost, scale, and trustworthiness, along with the complexity of evolving policies. To address this, we present a comprehensive Decision Quality Evaluation Framework developed and deployed at Pinterest. The framework is centered on a high-trust Golden Set (GDS) curated by subject matter experts (SMEs), which serves as a ground truth benchmark. We introduce an automated intelligent sampling pipeline that uses propensity scores to efficiently expand dataset coverage. We demonstrate the framework's practical application in several key areas: benchmarking the cost-performance trade-offs of various LLM agents, establishing a rigorous methodology for data-driven prompt optimization, managing complex policy evolution, and ensuring the integrity of policy content prevalence metrics via continuous validation. The framework enables a shift from subjective assessments to a data-driven and quantitative practice for managing content safety systems.
中文标题/摘要
标题:Pinterest的决策质量评估框架
在线平台需要强大的系统来大规模执行内容安全政策。这些系统的关键组成部分是评估由人类代理和大型语言模型(LLMs)做出的审核决策质量的能力。然而,这种评估由于成本、规模和可信度之间的固有权衡,以及不断变化的政策复杂性而具有挑战性。为了解决这个问题,我们提出了一个全面的决策质量评估框架,该框架在Pinterest开发和部署。该框架以由领域专家(SMEs)精心策划的高可信度金集(GDS)为中心,作为真实基准。我们引入了一种自动智能抽样管道,使用倾向得分来高效地扩展数据集覆盖范围。我们展示了该框架在几个关键领域的实际应用:基准测试各种LLM代理的成本-性能权衡,建立数据驱动的提示优化的严格方法,管理复杂的政策演变,并通过持续验证确保政策内容出现率指标的完整性。该框架使内容安全系统的管理从主观评估转变为数据驱动和定量实践。
Summary / 总结
The research aims to develop a robust system for evaluating the quality of moderation decisions in online platforms, particularly focusing on the trade-offs between cost, scale, and trustworthiness. The method involves creating a high-trust Golden Set curated by experts and an automated sampling pipeline using propensity scores. Key findings include the framework's effectiveness in benchmarking LLM agents, optimizing prompts, managing policy evolution, and validating content metrics, shifting from subjective to data-driven assessments.
研究旨在开发一个系统来评估大规模下人类代理和大型语言模型(LLMs)的决策质量。框架引入了一个由专家编纂的高可信度金集和使用倾向得分的自动化采样管道。关键发现包括框架在成本-性能权衡、优化提示、管理政策演变和确保内容出现指标的连续验证方面的有效性。从主观评估转向数据驱动的评估增强了内容安全系统的可信度。
Should You Use Your Large Language Model to Explore or Exploit?
Authors: Keegan Harris, Aleksandrs Slivkins
First: 2025-01-31T23:42:53+00:00 · Latest: 2026-02-17T18:41:00+00:00
Abstract
We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. While previous work has largely studied the ability of LLMs to solve combined exploration-exploitation tasks, we take a more systematic approach and use LLMs to explore and exploit in silos in various (contextual) bandit tasks. We find that reasoning models show the most promise for solving exploitation tasks, although they are still too expensive or too slow to be used in many practical settings. Motivated by this, we study tool use and in-context summarization using non-reasoning models. We find that these mitigations may be used to substantially improve performance on medium-difficulty tasks; however, even then, all LLMs we study perform worse than a simple linear regression, even in non-linear settings. On the other hand, we find that LLMs do help at exploring large action spaces with inherent semantics, by suggesting suitable candidates to explore.
中文标题/摘要
标题:你应该使用大型语言模型进行探索还是利用?
我们评估了当前一代大型语言模型(LLMs)在面对探索-利用权衡时帮助决策代理的能力。虽然以往的工作主要研究LLMs解决结合探索-利用任务的能力,但我们采取了更系统的方法,使用LLMs分别进行探索和利用在各种(上下文)多臂老虎机任务中。我们发现,推理模型在解决利用任务方面最有前景,尽管它们仍然太昂贵或太慢,无法在许多实际环境中使用。受此启发,我们研究了工具使用和上下文总结,使用非推理模型。我们发现,这些缓解措施可以显著提高中等难度任务的性能,然而即使如此,我们研究的所有LLMs的表现仍然不如简单的线性回归,即使在非线性环境中也是如此。另一方面,我们发现,LLMs在探索具有内在语义的大动作空间时确实有所帮助,通过建议合适的探索候选对象。
The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety
Authors: Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova
First: 2026-02-17T18:39:15+00:00 · Latest: 2026-02-17T18:39:15+00:00
Comments: 27 pages, 4 figures
Abstract
Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: we show this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We then resolve this through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm. The dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods, and we hope will further enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.
中文标题/摘要
标题:对齐崩塌的几何学:微调如何破坏安全性
在无害任务上微调已对齐的语言模型会不可预测地削弱安全防护,即使训练数据中不含有害内容且开发人员无敌意意图。我们表明,当前解释认为微调更新应在高维参数空间中与关键的安全方向正交,这种解释提供了虚假的安全感:我们证明这种正交性在梯度下降的动力学下是结构不稳定的,并会崩溃。然后我们通过一种新颖的几何分析解决了这一问题,证明了对齐集中在具有尖锐曲率的低维子空间中,形成了一阶方法无法检测或防御的脆弱结构。虽然初始微调更新确实可以避免这些子空间,但微调损失的曲率会产生二阶加速,系统地将轨迹引导至对齐敏感区域。我们通过对齐不稳定性条件形式化了这一机制,该条件由三个几何属性构成,同时满足时会导致安全性下降。我们的主要结果建立了四次方律:对齐损失随训练时间的四次方增长,由对齐几何的尖锐度和微调任务与关键安全参数之间曲率耦合的强度控制。这些结果揭示了当前安全性范式中的结构盲点。主流的安全微调方法仅解决了这一根本动态问题的初始快照。对齐脆弱性不是需要修补的漏洞;它是梯度下降在弯曲流形上的固有几何属性。我们的结果促使开发曲率感知方法,并希望进一步推动对开放权重模型部署的对齐安全性分析从反应性红队测试转向预测性诊断。
Summary / 总结
The study investigates why fine-tuning aligned language models on benign tasks can unpredictably degrade safety, even without harmful training data. It challenges the prevailing orthogonality hypothesis and introduces a geometric analysis showing that alignment concentrates in low-dimensional subspaces with sharp curvature, leading to alignment loss scaling quartically with training time. The research highlights the need for curvature-aware methods to address the intrinsic geometric property of gradient descent on curved manifolds, suggesting a shift from reactive to predictive diagnostics for alignment safety analysis.
论文研究了即使训练数据中没有有害内容,对齐的语言模型在进行良性任务的微调时为何会意外地降低安全性。它挑战了现有的正交性解释,并引入了一种新的几何分析,表明对齐集中在具有尖锐曲率的低维子空间中,导致对齐损失随训练时间的四次方增长。关键发现是对齐不稳定性条件,该条件识别了三个几何属性,导致安全性下降。这项工作强调了需要使用曲率感知方法来进行安全微调,并推动对开放权重模型部署的预测诊断分析。
Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI
Authors: Ziyan Wang, Longlong Ma
First: 2026-02-09T09:50:12+00:00 · Latest: 2026-02-17T18:26:38+00:00
Abstract
In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that, unlike humans, do not acquire languages via intrinsic causal and self-correction structures and are therefore unable to distinguish impossible languages. The critique is representative of a fundamental challenge to the intellectual foundations of AI: it synthesizes major methodological issues with LLMs and takes an iconic a priori rationalist perspective. We examine this famous critique both through the existing literature in linguistics and psychology and through an experiment probing the capacity of LLMs to learn possible and impossible languages. We constructed a set of syntactically impossible languages by applying certain transformations to English, including reversing whole sentences and adding negation based on word-count parity. Two rounds of controlled experiments were conducted on GPT-2 small models and on long short-term memory (LSTM) models. Statistical analysis (Welch's t-test) shows that GPT-2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p<.001). On the other hand, LSTM models' performance tallies with Chomsky's argument, suggesting the irreplaceable role of the evolution of the transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision within Chomsky's theory towards LLMs, and a shift of theoretical paradigm outside Chomsky, from his "rationalist-romantics" paradigm to functionalism and empiricism in LLM research.
中文标题/摘要
标题:大型语言模型与不可能的语言习得:“虚假承诺”还是对当前AI视角的颠覆
在乔姆斯基的激进批评《ChatGPT的虚假承诺》中,大型语言模型(LLMs)被描述为仅仅是模式预测器,它们无法通过内在因果结构和自我纠正机制像人类那样习得语言,因此无法区分不可能的语言。这代表了对AI智力基础的根本挑战,因为它综合了LLMs方法论中的主要问题,并具有先验理性主义的标志性视角。我们从语言学和心理学的既有文献视角以及一项实验研究的视角,探讨了这一著名批评,该实验研究了LLMs在学习可能和不可能语言方面的能力。我们通过将某些变换应用于英语,构建了一组句法上不可能的语言。这些包括反转整个句子,以及基于词数奇偶性添加否定。分别对GPT-2小型模型和长短期记忆(LSTM)模型进行了两轮受控实验。统计分析(Welch's t检验)显示,GPT-2小型模型在学习所有不可能语言方面的表现均劣于其在可能语言上的表现(p<.001)。另一方面,LSTM模型的表现与乔姆斯基的论点相符,表明了Transformer架构演进不可替代的作用。基于理论分析和实证发现,我们提出了乔姆斯基理论中关于LLMs的新视角,并在乔姆斯基之外提出了理论范式的转变,从他的“理性主义浪漫主义”范式转向LLMs研究中的功能主义和经验主义。
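The two transformations named in the abstract, and the test statistic it reports, can be sketched as follows. This is a minimal sketch under stated assumptions: the abstract does not give the exact negation rule, so appending a marker when the word count is even is a guess, and the function names and the `marker` parameter are hypothetical. `welch_t` computes only the Welch t statistic, not the p-value.

```python
from math import sqrt
from statistics import mean, variance

def reverse_sentence(sentence):
    """Reverse the word order of a whole sentence."""
    return " ".join(reversed(sentence.split()))

def parity_negation(sentence, marker="not"):
    """Append a negation marker when the word count is even.

    Assumption: the parity choice (even) and the marker position (sentence
    end) are illustrative; the source does not specify them.
    """
    words = sentence.split()
    if len(words) % 2 == 0:
        words.append(marker)
    return " ".join(words)

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))

print(reverse_sentence("the cat sat"))       # sat cat the
print(parity_negation("the cat sat down"))   # the cat sat down not
```

Both rules depend on counting or globally reordering words rather than on hierarchical structure, which is what makes the resulting languages "impossible" in the sense the abstract uses.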
Enhancing Building Semantics Preservation in AI Model Training with Large Language Model Encodings
Authors: Suhyung Jang, Ghang Lee, Jaekun Lee, Hyunjun Lee
First: 2026-02-17T18:26:36+00:00 · Latest: 2026-02-17T18:26:36+00:00
Comments: 42nd International Symposium on Automation and Robotics in Construction (ISARC 2025)
Abstract
Accurate representation of building semantics, encompassing both generic object types and specific subtypes, is essential for effective AI model training in the architecture, engineering, construction, and operation (AECO) industry. Conventional encoding methods (e.g., one-hot) often fail to convey the nuanced relationships among closely related subtypes, limiting AI's semantic comprehension. To address this limitation, this study proposes a novel training approach that employs large language model (LLM) embeddings (e.g., OpenAI GPT and Meta LLaMA) as encodings to preserve finer distinctions in building semantics. We evaluated the proposed method by training GraphSAGE models to classify 42 building object subtypes across five high-rise residential building information models (BIMs). Various embedding dimensions were tested, including original high-dimensional LLM embeddings (1,536, 3,072, or 4,096) and 1,024-dimensional compacted embeddings generated via the Matryoshka representation model. Experimental results demonstrated that LLM encodings outperformed the conventional one-hot baseline, with the llama-3 (compacted) embedding achieving a weighted average F1-score of 0.8766, compared to 0.8475 for one-hot encoding. The results underscore the promise of leveraging LLM-based encodings to enhance AI's ability to interpret complex, domain-specific building semantics. As the capabilities of LLMs and dimensionality reduction techniques continue to evolve, this approach holds considerable potential for broad application in semantic elaboration tasks throughout the AECO industry.
中文标题/摘要
标题:使用大型语言模型编码增强建筑语义在AI模型训练中的保存
准确表示建筑语义,包括通用对象类型和特定子类型,对于建筑、工程、施工和运营(AECO)行业的有效AI模型训练至关重要。传统的编码方法(例如one-hot)往往无法传达密切相关的子类型之间的细微关系,限制了AI的语义理解能力。为了解决这一限制,本研究提出了一种新的训练方法,该方法利用大型语言模型(LLM)嵌入(例如OpenAI GPT和Meta LLaMA)作为编码,以保存建筑语义中的细微差异。我们通过训练GraphSAGE模型对五栋高层住宅建筑信息模型(BIM)中的42种建筑对象子类型进行分类,来评估所提出的方法。测试了各种嵌入维度,包括原始高维LLM嵌入(1,536、3,072或4,096)和通过Matryoshka表示模型生成的1,024维紧凑嵌入。实验结果表明,LLM编码优于传统的one-hot基线,其中llama-3(紧凑)嵌入的加权平均F1分数为0.8766,而one-hot编码为0.8475。结果表明,利用基于LLM的编码可以增强AI对复杂、领域特定建筑语义的解释能力。随着LLM能力和降维技术的不断进步,这种方法在AECO行业的语义细化任务中具有广泛的应用潜力。
Summary / 总结
This study aims to improve the representation of building semantics in AI model training by using large language model (LLM) embeddings, which better capture nuanced subtype relationships compared to conventional one-hot encoding methods. The research evaluated GraphSAGE models trained on 42 building object subtypes from five BIMs, using embeddings of varying dimensions. The results showed that LLM encodings, especially the compacted llama-3 embedding, outperformed one-hot encoding, achieving a higher weighted average F1-score of 0.8766.
该研究旨在通过使用大型语言模型(LLM)嵌入来改进建筑语义在AI模型训练中的表示,这些嵌入能够更好地捕捉建筑子类型之间的细微关系,优于传统的one-hot编码。研究评估了使用各种LLM嵌入训练的GraphSAGE模型,包括原始高维嵌入和压缩嵌入,并发现LLM编码优于one-hot编码,其中llama-3压缩嵌入的加权平均F1得分为0.8766,远高于one-hot编码的0.8475。这表明LLM基编码在增强AI对复杂建筑语义的解释能力方面具有潜在的应用价值,特别是在AECO行业中。
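To make the encoding contrast concrete, the sketch below compares one-hot vectors, which are mutually orthogonal and so carry no notion of subtype similarity, with a dense embedding shortened Matryoshka-style by keeping its leading dimensions and re-normalizing. The helper names and the toy vectors are hypothetical, and whether the authors compact their embeddings in exactly this way is an assumption.

```python
import math


def one_hot(index, size):
    """One-hot encoding: a 1.0 at `index`, 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Any two distinct one-hot subtypes are orthogonal.
print(cosine(one_hot(0, 42), one_hot(1, 42)))  # 0.0

# Toy 4-dimensional "embeddings" of two related subtypes (made-up numbers).
door_a = [0.9, 0.1, 0.3, 0.2]
door_b = [0.8, 0.2, 0.1, 0.4]
print(cosine(truncate_and_normalize(door_a, 2), truncate_and_normalize(door_b, 2)))
```

The point of the comparison is only that dense encodings can place related subtypes close together, while one-hot encodings cannot.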
This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
Authors: Jessica Hullman, David Broska, Huaman Sun, Aaron Shaw
First: 2026-02-17T18:18:38+00:00 · Latest: 2026-02-17T18:18:38+00:00
Abstract
A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with statistical adjustments to account for discrepancies between observed and simulated responses. Under explicit assumptions, statistical calibration preserves validity and provides more precise estimates of causal effects at lower cost than experiments that rely solely on human participants. Yet the potential of both approaches depends on how well LLMs approximate the relevant populations. We consider what opportunities are overlooked when researchers focus myopically on substituting LLMs for human participants in a study.
中文标题/摘要
标题:本研究未涉及人类受试者:验证LLM模拟作为行为证据的有效性
越来越多的研究使用大型语言模型(LLMs)作为合成参与者,以在社会科学研究实验中生成低成本且几乎即时的响应。然而,对于此类模拟何时能支持对人类行为的有效推断,指导有限。我们对比了获得因果效应有效估计的两种策略,并澄清了每种策略在探索性研究与验证性研究中的适用条件。启发式方法通过提示工程、模型微调和其他修复策略来减少LLM引起的不准确性,以建立模拟和观察到的人类行为的可互换性。虽然对于许多探索性任务很有用,但启发式方法缺乏通常要求的确认性研究所需的正式统计保证。相比之下,统计校准结合辅助的人类数据和统计调整来弥补观察到的和模拟响应之间的差异。在明确假设下,统计校准保持了有效性,并提供了比仅依赖人类参与者实验更低的成本和更精确的因果效应估计。然而,这两种方法的潜力取决于LLM对相关人群的近似程度。我们探讨了当研究人员狭隘地将LLM作为研究中的替代人类参与者时所忽视的机会。
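One simple form that statistical calibration can take is a difference estimator: average the LLM-simulated responses over a large sample, then correct that average with the mean human-minus-simulated gap measured on a small paired sample. The abstract does not specify the estimator, so the function name and formula below are an assumption chosen for illustration, not the authors' procedure.

```python
from statistics import mean


def calibrated_mean(sim_large, human_small, sim_small):
    """Simulated mean plus a bias correction from paired human data.

    sim_large   : simulated responses on a large set of items
    human_small : human responses on a small set of items
    sim_small   : simulated responses on the same small set
    """
    # Average discrepancy between humans and the simulation.
    correction = mean([h - s for h, s in zip(human_small, sim_small)])
    return mean(sim_large) + correction
```

If the simulation is systematically off by a constant, the correction term removes that offset; how precise the result is depends on how well the simulated responses track the human ones.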
Meteorological data and Sky Images meets Neural Models for Photovoltaic Power Forecasting
Authors: Ines Montoya-Espinagosa, Antonio Agudo
First: 2026-02-17T18:14:15+00:00 · Latest: 2026-02-17T18:14:15+00:00
Comments: CAI 2026
Abstract
With the growing use of renewable energy, and solar energy in particular, as an alternative to traditional sources, there is increasing interest in methodologies for photovoltaic forecasting that can cope with the variability of photovoltaic energy production. This work develops a hybrid approach for short and long-term forecasting based on two studies with the same purpose. A multimodal approach that combines images of the sky and photovoltaic energy history with meteorological data is proposed. The main goal is to improve the accuracy of ramp event prediction, increase the robustness of forecasts in cloudy conditions, and extend capabilities beyond nowcasting, to support more efficient operation of the power grid and better management of solar variability. Deep neural models are used for both nowcasting and forecasting solutions, incorporating individual and multiple meteorological variables, as well as an analytical solar position. The results demonstrate that the inclusion of meteorological data, particularly downward surface long-wave radiation and the combination of wind and solar position, significantly improves current predictions in both nowcasting and forecasting tasks, especially on cloudy days. This study highlights the importance of integrating diverse data sources to improve the reliability and interpretability of solar energy prediction models.
中文标题/摘要
标题:气象数据与天空图像结合神经模型进行光伏功率预测
由于可再生能源尤其是太阳能的使用增加,人们对如何应对光伏能源生产波动性进行预测的兴趣日益浓厚,使用不同的方法。本研究开发了一种基于两个相同目的研究的混合方法,用于短期和长期预测。提出了一种多模态方法,结合天空图像、光伏能源历史和气象数据。主要目标是提高梯度事件预测的准确性,增强在阴天条件下预测的稳健性,并将能力扩展到不仅仅是现在预测,以支持电力系统的更高效运行和更好地管理太阳能的波动性。使用深度神经模型进行现在预测和预测解决方案,结合单个和多个气象变量以及分析性太阳位置。结果表明,气象数据的纳入,特别是地表向下长波辐射以及风和太阳位置的结合,显著提高了现在预测和预测任务中的当前预测,尤其是在阴天的日子里。本研究强调了整合多种数据源以提高太阳能预测模型可靠性和可解释性的重要性。
Summary / 总结
This study aims to improve photovoltaic power forecasting by integrating meteorological data, sky images, and historical photovoltaic energy data using deep neural models. The approach enhances the accuracy of ramp event prediction and robustness in cloudy conditions, supporting efficient grid operation and better management of solar variability. Key findings show that incorporating meteorological variables like surface long-wave radiation and wind-solar position significantly improves predictions, particularly on cloudy days.
该研究旨在通过结合气象数据、天空图像和历史光伏能量数据,使用深度神经模型来提高光伏功率预测的准确性。研究结果显示,将表面长波辐射、风速等气象变量以及太阳位置纳入模型,显著提升了现在预测和长期预测的准确性,尤其是在多云天气下。这种方法支持电网更高效的运行和更好地管理太阳能的波动性。
Neural Scaling Laws for Boosted Jet Tagging
Authors: Matthias Vigl, Nicole Hartman, Michael Kagan, Lukas Heinrich
First: 2026-02-17T18:13:01+00:00 · Latest: 2026-02-17T18:13:01+00:00
Comments: 9 pages, 6 figures
Abstract
The success of Large Language Models (LLMs) has established that scaling compute, through joint increases in model capacity and dataset size, is the primary driver of performance in modern machine learning. While machine learning has long been an integral component of High Energy Physics (HEP) data analysis workflows, the compute used to train state-of-the-art HEP models remains orders of magnitude below that of industry foundation models. With scaling laws only beginning to be studied in the field, we investigate neural scaling laws for boosted jet classification using the public JetClass dataset. We derive compute optimal scaling laws and identify an effective performance limit that can be consistently approached through increased compute. We study how data repetition, common in HEP where simulation is expensive, modifies the scaling yielding a quantifiable effective dataset size gain. We then study how the scaling coefficients and asymptotic performance limits vary with the choice of input features and particle multiplicity, demonstrating that increased compute reliably drives performance toward an asymptotic limit, and that more expressive, lower-level features can raise the performance limit and improve results at fixed dataset size.
中文标题/摘要
标题:神经扩展定律在增强型喷流分类中的应用
大型语言模型(LLMs)的成功表明,通过联合增加模型容量和数据集大小来扩展计算是现代机器学习性能的主要驱动力。虽然机器学习一直是高能物理(HEP)数据分析工作流的重要组成部分,但用于训练先进HEP模型的计算资源仍比行业基础模型低几个数量级。随着扩展定律在该领域的研究刚刚开始,我们使用公共JetClass数据集研究了增强型喷流分类的神经扩展定律。我们推导出计算最优扩展定律,并确定了一个可以通过增加计算能力一致接近的有效性能极限。我们研究了数据重复如何修改扩展,这在HEP中很常见,因为模拟成本高昂,从而量化了有效数据集大小的增益。然后我们研究了扩展系数和渐近性能极限如何随输入特征和粒子多重度的选择而变化,证明了增加计算能力可靠地将性能推向渐近极限,并且更具表现力的低级特征可以提高性能极限并在固定数据集大小下改善结果。
Summary / 总结
This study investigates neural scaling laws for boosted jet classification in High Energy Physics using the JetClass dataset. By increasing compute resources, the research identifies optimal scaling laws and an effective performance limit that can be consistently approached. The study also examines how data repetition, common in HEP, modifies scaling, leading to a quantifiable effective dataset size gain. Additionally, it explores how scaling coefficients and performance limits vary with different input features and particle multiplicity, showing that increased compute drives performance toward an asymptotic limit and that more expressive features can improve results at a fixed dataset size.
该研究利用JetClass数据集探讨了增强型喷流分类的神经网络扩展定律。通过增加计算资源,研究得出了最优扩展定律,并识别出一种可一致接近的性能上限。此外,研究还考察了数据重复对扩展的影响,并发现更具表现力的特征可以提高性能上限并在固定数据集大小下改善结果。
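A scaling law with an asymptotic performance limit is often written as loss(C) = floor + a * C^(-b), where C is training compute. The sketch below fits a and b by least squares in log-log space for a given floor. The functional form, the function name, and treating the floor as known are assumptions for illustration; the abstract does not state the exact parameterization the authors fit.

```python
import math


def fit_power_law(compute, loss, loss_floor):
    """Fit loss = loss_floor + a * compute**(-b); return (a, b).

    Taking logs gives log(loss - loss_floor) = log(a) - b * log(compute),
    a straight line that ordinary least squares can fit.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(l - loss_floor) for l in loss]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic check: data generated with a = 2.0, b = 0.5, floor = 0.1.
compute = [1e3, 1e4, 1e5, 1e6]
loss = [0.1 + 2.0 * c ** -0.5 for c in compute]
a, b = fit_power_law(compute, loss, 0.1)
```

In practice the floor is itself unknown and would be fitted jointly, for example by scanning candidate floors and keeping the one with the smallest residual.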
*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Authors: Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive
First: 2026-02-17T18:10:00+00:00 · Latest: 2026-02-17T18:10:00+00:00
Comments: Under review
Abstract
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
中文标题/摘要
标题:*-PLUIE:个性化评价指标,利用LLM改进评估
自动生成文本的质量评估通常依赖于LLM作为裁判(LLM裁判)的方法。虽然这些方法有效,但它们计算成本高且需要后处理。为了解决这些限制,我们基于ParaPLUIE构建了一个基于困惑度的LLM裁判指标,该指标通过估计“是/否”答案的信心来估算置信度,而不生成文本。我们引入了*-PLUIE,这是ParaPLUIE针对特定任务的提示变体,并评估了它们与人类判断的一致性。我们的实验表明,个性化的*-PLUIE在保持低计算成本的同时,与人类评分的相关性更强。
Summary / 总结
The research aims to improve the evaluation of automatically generated text by addressing the computational inefficiency and post-processing requirements of existing LLM-judge methods. The study builds upon ParaPLUIE, a perplexity-based metric, and introduces *-PLUIE, which are task-specific prompting variants designed to better align with human judgment. The experiments demonstrate that *-PLUIE achieves higher correlations with human ratings while keeping low computational costs.
研究旨在通过解决现有LLM-judge方法的计算效率低和需要后处理的问题,改进自动生成文本的质量评估。该研究基于ParaPLUIE,一种基于困惑度的指标,并引入了*-PLUIE,这是一种针对特定任务的提示变体。实验结果表明,*-PLUIE在保持低计算成本的同时,与人类评分的匹配度更高。
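A score over "Yes/No" answers that needs no text generation can be computed from two log-probabilities the judge model assigns to the candidate answers. The sketch below takes those log-probabilities as plain inputs and returns their difference, plus the equivalent probability of "Yes" when renormalized over the two answers. The function names and the exact scoring formula are assumptions; the abstract does not spell out how ParaPLUIE or *-PLUIE define the score.

```python
import math


def yes_no_score(logp_yes: float, logp_no: float) -> float:
    """Log-ratio of the two answers; positive means "Yes" is preferred."""
    return logp_yes - logp_no


def yes_probability(logp_yes: float, logp_no: float) -> float:
    """P(Yes) renormalized over {Yes, No}: a sigmoid of the log-ratio."""
    return 1.0 / (1.0 + math.exp(-(logp_yes - logp_no)))
```

Because only two forward-pass scores are needed, there is no decoding loop and no free-text answer to parse, which is where the savings in compute and post-processing would come from.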
GlobeDiff: State Diffusion Process for Partial Observability in Multi-Agent Systems
Authors: Yiqin Yang, Xu Yang, Yuhua Jiang, Ni Mu, Hao Hu, Runpeng Xie, Ziyou Zhang, Siyuan Li, Yuan-Hua Ni, Qianchuan Zhao, Bo Xu
Venue: ICLR
First: 2026-02-17T18:05:48+00:00 · Latest: 2026-02-17T18:05:48+00:00
Abstract
In the realm of multi-agent systems, the challenge of partial observability is a critical barrier to effective coordination and decision-making. Existing approaches, such as belief state estimation and inter-agent communication, often fall short. Belief-based methods are limited by their focus on past experiences without fully leveraging global information, while communication methods often lack a robust model to effectively utilize the auxiliary information they provide. To solve this issue, we propose the Global State Diffusion Algorithm (GlobeDiff) to infer the global state based on the local observations. By formulating the state inference process as a multi-modal diffusion process, GlobeDiff overcomes ambiguities in state estimation while simultaneously inferring the global state with high fidelity. We prove that the estimation error of GlobeDiff under both unimodal and multi-modal distributions can be bounded. Extensive experimental results demonstrate that GlobeDiff achieves superior performance and is capable of accurately inferring the global state.
中文标题/摘要
标题:GlobeDiff:全局状态扩散过程在多智能体系统中部分可观测性问题的解决
在多智能体系统领域,部分可观测性是有效协调和决策的关键障碍。现有方法,如信念状态估计和智能体间通信,往往效果不佳。基于信念的方法受限于仅关注过去经验,未能充分利用全局信息,而通信方法则缺乏有效利用辅助信息的稳健模型。为解决这一问题,我们提出了全局状态扩散算法(GlobeDiff),基于局部观测推断全局状态。通过将状态推断过程建模为多模态扩散过程,GlobeDiff克服了状态估计的不确定性,同时以高保真度推断全局状态。我们证明了在单模态和多模态分布下,GlobeDiff的估计误差可以被限制。大量实验结果表明,GlobeDiff性能优越,能够准确推断全局状态。
Summary / 总结
The paper addresses the challenge of partial observability in multi-agent systems by proposing GlobeDiff, a global state diffusion algorithm. By treating state inference as a multi-modal diffusion process, GlobeDiff improves upon existing methods by leveraging global information and reducing estimation errors. Experimental results show that GlobeDiff outperforms other methods in accurately inferring the global state.
论文提出了一种全局状态扩散算法GlobeDiff,以解决多智能体系统中的部分可观测性问题。该算法将状态推理过程建模为多模态扩散过程,以克服不确定性并准确推断全局状态。实验结果表明,GlobeDiff 在性能上优于现有方法,并能够有效地估计全局状态。
Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
Authors: Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu
First: 2026-02-17T18:04:13+00:00 · Latest: 2026-02-17T18:04:13+00:00
Comments: Accepted to ICLR2026
Abstract
Current research in multimodal models faces a key challenge: enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify its likely primary cause as a conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This innovative algorithm re-frames the single-step generation task as a multi-step "generate-understand-regenerate" process. By explicitly leveraging the model's understanding capability during generation, we mitigate the optimization dilemma, achieving stronger generation results and improved understanding ability related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
中文标题/摘要
标题:理解与生成:多模态模型优化困境的导航
当前的多模态模型研究面临一个关键挑战,即增强生成能力往往以牺牲理解能力为代价,反之亦然。我们分析了这种权衡,并认为主要原因是生成与理解之间的潜在冲突,这在模型内部创造了一种竞争动态。为了解决这一问题,我们提出了Reason-Reflect-Refine (R3)框架。这一创新算法将单步生成任务重新构建成“生成-理解-再生”的多步过程。通过明确利用模型的理解能力进行生成,我们成功地缓解了优化困境,实现了更强的生成结果并提高了与生成过程相关的理解能力。这为设计下一代统一多模态模型提供了宝贵的见解。代码可在https://github.com/sen-ye/R3获取。
Summary / 总结
The paper addresses the challenge in multimodal models where enhancing generative capabilities often reduces understanding, and vice versa. It proposes the Reason-Reflect-Refine (R3) framework, which restructures the generation process into a multi-step 'generate-understand-regenerate' cycle. This approach leverages the model's understanding capability during generation, leading to improved generation results and understanding, thus mitigating the optimization dilemma. The R3 framework provides valuable insights for designing next-generation unified multimodal models.
论文探讨了多模态模型中生成能力提升往往会导致理解能力下降,反之亦然的问题。提出了Reason-Reflect-Refine (R3)框架,将生成过程重新构建成‘生成-理解-再生成’的多步循环。这种方法在生成过程中利用模型的理解能力,从而提高了生成结果的质量并增强了与生成过程相关的理解能力。
Advanced Assistance for Traffic Crash Analysis: An AI-Driven Multi-Agent Approach to Pre-Crash Reconstruction
Authors: Gerui Xu, Boyou Chen, Huizhong Guo, Dave LeBlanc, Arpan Kusari, Efe Yarbasi, Ananna Ahmed, Zhaonan Sun, Shan Bao
First: 2025-11-13T23:32:25+00:00 · Latest: 2026-02-17T18:03:17+00:00
Comments: 36 pages, 14 figures
Abstract
Traffic collision reconstruction traditionally relies on human expertise and can be accurate, but pre-crash reconstruction is more challenging. This study develops a multi-agent AI framework that reconstructs pre-crash scenarios and infers vehicle behaviors from fragmented collision data. We propose a two-phase collaborative framework with reconstruction and reasoning stages. The system processes 277 rear-end lead vehicle deceleration (LVD) crashes from the Crash Investigation Sampling System (CISS, 2017 to 2022), integrating narrative reports, structured tabular variables, and scene diagrams. Phase I generates natural-language crash reconstructions from multimodal inputs. Phase II combines these reconstructions with Event Data Recorder (EDR) signals to (1) identify striking and struck vehicles and (2) isolate the EDR records most relevant to the collision moment, enabling inference of key pre-crash behaviors. For validation, we evaluated all LVD cases and emphasized 39 complex crashes where multiple EDR records per crash created ambiguity due to missing or conflicting data. Ground truth was set by consensus of two independent manual annotators, with a separate language model used only to flag potential conflicts for re-checking. The framework achieved 100% accuracy across 4,155 trials; three reasoning models produced identical outputs, indicating that performance is driven by the structured prompts rather than model choice. Research analysts without reconstruction training achieved 92.31% accuracy on the same 39 complex cases. Ablation tests showed that removing structured reasoning anchors reduced case-level accuracy from 99.7% to 96.5% and increased errors across multiple output dimensions. The system remained robust under incomplete inputs. This zero-shot evaluation, without domain-specific training or fine-tuning, suggests a scalable approach for AI-assisted pre-crash analysis.
中文标题/摘要
标题:基于AI驱动多智能体方法的交通碰撞前碰撞重建高级辅助:一种预碰撞重建方法
传统上,交通碰撞重建依赖于人类专业知识,可以很准确,但预碰撞重建更具挑战性。本研究开发了一种多智能体AI框架,用于从碎片化的碰撞数据中重建预碰撞场景并推断车辆行为。我们提出了一种两阶段协作框架,包括重建和推理阶段。系统处理了2017年至2022年CISS的277起追尾前车减速(LVD)碰撞案例,整合了叙述报告、结构化表格变量和场景图。第一阶段从多模态输入生成自然语言碰撞重建。第二阶段将这些重建与事件数据记录器(EDR)信号结合,(1)识别撞击车辆和被撞击车辆,(2)隔离与碰撞时刻最相关的EDR记录,从而推断关键的预碰撞行为。为了验证,我们评估了所有LVD案例,并强调了39起复杂的碰撞案例,其中每起碰撞有多条EDR记录,由于数据缺失或冲突导致存在歧义。事实真相由两名独立的手动注释者达成一致设定,另一个语言模型仅用于标记潜在冲突以供复查。该框架在4,155次试验中实现了100%的准确率;三种推理模型产生了相同的结果,表明性能由结构化提示驱动而非模型选择。未经重建训练的研究分析师在相同的39起复杂案例中达到了92.31%的准确率。消融测试表明,移除结构化推理锚点将案例级准确率从99.7%降低到96.5%,并在多个输出维度上增加了错误。系统在不完整输入下仍保持稳健。这种零样本评估,无需特定领域训练或微调,表明了一种可扩展的AI辅助预碰撞分析方法。
Summary / 总结
This study aims to improve pre-crash reconstruction accuracy by developing an AI-driven multi-agent framework. The framework consists of two phases: generating natural-language crash reconstructions from multimodal inputs and combining these reconstructions with EDR signals to identify vehicles and isolate relevant EDR records. The system achieved 100% accuracy across 4,155 trials and maintained robust performance under incomplete inputs, suggesting a scalable approach for AI-assisted pre-crash analysis.
本研究开发了一种基于多智能体的AI框架,用于交通碰撞中的预碰撞重建。框架分为两个阶段:从多模态输入生成自然语言的碰撞重建,并结合EDR信号来识别车辆并隔离相关的EDR记录。系统在4,155次试验中实现了100%的准确率,并在不完整输入下保持了鲁棒性。消融测试表明,结构化的推理锚点对准确率有显著贡献。未经重建训练的研究分析师在复杂案例上也表现出高准确率,表明该方法具有可扩展性,适用于AI辅助的预碰撞分析。
ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution
Authors: Yahia Alqurnawi, Preetom Biswas, Anmol Rao, Tejas Anvekar, Chitta Baral, Vivek Gupta
First: 2026-02-17T18:01:35+00:00 · Latest: 2026-02-17T18:01:35+00:00
Abstract
Multimodal Large Language Models (mLLMs) are often used to answer questions in structured data such as tables in Markdown, JSON, and images. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution/citation, which is the ability of the models to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution. Although question answering accuracy remains moderate, attribution accuracy is much lower, near random for JSON inputs, across all models. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their usage in applications requiring transparency and traceability.
中文标题/摘要
标题:ViTaB-A:评估多模态大型语言模型在视觉表格归因上的表现
多模态大型语言模型(mLLMs)常用于回答Markdown、JSON和图像中的结构化数据问题。尽管这些模型可以给出正确的答案,但用户也需要知道这些答案来自何处。在本研究中,我们探讨了结构化数据归因/引用,即模型指出支持答案的具体行和列的能力。我们评估了不同表格格式和提示策略下的多种mLLMs。结果显示,问题回答和证据归因之间存在明显差距。尽管问题回答的准确性仍然适中,但归因准确性要低得多,对于JSON输入,几乎所有模型的准确性都接近随机。我们还发现,模型在引用行方面比引用列更可靠,且在文本格式方面比图像更难以处理。最后,我们观察到不同模型家族之间存在显著差异。总体而言,我们的研究结果表明,当前的mLLMs在提供结构化数据的细粒度、可信赖归因方面不可靠,这限制了它们在需要透明性和可追溯性的应用中的使用。
Robot-Assisted Social Dining as a White Glove Service
Authors: Atharva S Kashyap, Ugne Aleksandra Morkute, Patricia Alves-Oliveira
First: 2026-02-17T17:58:25+00:00 · Latest: 2026-02-17T17:58:25+00:00
Comments: 20 pages, 9 figures. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI '26)
Abstract
Robot-assisted feeding enables people with disabilities who require assistance eating to enjoy a meal independently and with dignity. However, existing systems have only been tested in-lab or in-home, leaving in-the-wild social dining contexts (e.g., restaurants) largely unexplored. Designing a robot for such contexts presents unique challenges, such as dynamic and unsupervised dining environments that a robot needs to account for and respond to. Through speculative participatory design with people with disabilities, supported by semi-structured interviews and a custom AI-based visual storyboarding tool, we uncovered ideal scenarios for in-the-wild social dining. Our key insight is that such systems should embody the principles of a white glove service, in which the robot (1) supports multimodal inputs and unobtrusive outputs; (2) has contextually sensitive social behavior and prioritizes the user; (3) has expanded roles beyond feeding; and (4) adapts to other relationships at the dining table. Our work has implications for in-the-wild and group contexts of robot-assisted feeding.
中文标题/摘要
标题:机器人辅助社交餐饮作为白手套服务
机器人辅助喂食使需要他人协助进食的残疾人能够独立且有尊严地享受餐饮。然而,现有的系统仅在实验室或家庭中进行测试,社交餐饮环境(如餐馆)等野外场景尚未被充分探索。为这些环境设计机器人带来了独特的挑战,例如机器人需要适应和应对动态且未监督的餐饮环境。通过与残疾人的推测参与式设计,结合半结构化访谈和自定义基于AI的视觉故事板工具,我们发现了野外社交餐饮的理想场景。我们的主要见解表明,此类系统应体现白手套服务的原则,其中机器人(1)支持多模态输入和不显眼的输出;(2)具有上下文敏感的社会行为并优先考虑用户;(3)扩展其角色,不仅限于喂食;(4)适应餐桌上的其他关系。我们的研究对机器人辅助喂食在野外和群体环境中的应用具有重要意义。
Summary / 总结
This study explores the design of robot-assisted feeding systems for social dining in public settings like restaurants, addressing the limitations of existing systems that have only been tested in labs or homes. Through participatory design and interviews, the researchers identified key features such as unobtrusive multimodal interaction, context-aware social behavior, and expanded roles beyond feeding. The study highlights the need for robots to adapt to dynamic dining environments and prioritize the user's experience, suggesting a 'white glove service' approach for in-the-wild social dining scenarios.
该研究探讨了为公共场合如餐馆提供助餐服务的机器人设计,解决了现有系统仅在实验室或家中测试的局限性。通过参与式设计和访谈,研究人员确定了关键功能,如不显眼的多模态交互、情境感知的社会行为以及超越喂食的扩展角色。研究强调了机器人需要适应动态的就餐环境并优先考虑用户体验,提出了适用于公共就餐场景的“白手套服务”方法。
Random Forests as Statistical Procedures: Design, Variance, and Dependence
Authors: Nathaniel S. O'Connell
First: 2026-02-13T17:08:43+00:00 · Latest: 2026-02-17T17:50:58+00:00
Comments: 27 pages, 2 figures. Supplementary material included
Abstract
Random forests are widely used prediction procedures, yet are typically described algorithmically rather than as statistical designs acting on a fixed set of covariates. We develop a finite-sample, design-based formulation of random forests in which each tree is an explicit randomized conditional regression function. This perspective yields an exact variance identity for the forest predictor that separates finite-aggregation variability from a structural dependence term that persists even under infinite aggregation. We further decompose both single-tree dispersion and inter-tree covariance using the laws of total variance and covariance, isolating two fundamental design mechanisms: reuse of training observations and alignment of data-adaptive partitions. These mechanisms induce a strict covariance floor, demonstrating that predictive variability cannot be eliminated by increasing the number of trees alone. The resulting framework clarifies how resampling, feature-level randomization, and split selection govern resolution, tree variability, and dependence, and establishes random forests as explicit finite-sample statistical designs whose behavior is determined by their underlying randomized construction.
中文标题/摘要
标题:随机森林作为统计程序:设计、方差与依赖性
随机森林广泛用于预测程序,但通常以算法形式而非作为作用于固定协变量集的统计设计来描述。我们发展了一种有限样本的基于设计的随机森林形式化方法,其中每棵树是显式的随机条件回归函数。这种视角导出了森林预测器的确切方差恒等式,将有限聚合的变异性与一个即使在无限聚合下也持续存在的结构依赖项分离开来。我们进一步使用总方差和总协方差定律分解单树分散性和树间协方差,分离出两种基本的设计机制——训练观测值的重用和数据自适应分割的对齐。这些机制诱导了一个严格的协方差下限,表明仅通过增加树的数量无法消除预测变异性。由此形成的框架阐明了抽样、特征级随机化和分裂选择如何控制分辨率、树变异性与依赖性,并确立了随机森林作为明确的有限样本统计设计,其行为由其基础的随机化构造决定。
Summary / 总结
The paper develops a statistical design perspective for random forests, treating each tree as a randomized conditional regression function. This approach provides an exact variance identity that distinguishes finite-aggregation variability from structural dependence. Key findings include the identification of two fundamental design mechanisms, reuse of training observations and alignment of data-adaptive partitions, which induce a strict covariance floor, showing that increasing the number of trees alone does not eliminate predictive variability.
论文旨在从统计学角度重新审视随机森林,将其视为基于设计的程序而非仅仅算法。它发展了一种有限样本的表述方式,其中每个树都是一个随机化的条件回归函数,从而得到了一个精确的方差公式,将聚合变异性和结构性依赖性区分开来。主要发现包括单棵树分散性和树间协方差的分解,揭示了由于基本设计机制(如训练样本的重用和数据自适应分割的对齐)的存在,增加树的数量并不能消除预测变异。
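The covariance floor can be illustrated with the textbook special case of B identically distributed trees, each with variance sigma^2 and pairwise correlation rho, for which the variance of the averaged prediction is rho * sigma^2 + (1 - rho) * sigma^2 / B. This is the classical equicorrelated formula, used here only as an illustration under that simplifying assumption; it is not the paper's exact design-based identity, and the function name is hypothetical.

```python
def forest_variance(tree_var: float, rho: float, n_trees: int) -> float:
    """Variance of the mean of n_trees equicorrelated tree predictions."""
    structural = rho * tree_var                      # persists as n_trees grows
    aggregation = (1.0 - rho) * tree_var / n_trees   # vanishes as n_trees grows
    return structural + aggregation


# The aggregation term shrinks with more trees, but the floor rho * tree_var stays.
for n in (1, 10, 100, 10_000):
    print(n, forest_variance(tree_var=1.0, rho=0.3, n_trees=n))
```

With `rho = 0.3` the printed values decrease toward 0.3 and never go below it, which matches the abstract's statement that adding trees alone cannot eliminate predictive variability.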
GLM-5: from Vibe Coding to Agentic Engineering
Authors: GLM-5 Team, Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, Gengzheng Pan, Hao Zeng, Haoke Zhang, Haoran Wang, Huilong Chen, Jiajie Zhang, Jian Jiao, Jiaqi Guo, Jingsen Wang, Jingzhao Du, Jinzhu Wu, Kedong Wang, Lei Li, Lin Fan, Lucen Zhong, Mingdao Liu, Mingming Zhao, Pengfan Du, Qian Dong, Rui Lu, Shuang-Li, Shulin Cao, Song Liu, Ting Jiang, Xiaodong Chen, Xiaohan Zhang, Xuancheng Huang, Xuezhen Dong, Yabo Xu, Yao Wei, Yifan An, Yilin Niu, Yitong Zhu, Yuanhao Wen, Yukuo Cen, Yushi Bai, Zhongpei Qiao, Zihan Wang, Zikang Wang, Zilin Zhu, Ziqiang Liu, Zixuan Li, Bojie Wang, Bosi Wen, Can Huang, Changpeng Cai, Chao Yu, Chen Li, Chen Li, Chenghua Huang, Chengwei Hu, Chenhui Zhang, Chenzheng Zhu, Congfeng Yin, Daoyan Lin, Dayong Yang, Di Wang, Ding Ai, Erle Zhu, Fangzhou Yi, Feiyu Chen, Guohong Wen, Hailong Sun, Haisha Zhao, Haiyi Hu, Hanchen Zhang, Hanrui Liu, Hanyu Zhang, Hao Peng, Hao Tai, Haobo Zhang, He Liu, Hongwei Wang, Hongxi Yan, Hongyu Ge, Huan Liu, Huan Liu, Huanpeng Chu, Jia'ni Zhao, Jiachen Wang, Jiajing Zhao, Jiamin Ren, Jiapeng Wang, Jiaxin Zhang, Jiayi Gui, Jiayue Zhao, Jijie Li, Jing An, Jing Li, Jingwei Yuan, Jinhua Du, Jinxin Liu, Junkai Zhi, Junwen Duan, Kaiyue Zhou, Kangjian Wei, Ke Wang, Keyun Luo, Laiqiang Zhang, Leigang Sha, Liang Xu, Lindong Wu, Lintao Ding, Lu Chen, Minghao Li, Nianyi Lin, Pan Ta, Qiang Zou, Rongjun Song, Ruiqi Yang, Shangqing Tu, Shangtong Yang, Shaoxiang Wu, Shengyan Zhang, Shijie Li, Shuang Li, Shuyi Fan, Wei Qin, Wei Tian, Weining Zhang, Wenbo Yu, Wenjie Liang, Xiang Kuang, Xiangmeng Cheng, Xiangyang Li, Xiaoquan Yan, Xiaowei Hu, Xiaoying Ling, Xing Fan, Xingye Xia, Xinyuan Zhang, Xinze Zhang, Xirui Pan, Xunkai Zhang, Yandong Wu, Yanfu Li, Yidong Wang, Yifan Zhu, Yijun Tan, Yilin Zhou, Yiming Pan, Ying Zhang, Yinpei Su, Yipeng Geng, Yipeng Geng, Yong Yan, Yonglin Tan, Yuean Bi, Yuhan Shen, Yuhao Yang, Yujiang Li, Yunan Liu, Yunqing Wang, Yuntao Li, Yurong Wu, Yutao Zhang, Yuxi Duan, Yuxuan Zhang, Zezhen Liu, Zhengtao Jiang, Zhenhe Yan, Zheyu Zhang, Zhixiang Wei, Zhuo Chen, Zhuoer Feng, Zijun Yao, Ziwei Chai, Ziyuan Wang, Zuzhou Zhang, Bin Xu, Minlie Huang, Hongning Wang, Juanzi Li, Yuxiao Dong, Jie Tang
First: 2026-02-17T17:50:56+00:00 · Latest: 2026-02-17T17:50:56+00:00
Abstract
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.
中文标题/摘要
标题:GLM-5:从 vibe 编码到能动工程
我们介绍了 GLM-5,这是一种下一代基础模型,旨在将 vibe 编码的范式转变为能动工程。GLM-5 基于其前代的能动、推理和编码(ARC)能力,采用 DSA 显著降低训练和推理成本,同时保持长上下文保真度。为了推进模型对齐和自主性,我们实现了一种新的异步强化学习基础设施,通过将生成与训练解耦,大幅提高训练后效率。此外,我们提出了新的异步代理 RL 算法,进一步提高 RL 质量,使模型能够更有效地从复杂的长期交互中学习。通过这些创新,GLM-5 在主要的开放基准测试中达到了最先进的性能。最关键的是,GLM-5 在实际编码任务中展示了前所未有的能力,超越了之前的基线,在处理端到端的软件工程挑战方面表现更佳。代码、模型及相关信息可在 https://github.com/zai-org/GLM-5 获取。
Summary / 总结
GLM-5 is a next-generation foundation model that shifts from vibe coding to agentic engineering. It leverages DSA to reduce costs while preserving long-context fidelity. GLM-5 introduces an asynchronous reinforcement learning infrastructure and novel RL algorithms, enhancing post-training efficiency and enabling better learning from complex interactions. GLM-5 shows superior performance on open benchmarks and excels in real-world coding tasks, surpassing previous models in end-to-end software engineering challenges.
GLM-5 是一种下一代基础模型,旨在从 vibe 编码转向 agentic 工程。它利用 DSA 降低成本同时保持长上下文保真度。GLM-5 实现了异步强化学习基础设施和新型 RL 算法,提高后训练效率并更好地从复杂交互中学习。GLM-5 在基准测试中达到最先进的性能,并在实际编码任务中超越了之前的基线。
ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Authors: Manav Nitin Kapadnis, Lawanya Baghel, Atharva Naik, Carolyn Rosé
First: 2026-02-17T17:45:34+00:00 · Latest: 2026-02-17T17:45:34+00:00
Comments: 16 pages, 13 figures including Supplementary Material
Abstract
While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.
中文标题/摘要
标题:ChartEditBench:评估多轮图表编辑能力的多模态语言模型
虽然多模态大型语言模型(MLLMs)在单轮图表生成方面表现出色,但它们支持实际探索性数据分析的能力仍被忽视。实际上,用户通过多轮交互迭代细化可视化,这需要保持共同理解,跟踪先前的编辑,并适应不断变化的偏好。我们引入了ChartEditBench,这是一个通过代码实现增量、视觉接地图表编辑的基准,包含5,000条难度控制的修改链和一个严格的人工验证子集。与之前的单次编辑基准不同,ChartEditBench评估持续的、上下文相关的编辑。我们还提出了一种稳健的评估框架,通过结合执行基础的准确度检查、像素级视觉相似性和逻辑代码验证,缓解了LLM作为裁判的度量标准的局限性。实验表明,最先进的MLLMs在多轮设置中由于错误累积和共享上下文的失败而表现出显著下降,但在风格编辑方面表现出色,但在数据为中心的转换方面频繁出现执行失败。ChartEditBench为接地的、意图感知的多模态编程建立了具有挑战性的测试平台。
Summary / 总结
ChartEditBench evaluates the ability of Multimodal Large Language Models (MLLMs) to handle multi-turn chart editing, which is crucial for real-world data analysis. The benchmark includes 5,000 difficulty-controlled modification chains and a human-verified subset. The evaluation framework integrates execution-based fidelity checks, pixel-level visual similarity, and logical code verification to assess sustained, context-aware editing. Experiments show that MLLM performance degrades substantially in multi-turn settings because of error accumulation and breakdowns in shared context; models handle stylistic edits well but frequently produce execution failures on data-centric transformations.
ChartEditBench 评估了多模态大型语言模型(MLLMs)在多轮图表编辑中的能力,这对于实际数据分析至关重要。基准包括5,000个难度控制的修改链和一个经过人工验证的子集。评估框架结合了执行基础的准确度检查、像素级别的视觉相似性和逻辑代码验证,以评估持续的、基于上下文的编辑。实验表明,MLLMs 在风格编辑方面表现良好,但在数据为中心的转换方面由于错误累积和上下文断裂而经常出现执行失败。
Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos
Authors: Laura De Grazia, Danae Sánchez Villegas, Desmond Elliott, Mireia Farrús, Mariona Taulé
First: 2026-02-17T17:45:28+00:00 · Latest: 2026-02-17T17:45:28+00:00
Abstract
Online sexism appears in various forms, which makes its detection challenging. Although automated tools can enhance the identification of sexist content, they are often restricted to binary classification. Consequently, more subtle manifestations of sexism may remain undetected due to the lack of fine-grained, context-sensitive labels. To address this issue, we make the following contributions: (1) we present FineMuSe, a new multimodal sexism detection dataset in Spanish that includes both binary and fine-grained annotations; (2) we introduce a comprehensive hierarchical taxonomy that encompasses forms of sexism, non-sexism, and rhetorical devices of irony and humor; and (3) we evaluate a wide range of LLMs for both binary and fine-grained sexism detection. Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.
中文标题/摘要
标题:超越二元分类:社交媒体视频中的细粒度性别歧视检测
在线性别歧视以多种形式出现,这使其检测变得具有挑战性。尽管自动化工具可以增强对性别歧视内容的识别,但它们通常仅限于二元分类。因此,由于缺乏细粒度和上下文敏感的标签,更微妙的性别歧视形式可能会被遗漏。为了解决这一问题,我们做出了以下贡献:(1) 我们提出了一个新的西班牙语多模态性别歧视检测数据集FineMuSe,其中包括二元和细粒度注释;(2) 我们引入了一个全面的层次分类体系,涵盖了性别歧视、非性别歧视以及讽刺和幽默的修辞手法;(3) 我们评估了多种语言模型在二元和细粒度性别歧视检测中的表现。我们的研究结果表明,多模态语言模型在识别微妙形式的性别歧视方面与人类注释者具有竞争力;然而,它们在通过视觉线索传达的多种性别歧视类型共现时难以捕捉到。
Summary / 总结
The paper addresses the detection of online sexism beyond binary classification, targeting subtle forms that binary labels miss. It introduces FineMuSe, a new multimodal Spanish dataset with both binary and fine-grained annotations, proposes a hierarchical taxonomy covering forms of sexism, non-sexism, and the rhetorical devices of irony and humor, and evaluates a range of LLMs on both tasks. The study finds that multimodal LLMs perform competitively with human annotators in identifying nuanced sexism but have difficulty capturing co-occurring sexist types when these are conveyed through visual cues.
论文通过提出一个新的多模态数据集FineMuSe,包含西班牙语中的细粒度注释,来应对在线性别歧视的多种形式的检测挑战。它引入了一个性别歧视的层次分类体系,并评估了LLM在二元和细粒度性别歧视检测中的表现。研究发现,多模态LLM在识别细微性别歧视方面表现良好,但在通过视觉线索传达多种性别歧视类型时难以捕捉到它们的共现情况。
A Note on Non-Composability of Layerwise Approximate Verification for Neural Inference
Authors: Or Zamir
First: 2026-02-17T17:41:59+00:00 · Latest: 2026-02-17T17:41:59+00:00
Abstract
A natural and informal approach to verifiable (or zero-knowledge) ML inference over floating-point data is: "prove that each layer was computed correctly up to tolerance δ; therefore the final output is a reasonable inference result". This short note gives a simple counterexample showing that this inference is false in general: for any neural network, we can construct a functionally equivalent network for which adversarially chosen approximation-magnitude errors in individual layer computations suffice to steer the final output arbitrarily (within a prescribed bounded range).
中文标题/摘要
标题:关于神经网络推理逐层近似验证不可组合性的注记
一种自然且非正式的方法来验证(或零知识)浮点数据的ML推理是:“证明每一层的计算正确性在容差δ内;因此最终输出是一个合理的推理结果”。这篇简短的注记给出了一个简单的反例,表明这种推理在一般情况下是错误的:对于任何神经网络,我们都可以构造一个功能等价的网络,在其中个别层的计算中由对手选择的近似误差足以使最终输出任意地(在规定范围内)改变。
Summary / 总结
This paper addresses the non-composability of layerwise approximate verification in neural networks. It demonstrates through a counterexample that proving each layer's correctness up to a tolerance δ does not guarantee the final output's correctness, as adversarial errors in individual layers can steer the final output to any desired value within a bounded range. The main finding is that layerwise verification is insufficient for ensuring the overall inference's accuracy in neural networks.
该论文探讨了神经网络中层级近似验证的不可组合性。通过一个反例表明,即使证明了每一层的计算正确性在误差δ范围内,也不能保证最终输出的正确性,因为个别层的对抗性误差可以将最终输出引导到预设范围内的任何值。主要发现是,层级验证不足以确保整体推理的准确性。
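The non-composability argument above can be illustrated with a toy sketch. This is not the construction from the note, only a minimal example under stated assumptions: a one-layer reference network, an equivalent two-layer network that first scales its result down and then back up, and a verifier that accepts any per-layer additive error up to a fixed tolerance. The names `DELTA`, `EPS`, `original_net`, `shrink_layer`, `expand_layer`, `equivalent_net`, and `steering_error` are all hypothetical, and the constants are powers of two so that the floating-point arithmetic is exact.

```python
"""Toy illustration: per-layer tolerance checks do not compose."""

# Hypothetical constants, chosen as powers of two so the arithmetic is exact.
DELTA = 2.0 ** -10   # per-layer absolute tolerance the verifier accepts
EPS = 2.0 ** -20     # shrink factor hidden inside the equivalent network


def original_net(x: float) -> float:
    """Reference one-layer network: y = 2x + 1."""
    return 2.0 * x + 1.0


def shrink_layer(x: float) -> float:
    """Layer 1 of the equivalent network: the original output scaled by EPS."""
    return EPS * original_net(x)


def expand_layer(h: float) -> float:
    """Layer 2 of the equivalent network: undo the scaling."""
    return h / EPS


def equivalent_net(x: float, layer1_error: float = 0.0) -> float:
    """Run both layers, letting layer 1 deviate by an additive error."""
    if abs(layer1_error) > DELTA:
        raise ValueError("error exceeds the per-layer tolerance")
    return expand_layer(shrink_layer(x) + layer1_error)


def steering_error(x: float, target: float) -> float:
    """Layer-1 error that moves the final output to `target`."""
    return (target - original_net(x)) * EPS


if __name__ == "__main__":
    x = 3.0
    print(equivalent_net(x))                     # honest run: 7.0
    err = steering_error(x, target=100.0)
    print(abs(err) <= DELTA)                     # True: passes the layer check
    print(equivalent_net(x, layer1_error=err))   # steered output: 100.0
```

In this sketch every layer stays within `DELTA` of its correct value, yet the final output can be moved anywhere within `DELTA / EPS = 1024` of the honest result, because the second layer amplifies the first layer's error by `1 / EPS`.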
Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac
Authors: Chahan Vidal-Gorène, Bastien Kindt, Florian Cafiero
First: 2026-02-17T17:34:32+00:00 · Latest: 2026-02-17T17:34:32+00:00
Abstract
Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.
中文标题/摘要
标题:资源不足的语言的资源不足研究:使用大语言模型标注员进行历史亚美尼亚语、格鲁吉亚语、希腊语和叙利亚语的词形还原和词性标注
资源不足的语言对自然语言处理任务,如词形还原和词性标注,构成了持续的挑战。本文探讨了最近的大语言模型(LLMs),包括GPT-4变体和开源权重Mistral模型,在四种历史上和语言上多样化的资源不足语言:古希腊语、古典亚美尼亚语、古格鲁吉亚语和叙利亚语中的少量和零样本设置中处理这些任务的能力。使用一个新颖的基准,包括对齐的训练和领域外测试语料库,我们评估了基础模型在词形还原和词性标注任务上的性能,并将其与PIE(特定任务的RNN基线)进行了比较。我们的结果显示,即使未经微调,大语言模型在大多数语言的少量样本设置中也实现了竞争力或更优的性能。对于具有复杂形态和非拉丁字母书写的语言,仍然存在重大挑战,但我们证明了大语言模型在缺乏数据的情况下是进行语言标注任务的一个可信且相关的选择,作为注释的有效辅助。
LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Authors: Tiago Fernandes Tavares
First: 2025-09-26T11:27:22+00:00 · Latest: 2026-02-17T17:26:42+00:00
Comments: This version introduces a major architectural shift to Local LLMs and NLI-based assignment, scaling the framework to O(1) generative complexity. Formerly titled 'Question-Driven Analysis and Synthesis'
Abstract
The discovery of deep, steerable taxonomies in large text corpora is currently restricted by a trade-off between the surface-level efficiency of topic models and the prohibitive, non-scalable assignment costs of LLM-integrated frameworks. We introduce LogiPart, a scalable, hypothesis-first framework for building interpretable hierarchical partitions that decouples hierarchy growth from expensive full-corpus LLM conditioning. LogiPart utilizes locally hosted LLMs on compact, embedding-aware samples to generate concise natural-language taxonomic predicates. These predicates are then evaluated efficiently across the entire corpus using zero-shot Natural Language Inference (NLI) combined with fast graph-based label propagation, achieving constant O(1) generative token complexity per node relative to corpus size. We evaluate LogiPart across four diverse text corpora (totaling ≈140,000 documents). Using structured manifolds for calibration, we identify an empirical reasoning threshold at the 14B-parameter scale required for stable semantic grounding. On complex, high-entropy corpora (Wikipedia, US Bills), where traditional thematic metrics reveal an "alignment gap," inverse logic validation confirms the stability of the induced logic, with individual taxonomic bisections maintaining an average per-node routing accuracy of up to 96%. A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture. LogiPart enables frontier-level exploratory analysis on consumer-grade hardware, making hypothesis-driven taxonomic discovery feasible under realistic computational and governance constraints.
中文标题/摘要
标题:LogiPart:通过逻辑分区实现大规模数据探索的本地大型语言模型
当前,大型文本语料库中深层次可操控分类体系的发现受到主题模型表面效率与LLM集成框架中昂贵的非可扩展分配成本之间的权衡限制。我们引入了LogiPart,一种可扩展的、基于假设的框架,用于构建可解释的层次分区,将层次结构的增长与昂贵的全语料库LLM条件化脱钩。LogiPart 利用本地托管的LLM在紧凑、嵌入感知样本上生成简洁的自然语言分类谓词。这些谓词通过零样本自然语言推理(NLI)与快速图基标签传播高效地在整个语料库中进行评估,实现相对于语料库大小节点生成复杂度为常数O(1)。我们在四个不同的文本语料库(共约140,000篇文档)上评估了LogiPart。我们使用结构化流形进行校准,在14B参数规模下识别出稳定的语义接地的经验推理阈值。在复杂、高熵语料库(维基百科、美国法案)中,传统主题度量揭示了“对齐差距”,逆逻辑验证确认了诱导逻辑的稳定性,每个分类二分法的平均节点路由准确率高达96%。独立LLM作为法官的定性审计确认了发现有意义的功能轴,如政策意图,这些轴是主题真实标签无法捕捉到的。LogiPart 使在消费级硬件上实现前沿探索性分析成为可能,在现实的计算和治理约束下使假设驱动的分类发现成为可能。