Edit3r: Instant 3D Scene Editing from Sparse Unposed Images
Authors: Jiageng Liu, Weijie Lyu, Xueting Li, Yejie Guo, Ming-Hsuan Yang
First: 2025-12-31T18:59:53+00:00 · Latest: 2025-12-31T18:59:53+00:00
Comments: Project page: https://edit3r.github.io/edit3r/
Abstract
We present Edit3r, a feed-forward framework that reconstructs and edits 3D scenes in a single pass from unposed, view-inconsistent, instruction-edited images. Unlike prior methods requiring per-scene optimization, Edit3r directly predicts instruction-aligned 3D edits, enabling fast and photorealistic rendering without optimization or pose estimation. A key challenge in training such a model lies in the absence of multi-view consistent edited images for supervision. We address this with (i) a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and (ii) an asymmetric input strategy that pairs a recolored reference view with raw auxiliary views, encouraging the network to fuse and align disparate observations. At inference, our model effectively handles images edited by 2D methods such as InstructPix2Pix, despite not being exposed to such edits during training. For large-scale quantitative evaluation, we introduce DL3DV-Edit-Bench, a benchmark built on the DL3DV test split, featuring 20 diverse scenes, 4 edit types and 100 edits in total. Comprehensive quantitative and qualitative results show that Edit3r achieves superior semantic alignment and enhanced 3D consistency compared to recent baselines, while operating at significantly higher inference speed, making it promising for real-time 3D editing applications.
中文标题/摘要
标题:Edit3r:从稀疏未对齐图像即时编辑3D场景
我们提出了Edit3r,这是一种单次通过框架,可以从未对齐、视角不一致、指令编辑过的图像中重建和编辑3D场景。与需要逐场景优化的先前方法不同,Edit3r可以直接预测指令对齐的3D编辑,从而实现快速且逼真的渲染,无需优化或姿态估计。训练此类模型的关键挑战在于缺乏多视角一致的编辑图像作为监督。我们通过(i)基于SAM2的重新着色策略生成可靠的、跨视角一致的监督,以及(ii)不对称输入策略,将重新着色的参考视图与原始辅助视图配对,鼓励网络融合和对齐不同的观察结果来解决这一问题。在推理时,我们的模型能够有效处理由2D方法(如InstructPix2Pix)编辑的图像,尽管在训练过程中并未接触到此类编辑。为了进行大规模的定量评估,我们引入了DL3DV-Edit-Bench基准,该基准基于DL3DV测试集构建,包含20个不同的场景、4种编辑类型和总共100次编辑。全面的定量和定性结果表明,Edit3r在语义对齐和3D一致性方面优于最近的基线方法,同时具有显著更高的推理速度,使其在实时3D编辑应用中具有前景。
Summary / 总结
Edit3r is a feed-forward framework that reconstructs and edits 3D scenes from unposed images in a single pass, without per-scene optimization or pose estimation. It is trained with a SAM2-based recoloring strategy that generates reliable, cross-view-consistent supervision, and with an asymmetric input strategy that encourages the network to fuse and align disparate observations. At inference, the model handles images edited by 2D methods such as InstructPix2Pix, even though it saw no such edits during training. It achieves better semantic alignment and 3D consistency than recent baselines at significantly higher inference speed, which makes it promising for real-time 3D editing applications.
Edit3r 是一个无需优化和姿态估计即可从不一致视角的图像中重建并编辑 3D 场景的前馈框架。它使用 SAM2 基础的重新着色策略生成可靠的监督,并使用不对称输入策略鼓励网络融合和对齐不同的观察。该模型可以处理如 InstructPix2Pix 等 2D 编辑,而无需在训练中接触此类编辑。定量和定性结果表明,Edit3r 在语义对齐和 3D 一致性方面优于最近的基线模型,同时具有更快的推理速度,适用于实时 3D 编辑应用。
Coordinated Humanoid Manipulation with Choice Policies
Authors: Haozhi Qi, Yen-Jen Wang, Toru Lin, Brent Yi, Yi Ma, Koushil Sreenath, Jitendra Malik
First: 2025-12-31T18:59:53+00:00 · Latest: 2025-12-31T18:59:53+00:00
Comments: Code and Website: https://choice-policy.github.io/
Abstract
Humanoid robots hold great promise for operating in human-centric environments, yet achieving robust whole-body coordination across the head, hands, and legs remains a major challenge. We present a system that combines a modular teleoperation interface with a scalable learning framework to address this problem. Our teleoperation design decomposes humanoid control into intuitive submodules, which include hand-eye coordination, grasp primitives, arm end-effector tracking, and locomotion. This modularity allows us to collect high-quality demonstrations efficiently. Building on this, we introduce Choice Policy, an imitation learning approach that generates multiple candidate actions and learns to score them. This architecture enables both fast inference and effective modeling of multimodal behaviors. We validate our approach on two real-world tasks: dishwasher loading and whole-body loco-manipulation for whiteboard wiping. Experiments show that Choice Policy significantly outperforms diffusion policies and standard behavior cloning. Furthermore, our results indicate that hand-eye coordination is critical for success in long-horizon tasks. Our work demonstrates a practical path toward scalable data collection and learning for coordinated humanoid manipulation in unstructured environments.
中文标题/摘要
标题:协调的人形操作策略
人形机器人在人类中心环境中操作具有巨大潜力,但实现头部、手部和腿部的全身协调仍是一个重大挑战。我们提出了一种结合模块化远程操作界面和可扩展学习框架的系统来解决这一问题。我们的远程操作设计将人形控制分解为直观的子模块,包括手眼协调、抓取原语、手臂末端执行器跟踪和移动。这种模块化使我们能够高效地收集高质量的演示。在此基础上,我们引入了选择策略,这是一种模仿学习方法,生成多个候选动作并学习评分。该架构能够实现快速推理和多模态行为的有效建模。我们在两个实际任务上验证了我们的方法:洗碗机装载和全身移动操作以擦白板。实验表明,选择策略显著优于扩散策略和标准行为克隆。此外,我们的结果表明,手眼协调对于长期任务的成功至关重要。我们的工作展示了在非结构化环境中实现协调人形操作的可扩展数据收集和学习的实际路径。
Summary / 总结
The research aims to achieve robust whole-body coordination in humanoid robots operating in human-centric environments. It combines a modular teleoperation interface, which enables efficient collection of high-quality demonstrations, with Choice Policy, an imitation learning approach that generates multiple candidate actions and learns to score them, enabling fast inference and multimodal behavior modeling. The approach is validated on dishwasher loading and whole-body loco-manipulation for whiteboard wiping, where it significantly outperforms diffusion policies and standard behavior cloning; hand-eye coordination is identified as critical for long-horizon tasks.
研究旨在通过人形机器人实现人体中心环境中的全身协调控制。提出了一种模块化远程操作界面和名为Choice Policy的可扩展学习框架,该框架生成并评分多个候选动作。实验表明,Choice Policy在洗碗机装载和全身移动操作擦黑板任务中优于扩散策略和标准行为克隆,强调了长时间任务中手眼协调的重要性。
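The generate-then-score idea behind Choice Policy can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names, the linear proposal heads, the linear scorer, and all shapes are hypothetical and are not the authors' architecture. Only the structure (propose several candidate actions, score them, act on the best one) comes from the abstract.

```python
import numpy as np

def propose_candidates(obs, heads):
    """Hypothetical proposal step: each linear head maps the observation to one candidate action."""
    # heads: (K, action_dim, obs_dim), obs: (obs_dim,) -> candidates: (K, action_dim)
    return heads @ obs

def score_candidates(obs, candidates, scorer):
    """Hypothetical scoring step: one scalar score per candidate action."""
    # scorer: (action_dim, obs_dim); score_k = candidate_k . (scorer @ obs)
    return candidates @ (scorer @ obs)

def choose_action(obs, heads, scorer):
    """Generate K candidates, score them, and return the highest-scoring one."""
    candidates = propose_candidates(obs, heads)
    scores = score_candidates(obs, candidates, scorer)
    best = int(np.argmax(scores))
    return candidates[best], best, scores

# Toy usage with random (untrained) parameters.
rng = np.random.default_rng(0)
obs = rng.normal(size=4)
heads = rng.normal(size=(5, 2, 4))   # K = 5 candidate heads, 2-D actions
scorer = rng.normal(size=(2, 4))
action, best, scores = choose_action(obs, heads, scorer)
```

Because all candidates come from a single forward pass and selection is one argmax, inference needs no iterative sampling, which is consistent with the fast-inference claim in the abstract.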
Scaling Open-Ended Reasoning to Predict the Future
Authors: Nikhil Chandak, Shashwat Goel, Ameya Prabhu, Moritz Hardt, Jonas Geiping
First: 2025-12-31T18:59:51+00:00 · Latest: 2025-12-31T18:59:51+00:00
Comments: 45 pages
Abstract
High-stakes decision making involves reasoning under uncertainty about the future. In this work, we train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news, using a fully automated, careful curation recipe. We train the Qwen3 thinking models on our dataset, OpenForesight. To prevent leakage of future information during training and evaluation, we use an offline news corpus, both for data generation and retrieval in our forecasting system. Guided by a small validation set, we show the benefits of retrieval and of an improved reward function for reinforcement learning (RL). Once we obtain our final forecasting system, we perform held-out testing from May to August 2025. Our specialized model, OpenForecaster 8B, matches much larger proprietary models, with our training improving the accuracy, calibration, and consistency of predictions. We find that calibration improvements from forecasting training generalize across popular benchmarks. We open-source all our models, code, and data to make research on language model forecasting broadly accessible.
中文标题/摘要
标题:将开放性推理扩展以预测未来
高风险决策涉及对未来不确定性的推理。在本研究中,我们训练语言模型对开放性预测问题进行预测。为了扩大训练数据,我们从每日新闻中报道的全球事件中合成新型预测问题,采用完全自动化的精心编纂配方。我们在OpenForesight数据集上训练Qwen3思考模型。为了防止训练和评估期间出现未来信息泄露,我们在数据生成和检索中使用离线新闻语料库。在一小部分验证集的指导下,我们展示了检索的好处以及强化学习(RL)中改进的奖励函数。一旦我们获得最终的预测系统,我们将在2025年5月至8月之间进行保留测试。我们的专门模型OpenForecaster 8B与更大规模的专有模型相当,我们的训练提高了预测的准确性、校准性和一致性。我们发现预测训练带来的校准改进在流行基准上具有普遍性。我们开源了所有模型、代码和数据,以使语言模型预测研究广泛可及。
Summary / 总结
This work trains language models to predict the future by answering open-ended forecasting questions derived from daily news. The authors synthesize a dataset, OpenForesight, with a fully automated curation recipe, train Qwen3 thinking models on it, and use an offline news corpus for both data generation and retrieval to prevent leakage of future information. The resulting model, OpenForecaster 8B, matches much larger proprietary models, and the training improves the accuracy, calibration, and consistency of its predictions. Calibration improvements generalize across popular benchmarks, and the authors open-source their models, code, and data.
该研究旨在通过增强语言模型的开放性推理能力来预测未来事件,这对于高风险决策至关重要。作者从每日新闻中合成预测问题,并在名为OpenForesight的数据集上训练Qwen3模型。他们使用离线新闻语料库进行数据生成和检索,以避免未来信息泄露。最终模型OpenForecaster 8B在准确度、校准性和一致性方面优于更大规模的专有模型,并且校准改进在多个基准测试中具有普适性。所有模型、代码和数据均已开源,以促进该领域的进一步研究。
From Inpainting to Editing: A Self-Bootstrapping Framework for Context-Rich Visual Dubbing
Authors: Xu He, Haoxian Zhang, Hejia Chen, Changyuan Zheng, Liyang Chen, Songlin Tang, Jiehui Huang, Xiaoqiang Liu, Pengfei Wan, Zhiyong Wu
First: 2025-12-31T18:58:30+00:00 · Latest: 2025-12-31T18:58:30+00:00
Comments: Project Page https://hjrphoebus.github.io/X-Dub
Abstract
Audio-driven visual dubbing aims to synchronize a video's lip movements with new speech, but is fundamentally challenged by the lack of ideal training data: paired videos where only a subject's lip movements differ while all other visual conditions are identical. Existing methods circumvent this with a mask-based inpainting paradigm, where an incomplete visual conditioning forces models to simultaneously hallucinate missing content and sync lips, leading to visual artifacts, identity drift, and poor synchronization. In this work, we propose a novel self-bootstrapping framework that reframes visual dubbing from an ill-posed inpainting task into a well-conditioned video-to-video editing problem. Our approach employs a Diffusion Transformer, first as a data generator, to synthesize ideal training data: a lip-altered companion video for each real sample, forming visually aligned video pairs. A DiT-based audio-driven editor is then trained on these pairs end-to-end, leveraging the complete and aligned input video frames to focus solely on precise, audio-driven lip modifications. This complete, frame-aligned input conditioning forms a rich visual context for the editor, providing it with complete identity cues, scene interactions, and continuous spatiotemporal dynamics. Leveraging this rich context fundamentally enables our method to achieve highly accurate lip sync, faithful identity preservation, and exceptional robustness against challenging in-the-wild scenarios. We further introduce a timestep-adaptive multi-phase learning strategy as a necessary component to disentangle conflicting editing objectives across diffusion timesteps, thereby facilitating stable training and yielding enhanced lip synchronization and visual fidelity. Additionally, we propose ContextDubBench, a comprehensive benchmark dataset for robust evaluation in diverse and challenging practical application scenarios.
中文标题/摘要
标题:从修复到编辑:一种基于上下文的视觉配音自强化框架
基于音频的视觉配音旨在使视频的唇部动作与新语音同步,但根本上受到理想训练数据的挑战:仅唇部动作不同而其他所有视觉条件都相同的配对视频。现有方法通过基于掩码的修复范式绕过了这一问题,不完整的视觉条件迫使模型同时生成缺失的内容并同步唇部,导致视觉伪影、身份漂移和同步不良。在本文中,我们提出了一种新颖的自强化框架,将视觉配音重新定义为一个从病态的修复任务到一个良好的视频到视频编辑问题。我们的方法首先使用扩散变换器作为数据生成器,合成理想的训练数据:每个真实样本的唇部修改的伴生视频,形成视觉对齐的视频对。然后,基于扩散变换器的音频驱动编辑器在这些对上端到端训练,利用完整的对齐输入视频帧专注于精确的音频驱动唇部修改。这种完整的、帧对齐的输入条件为编辑器提供了丰富的视觉上下文,提供了完整的身份线索、场景交互和连续的空间-时间动态。利用这种丰富的上下文,我们的方法能够实现高度准确的唇部同步、忠实的身份保留和对复杂野外场景的出色鲁棒性。我们还引入了一种时间步长自适应多阶段学习策略,作为必要组件以在扩散时间步长中分离相互冲突的编辑目标,从而促进稳定训练并提高唇部同步和视觉保真度。此外,我们提出了ContextDubBench,这是一个全面的基准数据集,用于在多样且具有挑战性的实际应用场景中进行稳健评估。
Summary / 总结
This paper addresses the challenge of audio-driven visual dubbing by proposing a self-bootstrapping framework that transforms the task from an ill-posed inpainting problem into a well-conditioned video-to-video editing problem. The framework uses a Diffusion Transformer to generate ideal training data and then trains an audio-driven editor on these pairs. This approach ensures precise lip synchronization, faithful identity preservation, and robustness in challenging scenarios. The method also introduces a timestep-adaptive multi-phase learning strategy to improve training stability and visual fidelity.
论文提出了一种自提升框架,将音频驱动的视觉配音任务从一个病态的图像填充问题转化为一个良好的视频到视频编辑问题。该框架使用扩散变换器生成理想的训练数据,并在这些数据对上训练一个音频驱动的编辑器。这种方法确保了高度准确的唇部同步、忠实的身份保留以及在挑战性场景中的鲁棒性。此外,该方法还引入了一种时间步长自适应多阶段学习策略,以提高训练的稳定性并增强唇部同步和视觉保真度。
Vulcan: Instance-Optimal Systems Heuristics Through LLM-Driven Search
Authors: Rohit Dwivedula, Divyanshu Saxena, Sujay Yadalam, Daehyeok Kim, Aditya Akella
First: 2025-12-31T18:58:19+00:00 · Latest: 2025-12-31T18:58:19+00:00
Comments: 27 pages, 11 figures, 7 tables
Abstract
Resource-management tasks in modern operating and distributed systems, such as scheduling, caching, and active queue management, continue to rely primarily on hand-designed heuristics. Designing performant heuristics is an expensive, time-consuming process that must be repeated continuously because hardware, workloads, and environments are in constant flux.
We propose a new alternative: synthesizing instance-optimal heuristics -- specialized for the exact workloads and hardware where they will be deployed -- using code-generating large language models (LLMs). To make this synthesis tractable, Vulcan separates policy and mechanism through LLM-friendly, task-agnostic interfaces. With these interfaces, users specify the inputs and objectives of their desired policy, while Vulcan searches for performant policies via evolutionary search over LLM-generated code. This interface is expressive enough to capture a wide range of system policies, yet sufficiently constrained to allow even small, inexpensive LLMs to generate correct and executable code.
We use Vulcan to synthesize performant heuristics for cache eviction and memory tiering, and find that these heuristics outperform all human-designed state-of-the-art algorithms by up to 69% and 7.9%, respectively.
中文标题/摘要
标题:Vulcan:通过LLM驱动搜索实现实例最优系统启发式方法
现代操作系统和分布式系统中的资源管理任务仍然主要依赖手工设计的启发式方法来处理诸如调度、缓存或活跃队列管理等任务。设计高效的启发式方法是一个既昂贵又耗时的过程,由于硬件、工作负载和环境的不断变化,我们被迫不断重复这个过程。
我们提出了一种新的替代方案:使用代码生成的大语言模型(LLM)合成针对具体工作负载和硬件实例最优的启发式方法。为了使这种合成变得可行,Vulcan 通过大语言模型友好的、任务无关的接口将策略和机制分离。通过这些接口,用户指定所需策略的输入和目标,而 Vulcan 则通过在大语言模型生成的代码上进行进化搜索来寻找高效的策略。这种接口足够灵活以捕捉各种系统策略,同时又足够约束,即使使用小型、廉价的大语言模型也能生成正确的可执行代码。
我们使用 Vulcan 合成缓存淘汰和内存分层的高效启发式方法,并发现这些启发式方法在性能上分别比所有手工设计的最先进的算法高出69%和7.9%。
Summary / 总结
The paper proposes Vulcan, a system that uses code-generating large language models (LLMs) to synthesize instance-optimal heuristics for resource management tasks. By separating policy and mechanism through LLM-friendly interfaces, Vulcan enables users to specify policy inputs and objectives, while the system searches for performant policies via evolutionary search. Vulcan was used to generate heuristics for cache eviction and memory tiering, which outperformed human-designed state-of-the-art algorithms by up to 69% and 7.9% respectively.
Vulcan 提出使用代码生成的大语言模型(LLMs)来合成针对特定工作负载和硬件的最优策略,以应对现代资源管理任务中设计高性能策略的挑战。通过分离策略和机制,Vulcan 允许用户指定策略输入和目标,而 LLMs 则通过进化代码生成搜索最优策略。实验表明,Vulcan 生成的缓存淘汰和内存分层策略分别比人类设计的最先进的算法性能高出 69% 和 7.9%。
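The policy/mechanism separation described for Vulcan can be sketched for cache eviction as follows. This is a minimal illustration under stated assumptions, not Vulcan's interface: `simulate`, `search`, the metadata fields, and the two hand-written policies are hypothetical. In Vulcan the candidate policies would be LLM-generated code refined by evolutionary search; here two fixed candidates stand in for that step.

```python
def simulate(trace, capacity, score):
    """Mechanism (hypothetical): a fixed cache loop that evicts the lowest-scored object."""
    cache = {}  # key -> metadata visible to the policy
    hits = 0
    for now, key in enumerate(trace):
        if key in cache:
            hits += 1
            cache[key]["hits"] += 1
            cache[key]["last"] = now
            continue
        if len(cache) >= capacity:
            victim = min(cache, key=lambda k: score(cache[k], now))
            del cache[victim]
        cache[key] = {"hits": 1, "last": now}
    return hits / len(trace) if trace else 0.0

# Policies (hypothetical stand-ins for generated code): higher score = keep longer.
def lru(meta, now):
    return meta["last"]

def lfu(meta, now):
    return meta["hits"]

def search(trace, capacity, candidates):
    """Pick the candidate policy with the best hit rate on this specific workload."""
    return max(candidates, key=lambda name: simulate(trace, capacity, candidates[name]))

trace = ["a", "b", "a", "c", "a", "d", "a", "b"]
candidates = {"lru": lru, "lfu": lfu}
lru_hit_rate = simulate(trace, 2, lru)
best_policy = search(trace, 2, candidates)
```

The policy only sees per-object metadata and returns a number, so a generated candidate cannot break the cache loop itself; this narrow contract is the kind of constraint the abstract credits with letting small models produce executable code.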
Deep sequence models tend to memorize geometrically; it is unclear why
Authors: Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar
First: 2025-10-30T17:40:22+00:00 · Latest: 2025-12-31T18:57:25+00:00
Abstract
Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task.
From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup.
Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also shows practitioners that there is visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.
中文标题/摘要
标题:深度序列模型倾向于几何记忆;原因尚不清楚
深度序列模型被认为主要以关联记忆的形式存储原子事实:通过直接查找共现实体来进行粗暴的查找。我们发现了一种截然不同的原子事实存储形式,我们称之为几何记忆。在此过程中,模型合成了嵌入,编码了所有实体之间的新型全局关系,包括训练中未共现的实体。这种存储方式非常强大:例如,我们展示了它如何将一个涉及 $\ell$ 次复合的困难推理任务转化为一个易于学习的一步导航任务。
从这一现象中,我们提取了难以解释的神经嵌入几何的基本方面。我们认为,这种几何结构的出现,而不是查找局部关联,不能简单地归因于典型的监督、架构或优化压力。令人意外的是,即使几何结构比粗暴查找更复杂,也会被学习。
然后,通过分析与Node2Vec的联系,我们展示了这种几何结构源自一种频谱偏差,这种偏差——与现有理论相反——确实自然地出现,尽管缺乏各种压力。这种分析还指出了实践者如何使Transformer记忆更加强烈地几何化。我们希望几何视角的参数记忆鼓励研究人员重新审视指导知识获取、容量、发现和遗忘的默认直觉。
Summary / 总结
The study explores how deep sequence models store information, challenging the notion that they primarily use associative memory. Instead, it identifies a geometric memory mechanism where the model creates embeddings that encode relationships between entities, even those not co-occurring in training data. This geometric memory simplifies complex reasoning tasks. The research suggests that the emergence of such geometric structures is not easily explained by typical training pressures and instead arises due to a spectral bias, which is explored through a connection to Node2Vec. The findings imply that practitioners can enhance geometric properties in Transformer models.
研究探讨了深度序列模型如何存储信息,挑战了它们主要使用关联记忆的观点。相反,研究发现模型通过创建嵌入来编码实体之间的关系,即使这些实体在训练数据中未共同出现。这种几何记忆简化了复杂的推理任务。研究指出,这种几何结构的出现不能简单地归因于典型的训练压力,而是由于一种光谱偏差,这一点通过与Node2Vec的联系得到了探索。研究结果表明,从业者可以通过增强Transformer模型的几何特性来改进。
Many Minds from One Model: Bayesian Transformers for Population Intelligence
Authors: Diji Yang, Yi Zhang
First: 2025-12-31T18:56:02+00:00 · Latest: 2025-12-31T18:56:02+00:00
Abstract
Despite their scale and success, modern transformers are almost universally trained as single-minded systems: optimization produces one deterministic set of parameters, representing a single functional hypothesis about the data. Motivated by the idea that intelligence emerges from many minds, we propose Population Bayesian Transformers (B-Trans), which transform a standard Large Language Model into a Bayesian Transformer model that supports sampling diverse yet coherent model instances from a single set of pre-trained weights.
B-Trans introduces a Bayesian-motivated posterior proxy by treating the bias-like offsets in normalization layers as stochastic variables with a Gaussian variational approximation, inducing a distribution over model behavior without the cost of training full Bayesian neural networks. Sampling from this proxy yields a set of model instances with diverse behaviors while maintaining general competence. To preserve coherence within each generation, we freeze the sampled noise at the sequence level, enforcing temporal consistency across tokens. B-Trans allows for population-level decision-making, where aggregating predictions across sampled individuals significantly enhances exploration. Experiments across zero-shot generation, Reinforcement Learning with Verifiable Rewards (RLVR), and RL without explicit labels demonstrate that B-Trans effectively leverages the wisdom of crowds, yielding superior semantic diversity while achieving better task performance compared to deterministic baselines.
中文标题/摘要
标题:一个模型多个心智:基于贝叶斯的变换器实现群体智能
尽管现代变换器在规模和成功方面取得了巨大进展,但它们几乎无一例外地被训练为单一目标系统:优化产生一组确定性的参数,代表对数据的单一功能假设。受智能源自多个心智这一理念的启发,我们提出了群体贝叶斯变换器(B-Trans),将标准大型语言模型转换为贝叶斯变换器模型,以支持从一组预训练权重中采样多样但又连贯的模型实例。
B-Trans 通过将归一化层中的偏置类似偏移视为具有高斯变分近似的随机变量,引入了一个贝叶斯动机的后验代理,从而在不训练完整的贝叶斯神经网络的情况下诱导模型行为的分布。从这个代理中采样会产生一组具有不同行为但保持一般能力的模型实例。为了在每次生成中保持连贯性,我们在序列级别冻结采样的噪声,确保在各个标记之间的时间一致性。B-Trans 允许群体级别的决策,其中跨采样个体汇总预测显著增强了探索性。在零样本生成、具有可验证奖励的强化学习(RLVR)以及无需显式标签的强化学习实验中,B-Trans 有效地利用了群体的智慧,提供了更好的语义多样性和任务性能,优于确定性基线。
Summary / 总结
The paper proposes Population Bayesian Transformers (B-Trans) to address the limitation of modern transformers being single-minded systems. B-Trans introduces a Bayesian posterior proxy by treating normalization layer biases as stochastic variables, allowing sampling of diverse yet coherent model instances from pre-trained weights. Experiments show that B-Trans enhances semantic diversity and task performance in zero-shot generation, RL with verifiable rewards, and RL without explicit labels, outperforming deterministic baselines.
本文提出了Population Bayesian Transformers(B-Trans),旨在解决现代变压器作为单一思维系统的局限性。B-Trans 将标准大型语言模型转换为贝叶斯模型,可以从单个预训练权重集中采样出多样但具有一致性的模型实例。该方法通过将归一化层偏置视为随机变量引入了贝叶斯后验近似,从而可以在保持通用能力的同时采样出具有不同行为的模型实例。实验表明,B-Trans 通过群体决策显著增强了语义多样性,并在零样本生成、具有验证奖励的强化学习以及无显式标签的强化学习中优于确定性基线模型。
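The sequence-level noise freezing described for B-Trans can be sketched on a single normalization layer. This is a minimal sketch under assumptions: the fixed scalar `sigma`, the function names, and the use of a plain LayerNorm are illustrative choices, whereas the paper describes a Gaussian variational approximation whose exact parameterization is not given in the abstract.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Standard layer normalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def sample_offset(beta, sigma, rng):
    """Draw one offset from a Gaussian centred on the pre-trained beta (sigma is an assumed scale)."""
    return beta + sigma * rng.standard_normal(beta.shape)

def run_sequence(tokens, gamma, beta, sigma, rng):
    """Sample the offset once per sequence and reuse it for every token."""
    beta_sample = sample_offset(beta, sigma, rng)
    return layer_norm(tokens, gamma, beta_sample), beta_sample

# Two "individuals" drawn from the same pre-trained parameters.
tokens = np.random.default_rng(0).normal(size=(6, 8))  # 6 tokens, width 8
gamma, beta = np.ones(8), np.zeros(8)
out_a, beta_a = run_sequence(tokens, gamma, beta, 0.1, np.random.default_rng(1))
out_b, beta_b = run_sequence(tokens, gamma, beta, 0.1, np.random.default_rng(2))
```

Each sampled individual shifts every token of its sequence by the same offset, so behavior differs between individuals but stays consistent within one generation.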
Reliable and Resilient Collective Communication Library for LLM Training and Serving
Authors: Wei Wang, Nengneng Yu, Sixian Xiong, Zaoxing Liu
First: 2025-12-31T18:53:11+00:00 · Latest: 2025-12-31T18:53:11+00:00
Abstract
Modern ML training and inference now span tens to tens of thousands of GPUs, where network faults can waste 10--15\% of GPU hours due to slow recovery. Common network errors and link fluctuations trigger timeouts that often terminate entire jobs, forcing expensive checkpoint rollback during training and request reprocessing during inference. We present R$^2$CCL, a fault-tolerant communication library that provides lossless, low-overhead failover by exploiting multi-NIC hardware. R$^2$CCL performs rapid connection migration, bandwidth-aware load redistribution, and resilient collective algorithms to maintain progress under failures. We evaluate R$^2$CCL on two 8-GPU H100 InfiniBand servers and via large-scale ML simulators modeling hundreds of GPUs with diverse failure patterns. Experiments show that R$^2$CCL is highly robust to NIC failures, incurring less than 1\% training and less than 3\% inference overheads. R$^2$CCL outperforms baselines AdapCC and DejaVu by 12.18$\times$ and 47$\times$, respectively.
中文标题/摘要
标题:面向LLM训练和服务的可靠和弹性集体通信库
现代ML训练和推理现在跨越了从十到数万个GPU,其中网络故障可能会浪费10-15%的GPU时间,由于缓慢的恢复。常见的网络错误和链路波动会触发超时,通常会终止整个任务,迫使在训练期间进行昂贵的检查点回滚,在推理期间重新处理请求。我们提出了R$^2$CCL,这是一种容错通信库,通过利用多网卡硬件提供无损、低开销的故障转移。R$^2$CCL执行快速连接迁移、带宽感知负载重分布和弹性集体算法,以在故障情况下保持进度。我们在两个8-GPU H100 InfiniBand服务器上评估了R$^2$CCL,并通过模拟数百个具有不同故障模式的GPU的大规模ML模拟器进行评估。实验表明,R$^2$CCL对网卡故障具有高度的鲁棒性,训练和推理的开销分别低于1%和3%。R$^2$CCL分别比基线AdapCC和DejaVu快12.18倍和47倍。
Summary / 总结
R$^2$CCL is a fault-tolerant communication library for modern machine learning (ML) training and inference across many GPUs. It exploits multi-NIC hardware to provide lossless, low-overhead failover, using rapid connection migration, bandwidth-aware load redistribution, and resilient collective algorithms to maintain progress under network faults. Experiments on two 8-GPU H100 InfiniBand servers and large-scale ML simulators show that R$^2$CCL is highly robust to NIC failures, incurring less than 1% training overhead and less than 3% inference overhead, and outperforming the baselines AdapCC and DejaVu by 12.18x and 47x, respectively.
R$^2$CCL是一种针对多GPU的ML训练和推理的容错通信库,利用多网卡硬件提供无损故障转移、快速连接迁移和带宽感知的负载重分布,以在网络故障下保持进度。实验在8-GPU服务器和大规模ML模拟器上进行,结果显示R$^2$CCL的开销极小,训练和推理分别只有不到1%和3%的开销,并且在与现有解决方案AdapCC和DejaVu的对比中表现出显著的优势。
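One way to picture bandwidth-aware load redistribution after a NIC failure is a proportional split over the surviving NICs. The sketch below is an assumption for illustration only: the function name, the NIC names, and the proportional rule are hypothetical, and the abstract does not specify how R$^2$CCL actually redistributes traffic.

```python
def redistribute(total_bytes, nic_bandwidth_gbps, failed):
    """Split a transfer across healthy NICs in proportion to their bandwidth (assumed rule)."""
    healthy = {nic: bw for nic, bw in nic_bandwidth_gbps.items() if nic not in failed}
    if not healthy:
        raise RuntimeError("no healthy NIC left to carry the traffic")
    total_bw = sum(healthy.values())
    return {nic: total_bytes * bw / total_bw for nic, bw in healthy.items()}

# Hypothetical server with four NICs of mixed speed; nic1 has failed.
nics = {"nic0": 400, "nic1": 400, "nic2": 200, "nic3": 200}
shares = redistribute(1200, nics, failed={"nic1"})
```

With a proportional split every surviving link finishes its share at the same time, so no single slow NIC becomes the straggler for the collective.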
Context-aware LLM-based AI Agents for Human-centered Energy Management Systems in Smart Buildings
Authors: Tianzhi He, Farrokh Jazizadeh
First: 2025-12-31T18:51:19+00:00 · Latest: 2025-12-31T18:51:19+00:00
Abstract
This study presents a conceptual framework and a prototype assessment for Large Language Model (LLM)-based Building Energy Management System (BEMS) AI agents to facilitate context-aware energy management in smart buildings through natural language interaction. The proposed framework comprises three modules: perception (sensing), central control (brain), and action (actuation and user interaction), forming a closed feedback loop that captures, analyzes, and interprets energy data to respond intelligently to user queries and manage connected appliances. By leveraging the autonomous data analytics capabilities of LLMs, the BEMS AI agent seeks to offer context-aware insights into energy consumption, cost prediction, and device scheduling, thereby addressing limitations in existing energy management systems. The prototype's performance was evaluated using 120 user queries across four distinct real-world residential energy datasets and different evaluation metrics, including latency, functionality, capability, accuracy, and cost-effectiveness. The generalizability of the framework was demonstrated using ANOVA tests. The results revealed promising performance, measured by response accuracy in device control (86%), memory-related tasks (97%), scheduling and automation (74%), and energy analysis (77%), while more complex cost estimation tasks highlighted areas for improvement with an accuracy of 49%. This benchmarking study moves toward formalizing the assessment of LLM-based BEMS AI agents and identifying future research directions, emphasizing the trade-off between response accuracy and computational efficiency.
中文标题/摘要
标题:智能建筑中面向人类中心的能源管理系统中的上下文感知LLM基AI代理
本研究提出了一种概念框架和原型评估,用于通过自然语言交互促进智能建筑中上下文感知能源管理的大型语言模型(LLM)基建筑能源管理系统(BEMS)AI代理。所提出的框架包括三个模块:感知(传感)、中央控制(大脑)和行动(执行和用户交互),形成一个闭环反馈回路,捕捉、分析和解释能源数据,以智能响应用户查询并管理连接的电器。通过利用LLM的自主数据分析能力,BEMS AI代理旨在提供有关能源消耗、成本预测和设备调度的上下文感知见解,从而解决现有能源管理系统中的局限性。原型的性能使用120个用户查询和四个不同的真实住宅能源数据集以及不同的评估指标(包括延迟、功能、能力、准确性和成本效益)进行了评估。通过ANOVA测试展示了该框架的普适性。结果表明,通过设备控制(86%)、记忆相关任务(97%)、调度和自动化(74%)和能源分析(77%)的响应准确性衡量,表现出有希望的性能,而更复杂的成本估算任务则指出了改进的领域,准确率为49%。这项基准研究朝着正式化LLM基BEMS AI代理的评估和确定未来研究方向迈进,强调了响应准确性和计算效率之间的权衡。
Summary / 总结
This study proposes a conceptual framework and prototype for LLM-based BEMS AI agents to enhance context-aware energy management in smart buildings via natural language interaction. The framework includes perception, central control, and action modules, forming a closed loop for energy data analysis and response to user queries. Performance was evaluated using 120 user queries across four residential datasets, showing promising results in device control (86%), memory tasks (97%), scheduling (74%), and energy analysis (77%), but highlighting areas for improvement in cost estimation accuracy (49%).
该研究提出了一种基于大型语言模型(LLM)的建筑能源管理系统(BEMS)AI代理框架,通过自然语言交互来管理智能建筑中的能源消耗。该框架包括感知、中央控制和行动模块,形成一个闭环以进行能源数据的分析和响应。原型使用四个住宅能源数据集中的120个用户查询进行了评估,结果显示在设备控制、记忆任务和能源分析方面表现出色,但在成本估算准确性方面仍有改进空间。ANOVA测试显示了该框架的普适性。该研究旨在正式化LLM基于的BEMS AI代理的评估,并确定未来的研究方向。
AdaGReS: Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG
Authors: Chao Peng, Bin Wang, Zhilei Long, Jinfang Sheng
First: 2025-12-31T18:48:07+00:00 · Latest: 2025-12-31T18:48:07+00:00
Comments: Preprint. Under review
Abstract
Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.
中文标题/摘要
标题:AdaGReS:基于冗余感知评分的自适应贪婪上下文选择方法以优化令牌预算受限的RAG
检索增强生成(RAG)对所选上下文的质量极为敏感,但标准的top-k检索往往返回冗余或近似重复的片段,浪费令牌预算并降低下游生成质量。我们提出AdaGReS,一种针对令牌预算受限的RAG的冗余感知上下文选择框架,优化结合查询片段相关性和内部分冗余惩罚的集合级目标。AdaGReS 在令牌预算约束下进行贪婪选择,利用目标的边际收益,并引入了一种闭式、实例自适应的相关性-冗余权衡参数校准方法,以消除手动调参并适应候选池统计和预算限制。我们进一步提供理论分析,证明在实际嵌入相似条件下,所提出的目标具有ε-近似次模性,为贪婪选择提供近似最优性保证。在开放域问答(自然问题)和高冗余生物医学(药物)语料库上的实验表明,该方法在冗余控制和上下文质量方面表现出一致改进,转化为更好的端到端答案质量和不同场景下的鲁棒性。
Summary / 总结
AdaGReS is a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. It uses a greedy selection method under a token-budget constraint and introduces an adaptive calibration of the relevance-redundancy trade-off parameter. Experiments show consistent improvements in redundancy control and context quality, leading to better end-to-end answer quality and robustness in open-domain question answering and a high-redundancy biomedical corpus.
AdaGReS 是一种针对令牌预算 RAG 的冗余感知上下文选择框架,优化了结合查询片段相关性和内部分冗余惩罚的集合级目标。它在令牌预算约束下执行贪婪选择,并引入了相关性-冗余权衡参数的实例自适应校准。实验表明,它在冗余控制和上下文质量方面表现出一致的改进,从而提高了端到端的答案质量和鲁棒性。
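The greedy, redundancy-aware selection described for AdaGReS can be sketched as below. This is a minimal sketch under assumptions: the set-level objective is taken to be summed relevance minus `lam` times summed pairwise similarity, and `lam` is passed in by hand, whereas the paper derives a closed-form, instance-adaptive value that the abstract does not spell out. The function name and the toy scores are hypothetical.

```python
def greedy_select(relevance, similarity, token_counts, budget, lam):
    """Repeatedly add the chunk with the largest positive marginal gain that still fits the budget."""
    selected, used = [], 0
    remaining = set(range(len(relevance)))
    while remaining:
        best, best_gain = None, 0.0
        for i in sorted(remaining):
            if used + token_counts[i] > budget:
                continue  # chunk does not fit in the remaining token budget
            penalty = sum(similarity[i][j] for j in selected)
            gain = relevance[i] - lam * penalty  # marginal gain of adding chunk i
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # nothing fits, or nothing improves the objective
        selected.append(best)
        used += token_counts[best]
        remaining.remove(best)
    return selected, used

# Chunks 0 and 1 are near-duplicates; plain top-2 by relevance would return both.
relevance = [0.9, 0.88, 0.6, 0.3]
similarity = [
    [1.0, 0.95, 0.1, 0.1],
    [0.95, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
token_counts = [100, 100, 100, 100]
selected, used = greedy_select(relevance, similarity, token_counts, budget=200, lam=0.5)
```

In this toy run the second pick is chunk 2 rather than the near-duplicate chunk 1, because chunk 1's marginal gain (0.88 - 0.5 * 0.95 = 0.405) falls below chunk 2's (0.6 - 0.5 * 0.1 = 0.55).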
Semantic Parsing with Candidate Expressions for Knowledge Base Question Answering
Authors: Daehwan Nam, Gary Geunbae Lee
Venue: Expert Syst. Appl. 306 (2026) 130564
First: 2024-10-01T05:46:22+00:00 · Latest: 2025-12-31T18:45:49+00:00
Abstract
Semantic parsers convert natural language to logical forms, which can be evaluated on knowledge bases (KBs) to produce denotations. Recent semantic parsers have been developed with sequence-to-sequence (seq2seq) pre-trained language models (PLMs) or large language models, where the models treat logical forms as sequences of tokens. For syntactic and semantic validity, the semantic parsers use grammars that enable constrained decoding. However, these grammars cannot exploit the large amount of information in KBs, although logical forms contain representations of KB elements, such as entities or relations. In this work, we propose a grammar augmented with candidate expressions for semantic parsing on a large KB with a seq2seq PLM. The grammar defines actions as production rules, and our semantic parser predicts actions during inference under constraints imposed by types and candidate expressions. We apply the grammar to knowledge base question answering, where the constraints imposed by candidate expressions help the semantic parser generate valid KB elements. We also introduce two special rules, sub-type inference and union types, and a mask caching algorithm. In particular, sub-type inference and the mask caching algorithm greatly increase the decoding speed of our semantic parser. We experimented on two benchmarks, KQA Pro and Overnight, where the constraints imposed by candidate expressions increased the accuracy of our semantic parser, whether it was trained with strong supervision or weak supervision. In addition, our semantic parser had a fast decoding speed in the experiments. Our source code is publicly available at https://github.com/daehwannam/candexpr-sp.git.
中文标题/摘要
标题:基于候选表达式的语义解析在知识库问答中的应用
语义解析器将自然语言转换为逻辑形式,这些逻辑形式可以在知识库(KB)上进行评估以产生语义。最近开发的语义解析器使用序列到序列(seq2seq)预训练语言模型(PLMs)或大型语言模型,其中模型将逻辑形式视为标记序列。为了语法和语义的有效性,语义解析器使用语法来实现受限解码。然而,语法缺乏利用知识库大量信息的能力,尽管逻辑形式包含知识库元素(如实体或关系)的表示。在本工作中,我们提出了一种增强语法的方法,用于在大型知识库上使用seq2seq PLM进行语义解析。语法定义动作作为生产规则,我们的语义解析器在推理过程中在类型和候选表达式的约束下预测动作。我们将语法应用于知识库问答,候选表达式的约束有助于语义解析器生成有效的知识库元素。我们还引入了两种特殊规则,子类型推理和联合类型,以及掩码缓存算法。特别是,子类型推理和掩码缓存算法极大地提高了我们语义解析器的解码速度。我们在两个基准测试KQA Pro和Overnight上进行了实验,候选表达式的约束提高了我们语义解析器的准确性,无论其是否使用强监督或弱监督进行训练。此外,我们的语义解析器在实验中具有快速的解码速度。我们的源代码可在https://github.com/daehwannam/candexpr-sp.git上公开获取。
Summary / 总结
This paper proposes a semantic parser that uses candidate expressions to enhance syntactic and semantic validity in knowledge base question answering. The parser employs a seq2seq pre-trained language model with a grammar augmented by candidate expressions, which helps in generating valid KB elements. The parser also includes special rules and a mask caching algorithm to improve decoding speed. Experiments on KQA Pro and Overnight benchmarks show that the constraints by candidate expressions improve accuracy, regardless of the type of supervision, and the parser maintains fast decoding speed.
本文提出了一种将候选表达式集成到语法中的语义解析方法,用于知识库问答,采用序列到序列的预训练语言模型。该方法通过与知识库相关的约束和特殊规则来增强语法的语义有效性。实验结果表明,这些约束可以提高准确率,并且解析器具有较快的解码速度。
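Illustrative sketch for the entry above: one simple way to picture "constraints by candidate expressions" is a prefix tree over the token sequences of valid KB element names, queried at each decoding step for the tokens that keep the output inside the candidate set. The entity names, the whitespace tokenizer, and the `<end>` marker below are assumptions made for illustration; they are not taken from the paper or its repository.

```python
END = "<end>"  # hypothetical end-of-expression marker


def build_trie(candidates):
    """Build a nested-dict prefix tree over whitespace-tokenized candidates."""
    root = {}
    for text in candidates:
        node = root
        for token in text.split():
            node = node.setdefault(token, {})
        node[END] = {}  # a complete candidate expression ends here
    return root


def allowed_next_tokens(trie, prefix_tokens):
    """Return the tokens that keep the prefix inside the candidate set."""
    node = trie
    for token in prefix_tokens:
        if token not in node:
            return set()  # the prefix has already left the candidate set
        node = node[token]
    return set(node.keys())


# Hypothetical KB entity names used only for this example.
trie = build_trie(["New York City", "New Zealand", "Newton"])
print(allowed_next_tokens(trie, ["New"]))
```

In a real decoder the returned set would be turned into a mask over the vocabulary logits; that step is omitted here.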
End-to-End Test-Time Training for Long Context
Authors: Arnuv Tandon, Karan Dalal, Xinhao Li, Daniel Koceja, Marcel Rød, Sam Buchanan, Xiaolong Wang, Jure Leskovec, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin, Jed McCaleb, Yejin Choi, Yu Sun
First: 2025-12-29T18:30:14+00:00 · Latest: 2025-12-31T18:41:09+00:00
Comments: Code: https://github.com/test-time-training/e2e
Abstract
We formulate long-context language modeling as a problem in continual learning rather than architecture design. Under this formulation, we only use a standard architecture -- a Transformer with sliding-window attention. However, our model continues learning at test time via next-token prediction on the given context, compressing the context it reads into its weights. In addition, we improve the model's initialization for learning at test time via meta-learning at training time. Overall, our method, a form of Test-Time Training (TTT), is End-to-End (E2E) both at test time (via next-token prediction) and training time (via meta-learning), in contrast to previous forms. We conduct extensive experiments with a focus on scaling properties. In particular, for 3B models trained with 164B tokens, our method (TTT-E2E) scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context. Our code is publicly available.
中文标题/摘要
标题:长上下文端到端测试时训练
我们将长上下文语言建模视为连续学习问题,而不是架构设计问题。在这种表述下,我们仅使用标准架构——具有滑动窗口注意力的Transformer。然而,我们的模型在测试时通过给定上下文的下一个标记预测继续学习,将读取的上下文压缩到其权重中。此外,我们通过在训练时进行元学习改进了模型的初始化,以便在测试时学习。总体而言,我们的方法,一种形式的测试时训练(TTT),在测试时(通过下一个标记预测)和训练时(通过元学习)都是端到端的,与之前的版本不同。我们进行了广泛的实验,重点关注缩放特性。特别是对于使用164B标记训练的3B模型,我们的方法(TTT-E2E)在上下文长度上的缩放与具有全注意力的Transformer相同,而其他方法,如Mamba 2和Gated DeltaNet,则不然。然而,类似于RNNs,TTT-E2E具有恒定的推理延迟,无论上下文长度如何,使其在128K上下文长度时比全注意力快2.7倍。我们的代码已公开。
Summary / 总结
The research aims to address long-context language modeling by formulating it as a continual learning problem, using a standard Transformer architecture with sliding-window attention. The model continuously learns at test time through next-token prediction, compressing the context into its weights. The method, End-to-End Test-Time Training (TTT-E2E), improves initialization via meta-learning and scales similarly to full attention models with context length, while offering constant inference latency, making it more efficient for long contexts.
研究旨在通过持续学习方法解决长上下文语言建模问题,使用标准的具有滑动窗口注意力机制的Transformer架构。模型在测试时通过预测给定上下文的下一个词来持续学习,将上下文压缩到其权重中。该方法,端到端测试时训练(TTT-E2E),通过元学习改进初始化,并在上下文长度增加时与全注意力模型的扩展性相同。然而,它保持恒定的推理延迟,对于长上下文比全注意力模型快2.7倍。
Plan Verification for LLM-Based Embodied Task Completion Agents
Authors: Ananth Hariharan, Vardhan Dongre, Dilek Hakkani-Tür, Gokhan Tur
First: 2025-09-02T19:06:56+00:00 · Latest: 2025-12-31T18:31:30+00:00
Abstract
Large language model (LLM) based task plans and corresponding human demonstrations for embodied AI may be noisy, with unnecessary actions, redundant navigation, and logical errors that reduce policy quality. We propose an iterative verification framework in which a Judge LLM critiques action sequences and a Planner LLM applies the revisions, yielding progressively cleaner and more spatially coherent trajectories. Unlike rule-based approaches, our method relies on natural language prompting, enabling broad generalization across error types including irrelevant actions, contradictions, and missing steps. On a set of manually annotated actions from the TEACh embodied AI dataset, our framework achieves up to 90% recall and 100% precision across four state-of-the-art LLMs (GPT o4-mini, DeepSeek-R1, Gemini 2.5, LLaMA 4 Scout). The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, while improving both temporal efficiency and spatial action organization. Crucially, the method preserves human error-recovery patterns rather than collapsing them, supporting future work on robust corrective behavior. By establishing plan verification as a reliable LLM capability for spatial planning and action refinement, we provide a scalable path to higher-quality training data for imitation learning in embodied AI.
中文标题/摘要
标题:基于LLM的具身任务完成代理计划验证
基于大型语言模型(LLM)的任务计划及其对应的具身AI人类示范可能包含噪声,存在不必要的动作、冗余导航和逻辑错误,这些都会降低策略质量。我们提出了一种迭代验证框架,在该框架中,一个法官LLM批评动作序列,一个规划LLM应用修订,从而逐步生成更清洁且更具空间连贯性的轨迹。与基于规则的方法不同,我们的方法依赖于自然语言提示,能够广泛泛化不同类型错误,包括无关动作、矛盾和缺失步骤。在TEACh具身AI数据集中手动标注的动作集上,我们的框架在四个最先进的LLM(GPT o4-mini、DeepSeek-R1、Gemini 2.5、LLaMA 4 Scout)上实现了高达90%的召回率和100%的精确率。改进循环收敛迅速,96.5%的序列最多需要三轮迭代,同时提高了时间效率和空间动作组织。至关重要的是,该方法保留了人类错误恢复模式,而不是将其消除,支持未来关于稳健纠正行为的工作。通过将计划验证确立为空间规划和动作细化的可靠LLM能力,我们为具身AI中的模仿学习提供了可扩展的高质量训练数据路径。
Summary / 总结
The research aims to address the issues of noisy and logically flawed action sequences in large language model (LLM)-based task plans for embodied AI. It introduces an iterative verification framework where a Judge LLM critiques action sequences and a Planner LLM refines them, resulting in cleaner and more spatially coherent trajectories. The method, which uses natural language prompting, achieves up to 90% recall and 100% precision across four state-of-the-art LLMs. The refinement loop converges quickly, with 96.5% of sequences requiring at most three iterations, and it preserves human error-recovery patterns, enhancing both temporal efficiency and spatial action organization.
研究旨在解决基于大型语言模型(LLM)的体态AI任务计划中存在噪音和逻辑错误的问题。提出了一种迭代验证框架,其中裁判LLM评估动作序列,规划LLM进行改进,从而生成更清洁、更具空间连贯性的轨迹。该方法通过自然语言提示,在四个最先进的LLM中实现了高达90%的召回率和100%的精确率。改进循环快速收敛,96.5%的序列在最多三次迭代后即可完成,同时保留了人类的错误恢复模式,提高了时间和空间动作组织的效率。
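Illustrative sketch for the entry above: the Judge/Planner refinement loop can be pictured as alternating critique and revision until the judge raises no further objections or an iteration cap is reached. The function names, the cap of three rounds, and the toy rule-based stand-ins for the two LLMs are assumptions for illustration only; the paper's prompts and stopping rule are not reproduced here.

```python
def refine_plan(actions, judge, planner, max_iters=3):
    """Alternate critique and revision until the judge has no objections."""
    for iteration in range(1, max_iters + 1):
        critique = judge(actions)
        if not critique:  # an empty critique means the plan is accepted
            return actions, iteration - 1
        actions = planner(actions, critique)
    return actions, max_iters


# Toy stand-in for the Judge LLM: flag indices of immediately repeated actions.
def toy_judge(actions):
    return [i for i in range(1, len(actions)) if actions[i] == actions[i - 1]]


# Toy stand-in for the Planner LLM: drop the flagged actions.
def toy_planner(actions, critique):
    flagged = set(critique)
    return [a for i, a in enumerate(actions) if i not in flagged]


plan = ["goto fridge", "goto fridge", "open fridge", "pickup milk", "pickup milk"]
cleaned, rounds = refine_plan(plan, toy_judge, toy_planner)
print(cleaned, rounds)
```

Here `rounds` counts how many revisions were applied before the judge accepted the plan.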
Towards Generalisable Foundation Models for Brain MRI
Authors: Moona Mazher, Geoff J. M. Parker, Daniel C. Alexander
First: 2025-10-27T15:19:46+00:00 · Latest: 2025-12-31T18:26:04+00:00
Abstract
Foundation models in artificial intelligence (AI) are transforming medical imaging by enabling general-purpose feature learning from large-scale, unlabeled datasets. In this work, we introduce BrainFound, a self-supervised foundation model for brain MRI, built by extending DINO-v2, a vision transformer originally designed for 2D natural images. BrainFound adapts DINO-v2 to model full 3D brain anatomy by incorporating volumetric information from sequential MRI slices, moving beyond conventional single-slice paradigms. It supports both single- and multimodal inputs, enabling a broad range of downstream tasks, including disease detection and image segmentation, while generalising across varied imaging protocols and clinical scenarios. We show that BrainFound consistently outperforms existing self-supervised pretraining strategies and supervised baselines, particularly in label-scarce and multi-contrast settings. By integrating information from diverse 3D MRI modalities (e.g., T1, T2, FLAIR), it enhances diagnostic accuracy and reduces dependency on extensive expert annotations. This flexibility makes BrainFound a scalable and practical solution for 3D neuroimaging pipelines, with significant potential for clinical deployment and research innovation.
中文标题/摘要
标题:通用基础模型在脑MRI中的应用
人工智能(AI)中的基础模型正在通过从大规模未标记数据集中学习通用特征来改变医学成像。在本项研究中,我们介绍了BrainFound,这是一种基于DINO-v2扩展的自监督基础模型,用于脑MRI。BrainFound将DINO-v2扩展为通过结合连续MRI切片的体素信息来建模完整的3D脑解剖结构,超越了传统的单层成像范式。它支持单模态和多模态输入,能够执行一系列下游任务,包括疾病检测和图像分割,同时在不同的成像协议和临床场景中具有泛化能力。我们展示了BrainFound在标签稀缺和多对比度设置中始终优于现有的自监督预训练策略和监督基线。通过整合多种3D MRI模态(如T1、T2、FLAIR)的信息,它提高了诊断准确性并减少了对大量专家注释的依赖。这种灵活性使BrainFound成为3D神经成像管道的可扩展和实用解决方案,具有在临床部署和研究创新方面的巨大潜力。
Summary / 总结
The research aims to develop a generalisable foundation model for brain MRI by extending DINO-v2, a vision transformer, to handle 3D brain anatomy. BrainFound, the proposed model, incorporates volumetric information from sequential MRI slices and supports both single- and multimodal inputs, enabling various downstream tasks. Key findings show that BrainFound outperforms existing self-supervised pretraining strategies and supervised baselines, especially in label-scarce and multi-contrast settings, enhancing diagnostic accuracy and reducing the need for extensive expert annotations. This model is scalable and practical for 3D neuroimaging pipelines, with potential for clinical deployment and research innovation.
研究旨在通过将DINO-v2扩展到处理3D脑部解剖结构,开发一种通用的基础模型用于脑MRI。BrainFound模型整合了来自连续MRI切片的体素信息,并支持单模态和多模态输入,增强了其在各种下游任务中的适用性。关键发现表明,BrainFound在标签稀缺和多对比度设置中优于现有自监督预训练策略和监督基线,提高了诊断准确性并减少了对大量专家注释的依赖。
ResponseRank: Data-Efficient Reward Modeling through Preference Strength Learning
Authors: Timo Kaufmann, Yannick Metz, Daniel Keim, Eyke Hüllermeier
Venue: NeurIPS 2025
First: 2025-12-31T18:21:52+00:00 · Latest: 2025-12-31T18:21:52+00:00
Comments: NeurIPS 2025
Abstract
Binary choices, as often used for reinforcement learning from human feedback (RLHF), convey only the direction of a preference. A person may choose apples over oranges and bananas over grapes, but which preference is stronger? Strength is crucial for decision-making under uncertainty and generalization of preference models, but hard to measure reliably. Metadata such as response times and inter-annotator agreement can serve as proxies for strength, but are often noisy and confounded. We propose ResponseRank to address the challenge of learning from noisy strength signals. Our method uses relative differences in proxy signals to rank responses to pairwise comparisons by their inferred preference strength. To control for systemic variation, we compare signals only locally within carefully constructed strata. This enables robust learning of utility differences consistent with strength-derived rankings while making minimal assumptions about the strength signal. Our contributions are threefold: (1) ResponseRank, a novel method that robustly learns preference strength by leveraging locally valid relative strength signals; (2) empirical evidence of improved sample efficiency and robustness across diverse tasks: synthetic preference learning (with simulated response times), language modeling (with annotator agreement), and RL control tasks (with simulated episode returns); and (3) the Pearson Distance Correlation (PDC), a novel metric that isolates cardinal utility learning from ordinal accuracy.
中文标题/摘要
标题:ResponseRank:通过偏好强度学习实现高效的数据利用奖励建模
二元选择,如强化学习从人类反馈(RLHF)中常用的方式,只能传达偏好的方向。一个人可能选择苹果而不是橙子,香蕉而不是葡萄,但哪种偏好更强烈?强度对于在不确定性下做决策和偏好模型的泛化至关重要,但很难可靠地测量。元数据如响应时间以及注释者间的一致性可以作为强度的代理,但往往噪声较大且混杂。我们提出ResponseRank来解决从噪声强度信号中学习的挑战。我们的方法使用代理信号的相对差异来对成对比较的响应进行排序,以推断其偏好强度。为了控制系统性变化,我们仅在精心构建的层内局部比较信号。这使得在强度推导的排名中稳健地学习效用差异成为可能,同时对强度信号的假设最少。我们的贡献包括三个方面:(1) ResponseRank,一种新颖的方法,通过利用局部有效的相对强度信号稳健地学习偏好强度;(2) 在合成偏好学习(使用模拟响应时间)、语言建模(使用注释者一致性)和RL控制任务(使用模拟回合回报)等多样任务中,改进样本效率和稳健性的实证证据;(3) Pearson距离相关性(PDC),一种新颖的度量标准,能够从序数准确性中隔离基数效用学习。
Summary / 总结
ResponseRank is a method that addresses the challenge of learning preference strength from noisy signals by ranking responses based on relative differences in proxy signals. It uses local comparisons within carefully constructed strata to robustly learn utility differences, leading to improved sample efficiency and robustness across various tasks including synthetic preference learning, language modeling, and RL control tasks.
ResponseRank 是一种从嘈杂信号中学习偏好强度的方法,用于强化学习中的人类反馈。它通过相对差异的代理信号来对响应进行排序,并通过在局部构建的层内比较信号来控制系统性变化。该方法在合成偏好学习、语言建模和RL控制任务等多种任务中展示了改进的样本效率和鲁棒性。
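Illustrative sketch for the entry above: the idea of comparing proxy signals "only locally within strata" can be pictured as grouping pairwise comparisons by stratum and ordering them by the proxy signal inside each group, never across groups. The field names, the use of annotator identity as the stratum, and the assumption that a shorter response time indicates a stronger preference are all hypothetical choices for this example, not details confirmed by the abstract.

```python
from collections import defaultdict


def rank_within_strata(comparisons):
    """Order comparison ids by response time separately inside each stratum."""
    by_stratum = defaultdict(list)
    for item in comparisons:
        by_stratum[item["stratum"]].append(item)

    rankings = {}
    for stratum, items in by_stratum.items():
        # Assumption: faster responses are treated as stronger preferences.
        ordered = sorted(items, key=lambda item: item["response_time"])
        rankings[stratum] = [item["id"] for item in ordered]
    return rankings


# Hypothetical data: two annotators with very different baseline speeds.
comparisons = [
    {"id": "c1", "stratum": "annotator_a", "response_time": 2.0},
    {"id": "c2", "stratum": "annotator_a", "response_time": 0.8},
    {"id": "c3", "stratum": "annotator_b", "response_time": 9.0},
    {"id": "c4", "stratum": "annotator_b", "response_time": 12.0},
]
rankings = rank_within_strata(comparisons)
print(rankings)
```

Ranking inside each stratum means a slow annotator's quick answers are never compared with a fast annotator's slow ones, which is the kind of systematic variation the stratification is meant to control for.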
Convergence of the generalization error for deep gradient flow methods for PDEs
Authors: Chenguang Liu, Antonis Papapantoleon, Jasper Rou
First: 2025-12-31T18:11:51+00:00 · Latest: 2025-12-31T18:11:51+00:00
Comments: 28 pages
Abstract
The aim of this article is to provide a firm mathematical foundation for the application of deep gradient flow methods (DGFMs) for the solution of (high-dimensional) partial differential equations (PDEs). We decompose the generalization error of DGFMs into an approximation and a training error. We first show that the solution of PDEs that satisfy reasonable and verifiable assumptions can be approximated by neural networks, thus the approximation error tends to zero as the number of neurons tends to infinity. Then, we derive the gradient flow that the training process follows in the "wide network limit" and analyze the limit of this flow as the training time tends to infinity. These results combined show that the generalization error of DGFMs tends to zero as the number of neurons and the training time tend to infinity.
中文标题/摘要
标题:深度梯度流方法(DGFMs)求解偏微分方程(PDEs)的泛化误差收敛性
本文旨在为深度梯度流方法(DGFMs)求解(高维)偏微分方程(PDEs)提供坚实的数学基础。我们将DGFMs的泛化误差分解为逼近误差和训练误差。我们首先证明,在满足合理且可验证假设的情况下,偏微分方程的解可以通过神经网络逼近,因此随着神经元数量趋于无穷,逼近误差趋于零。然后,我们在“宽网络极限”下推导出训练过程遵循的梯度流,并分析该流在训练时间趋于无穷时的极限。这些结果表明,随着神经元数量和训练时间趋于无穷,DGFMs的泛化误差趋于零。
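Illustrative note for the entry above: a decomposition of this kind is typically a triangle inequality. The notation here is assumed for illustration and may differ from the paper's: $u$ is the PDE solution, $u^{n,*}$ a network with $n$ neurons that approximates it, $u^{n}_{t}$ the network obtained after training time $t$, and $\|\cdot\|$ a suitable norm.

```latex
\[
  \underbrace{\| u - u^{n}_{t} \|}_{\text{generalization error}}
  \;\le\;
  \underbrace{\| u - u^{n,*} \|}_{\text{approximation error}}
  \;+\;
  \underbrace{\| u^{n,*} - u^{n}_{t} \|}_{\text{training error}}
\]
```

Under this reading, the first term on the right vanishes as $n \to \infty$ and the second is controlled by the wide-network gradient flow as $t \to \infty$, which matches the two limits named in the abstract.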
MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes
Authors: Siddhant Agarwal, Adya Dhuler, Polly Ruhnke, Melvin Speisman, Md Shad Akhtar, Shweta Yadav
Venue: AAAI 2026
First: 2025-12-31T18:06:21+00:00 · Latest: 2025-12-31T18:06:21+00:00
Comments: Accepted by AAAI 2026
Abstract
Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing utilization of memes in expressing depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through Large Language Model (LLM)-generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1 and is established as the new benchmark compared to over 30 methods.
中文标题/摘要
标题:MAMA-Memeia!多方面多智能体协作识别表情包中的抑郁症状
近年来,表情包从单纯的幽默交流媒介演变为用户自由轻松表达各种情绪的平台。随着表情包在表达抑郁情绪方面的广泛应用,我们对在线社交媒体平台上用户分享的表情包中表现出的抑郁症状进行了研究。我们介绍了RESTOREx,这是一种通过大型语言模型生成和人类标注解释来检测社交媒体表情包中抑郁症状的重要资源。我们提出了MAMAMemeia,这是一种基于认知分析疗法(CAT)技能的多方面多智能体协作讨论框架。MAMAMemeia在宏F1上比当前最先进的方法提高了7.55%,并成为新的基准,超过了30多种方法。
Summary / 总结
The study aims to identify depressive symptoms in memes shared on social media. It introduces RESTOREx, a resource for detecting these symptoms using LLM-generated and human-annotated explanations. MAMAMemeia, a collaborative multi-agent multi-aspect framework based on Cognitive Analytic Therapy (CAT) competencies, is developed. MAMAMemeia outperforms existing methods by 7.55% in macro-F1 and sets a new benchmark.
研究旨在通过利用大型语言模型(LLM)生成和人工标注的解释来识别社交媒体上分享的 meme 中的抑郁症状。引入了基于认知分析疗法(CAT)技能的协作多方面多智能体框架 MAMAMemeia。MAMAMemeia 在宏观 F1 得分上比现有方法高出 7.55%,成为新的基准方法。
Diffusion Language Models are Provably Optimal Parallel Samplers
Authors: Haozhe Jiang, Nika Haghtalab, Lijie Chen
First: 2025-12-31T18:03:05+00:00 · Latest: 2025-12-31T18:03:05+00:00
Abstract
Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive models for faster inference via parallel token generation. We provide a rigorous foundation for this advantage by formalizing a model of parallel sampling and showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps. Consequently, whenever a target distribution can be generated using a small number of sequential steps, a DLM can be used to generate the distribution using the same number of optimal sequential steps. However, without the ability to modify previously revealed tokens, DLMs with CoT can still incur large intermediate footprints. We prove that enabling remasking (converting unmasked tokens to masks) or revision (converting unmasked tokens to other unmasked tokens) together with CoT further allows DLMs to simulate any parallel sampling algorithm with optimal space complexity. We further justify the advantage of revision by establishing a strict expressivity gap: DLMs with revision or remasking are strictly more expressive than those without. Our results not only provide a theoretical justification for the promise of DLMs as the most efficient parallel sampler, but also advocate for enabling revision in DLMs.
中文标题/摘要
标题:扩散语言模型是可证明最优的并行采样器
扩散语言模型(DLMs)已成为通过并行生成令牌实现更快推理的自回归模型的有前途的替代方案。我们通过形式化并行采样的模型并证明,带有多项式长度链式思考(CoT)的DLMs可以使用最优数量的顺序步骤模拟任何并行采样算法。因此,当目标分布可以使用少量顺序步骤生成时,DLMs可以使用相同的最优顺序步骤生成该分布。然而,由于无法修改已揭示的令牌,带有CoT的DLMs仍可能产生大量中间足迹。我们证明,与CoT一起启用重新遮盖(将未遮盖的令牌转换为遮盖)或修订(将未遮盖的令牌转换为其他未遮盖的令牌)可以进一步使DLMs能够以最优空间复杂度模拟任何并行采样算法。我们还通过建立严格的表达能力差距来进一步证明修订的优势:带有修订或重新遮盖的DLMs比没有这些功能的DLMs更具表达能力。我们的结果不仅为DLMs作为最高效的并行采样器的潜力提供了理论依据,还倡导在DLMs中启用修订。
Summary / 总结
The study provides a theoretical foundation for the efficiency of diffusion language models (DLMs) in parallel sampling, showing that DLMs augmented with polynomial-length chain-of-thought (CoT) can simulate any parallel sampling algorithm using an optimal number of sequential steps. Enabling remasking or revision in DLMs further optimizes space complexity, and DLMs with these capabilities are strictly more expressive than those without. This work supports the use of DLMs as the most efficient parallel samplers.
论文为扩散语言模型(DLMs)在并行采样中的高效性提供了理论基础,表明DLMs结合多项式长度的链式思考(CoT)可以使用最优的顺序步骤模拟任何并行采样算法。启用重遮盖或修订功能进一步优化了空间复杂性,作者证明了具有这些能力的DLMs比没有这些能力的DLMs更具表达力。这项工作支持DLMs作为最高效的并行采样器的使用。
FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM
Authors: Yuchen Wu, Jiahe Li, Fabio Tosi, Matteo Poggi, Jin Zheng, Xiao Bai
First: 2025-12-31T17:57:45+00:00 · Latest: 2025-12-31T17:57:45+00:00
Abstract
We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the absence of geometric consistency in previous flow-based approaches for accurate and robust tracking and mapping. Our core idea is to bridge flow estimation with geometric reasoning by leveraging the guidance from foundation depth models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe pose and depth under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments demonstrate that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets, while running in real-time at 18 FPS, demonstrating strong generalization to various scenarios and practical applicability of our method.
中文标题/摘要
标题:FoundationSLAM:利用基础深度模型释放流基础方法在端到端密集视觉SLAM中的潜力
我们提出了FoundationSLAM,一种基于学习的单目密集SLAM系统,解决了先前基于流的方法中缺乏几何一致性的问题,以实现准确和鲁棒的跟踪和建图。我们的核心思想是通过利用基础深度模型的指导,将流估计与几何推理相结合。为此,我们首先开发了一种混合流网络,生成几何感知的对应关系,使不同关键帧之间的深度和姿态推断保持一致。为了确保全局一致性,我们提出了一种双向一致束调整层,该层在多视图约束下联合优化关键帧姿态和深度。此外,我们引入了一种可靠性感知精炼机制,通过区分可靠和不确定区域动态调整流更新过程,形成匹配与优化之间的闭环。广泛的实验表明,FoundationSLAM在多个具有挑战性的数据集上实现了优越的轨迹精度和密集重建质量,同时以每秒18帧的速度实时运行,展示了我们的方法在各种场景下的强大泛化能力和实际应用价值。
Summary / 总结
FoundationSLAM is a learning-based monocular dense SLAM system that improves upon previous flow-based approaches by incorporating geometric consistency through the use of foundation depth models. It employs a Hybrid Flow Network to generate geometry-aware correspondences and a Bi-Consistent Bundle Adjustment Layer to enforce global consistency. Additionally, it includes a Reliability-Aware Refinement mechanism to adaptively refine flow updates. Experiments show that FoundationSLAM provides superior trajectory accuracy and dense reconstruction quality, running at real-time speeds of 18 FPS across various datasets.
FoundationSLAM 是一种单目密集SLAM系统,通过整合基础模型的深度信息来提高跟踪和建图的准确性和鲁棒性。它使用 Hybrid Flow Network 生成几何感知的对应关系,并使用 Bi-Consistent Bundle Adjustment Layer 来确保全局一致性。此外,还引入了 Reliability-Aware Refinement 机制来动态调整流更新过程。实验表明,FoundationSLAM 在多个数据集上在轨迹准确性和密集重建质量方面优于先前的方法,同时保持实时性能,每秒 18 帧。
Efficiently Estimating Data Efficiency for Language Model Fine-tuning
Authors: Gyung Hyun Je, Colin Raffel
First: 2025-12-31T17:37:29+00:00 · Latest: 2025-12-31T17:37:29+00:00
Abstract
While large language models (LLMs) demonstrate reasonable zero-shot capability across many downstream tasks, fine-tuning is a common practice to improve their performance. However, a task's data efficiency--i.e., the number of fine-tuning examples needed to achieve a desired level of performance--is often unknown, resulting in costly cycles of incremental annotation and retraining. Indeed, we demonstrate across a curated set of 30 specialized tasks that performant LLMs may struggle zero-shot but can attain stronger performance after fine-tuning. This motivates the need for methods to predict a task's data efficiency without requiring incremental annotation. After introducing a concrete metric that quantifies a task's data efficiency, we propose using the gradient cosine similarity of low-confidence examples to predict data efficiency based on a small number of labeled samples. We validate our approach on a diverse set of tasks with varying data efficiencies, attaining 8.6% error in overall data efficiency prediction and typically eliminating hundreds of unnecessary annotations on each task. Our experiment results and implementation code are available on GitHub.
中文标题/摘要
标题:高效估计语言模型微调的数据效率
尽管大型语言模型(LLMs)在许多下游任务中表现出合理的零样本能力,但微调是提高其性能的常见做法。然而,任务的数据效率——即达到期望性能水平所需的微调示例数量——通常未知,导致昂贵的逐步注释和重新训练循环。事实上,我们证明在30个精心挑选的专业化任务上,表现良好的LLMs可能在零样本情况下表现不佳,但在微调后可以取得更强的性能。这促使需要方法来预测任务的数据效率,而无需逐步注释。在引入一个具体的度量标准来量化任务的数据效率后,我们提出使用少量标记样本的低置信度示例的梯度余弦相似性来预测数据效率。我们在具有不同数据效率的多样化任务上验证了我们的方法,总体数据效率预测误差为8.6%,通常在每个任务上消除数百个不必要的注释。我们的实验结果和实现代码可在GitHub上获得。
Summary / 总结
This study addresses the challenge of determining the data efficiency of a task for language model fine-tuning, i.e., how many fine-tuning examples are needed to reach a desired level of performance. Across a curated set of 30 specialized tasks, the authors show that performant LLMs may struggle zero-shot but improve after fine-tuning. They propose using the gradient cosine similarity of low-confidence examples to predict data efficiency from a small number of labeled samples. Validated on a diverse set of tasks with varying data efficiencies, the approach attains 8.6% error in overall data efficiency prediction and typically eliminates hundreds of unnecessary annotations on each task.
研究旨在解决确定语言模型微调所需数据效率的难题,这对于优化性能同时减少过度注解至关重要。作者提出了一种方法,利用低置信度样本的梯度余弦相似度来预测数据效率,基于少量标记样本。他们在这30个不同任务上验证了该方法,总体数据效率预测误差率为8.6%,并在每个任务上减少了数百次不必要的注解。
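Illustrative sketch for the entry above: the quantity named in the abstract, gradient cosine similarity among low-confidence examples, can be pictured as an average pairwise cosine over the per-example gradients of those examples. The confidence threshold, the plain averaging, and the toy two-dimensional gradients below are assumptions for illustration; the abstract does not state how the authors aggregate the similarities or map them to a data-efficiency prediction.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def low_confidence_gradient_similarity(grads, confidences, threshold=0.5):
    """Mean pairwise cosine over gradients of examples below the threshold."""
    selected = [g for g, c in zip(grads, confidences) if c < threshold]
    if len(selected) < 2:
        raise ValueError("need at least two low-confidence examples")
    pairs = list(combinations(selected, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Hypothetical per-example gradients and model confidences.
grads = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]
confidences = [0.2, 0.3, 0.4, 0.9]
score = low_confidence_gradient_similarity(grads, confidences)
print(round(score, 4))
```

In practice the gradients would come from a model's loss on each labeled example and would be far higher-dimensional; only the similarity computation is shown.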
PhysTalk: Language-driven Real-time Physics in 3D Gaussian Scenes
Authors: Luca Collorone, Mert Kiray, Indro Spinelli, Fabio Galasso, Benjamin Busam
First: 2025-12-31T17:32:31+00:00 · Latest: 2025-12-31T17:32:31+00:00
Abstract
Realistic visual simulations are omnipresent, yet their creation requires computing time, rendering, and expert animation knowledge. Open-vocabulary visual effects generation from text inputs emerges as a promising solution that can unlock immense creative potential. However, current pipelines lack both physical realism and effective language interfaces, requiring slow offline optimization. In contrast, PhysTalk takes a 3D Gaussian Splatting (3DGS) scene as input and translates arbitrary user prompts into real-time, physics-based, interactive 4D animations. A large language model (LLM) generates executable code that directly modifies 3DGS parameters through lightweight proxies and particle dynamics. Notably, PhysTalk is the first framework to couple 3DGS directly with a physics simulator without relying on time-consuming mesh extraction. While remaining open-vocabulary, this design enables interactive 3D Gaussian animation via collision-aware, physics-based manipulation of arbitrary, multi-material objects. Finally, PhysTalk is train-free and computationally lightweight: this makes 4D animation broadly accessible and shifts these workflows from a "render and wait" paradigm toward an interactive dialogue with a modern, physics-informed pipeline.
中文标题/摘要
标题:PhysTalk: 3D 高斯场景中的语言驱动实时物理
逼真的视觉模拟无处不在,但其创建需要计算时间、渲染和专家动画知识。从文本输入生成开放词汇视觉效果成为一种有前景的解决方案,可以释放巨大的创意潜力。然而,当前的工作流程缺乏物理现实性和有效的语言界面,需要缓慢的离线优化。相比之下,PhysTalk 以3D 高斯点绘(3DGS)场景为输入,将任意用户提示翻译成实时、基于物理的4D 动画。一个大型语言模型(LLM)生成可执行代码,直接通过轻量级代理和粒子动力学修改3DGS 参数。值得注意的是,PhysTalk 是第一个直接将3DGS 与物理模拟器结合的框架,无需依赖耗时的网格提取。尽管保持开放词汇,这种设计使得通过碰撞感知的物理基础操作任意多材料对象的3D 高斯动画交互式成为可能。最后,PhysTalk 是无训练的且计算量轻:这使得4D 动画广泛可及,并将这些工作流程从“渲染等待”范式转向与现代、物理启发式管道的互动对话。
Summary / 总结
PhysTalk is a framework that translates user prompts into real-time, physics-based 4D animations using a 3D Gaussian Splatting (3DGS) scene as input. It leverages a large language model to generate executable code that modifies 3DGS parameters through lightweight proxies and particle dynamics, enabling interactive and collision-aware manipulation of objects. This approach avoids the need for time-consuming mesh extraction and provides a train-free, computationally lightweight solution for creating interactive 4D animations, shifting the workflow from a 'render and wait' paradigm to an interactive dialogue with a physics-informed pipeline.
PhysTalk 是一个框架,通过将用户提示转化为实时的物理基础4D动画,使用3D高斯散点图(3DGS)场景作为输入。它利用大型语言模型生成可执行代码,通过轻量级代理和粒子动力学修改3DGS参数,实现交互式的、碰撞感知的对象操作。这种方法避免了耗时的网格提取需求,并提供了一个无需训练、计算量轻的解决方案,用于创建交互式4D动画,将工作流从“渲染等待”模式转变为一个与现代物理基础管道进行交互对话的模式。
DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
First: 2025-12-31T17:31:29+00:00 · Latest: 2025-12-31T17:31:29+00:00
Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)
Abstract
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
中文标题/摘要
标题:DarkEQA:在低光室内环境中评估视觉语言模型的实体问答能力
视觉语言模型(VLMs)越来越多地被用作实体代理的核心推理模块。现有的基准测试在理想的、光线充足的条件下评估其能力,但全天候运行的需求则要求其在广泛的视觉退化条件下表现出色,包括夜间或黑暗环境中的低光条件——这一核心需求已被很大程度上忽视。为应对这一未充分探索的挑战,我们提出了DarkEQA,这是一个开源基准,用于在多级低光条件下评估与实体问答(EQA)相关的感知基本能力。DarkEQA通过在受控退化条件下从第一人称观察中进行问答评估,隔离了感知瓶颈,从而实现可归因的鲁棒性分析。DarkEQA的一个关键设计特点是其物理保真度:视觉退化在线性RAW空间中建模,模拟基于物理的照明下降和传感器噪声,随后通过ISP启发式的渲染管道。我们通过评估一系列最先进的VLMs和低光图像增强(LLIE)模型展示了DarkEQA的实用性。我们的分析系统地揭示了这些视觉条件下的操作限制。我们的代码和基准数据集将在接受后发布。
Summary / 总结
DarkEQA is a benchmark designed to evaluate vision-language models (VLMs) under low-light conditions, addressing the lack of robustness testing in existing benchmarks. It uses a controlled degradation process in RAW space to simulate low-light environments, and evaluates perceptual primitives through question answering from egocentric observations. Key findings show that state-of-the-art VLMs perform poorly under these conditions, highlighting their limitations in low-light settings.
DarkEQA 是一个基准,旨在评估视觉-语言模型在低光条件下的表现,解决了24/7稳健运行的未充分探索挑战。它通过线性RAW空间中的可控降级来模拟低光环境,并评估用于体感问答的感知基本能力。研究结果表明,最先进的视觉-语言模型在低光条件下表现不佳,突显了它们在实际应用中的局限性。
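Illustrative sketch for the entry above: degrading an image in linear space rather than on display-encoded values can be pictured as (1) undoing the display encoding, (2) scaling down the illumination, (3) adding signal-dependent and signal-independent noise, and (4) re-encoding. The pure power-law gamma, the Gaussian approximation of shot noise, and every parameter value below are simplifying assumptions; this is not the benchmark's pipeline, which the abstract describes only as physics-based degradation in linear RAW space followed by an ISP-inspired rendering step.

```python
import random


def simulate_low_light(pixels, exposure_scale=0.1, shot_gain=0.01,
                       read_sigma=0.005, gamma=2.2, seed=0):
    """Darken display-encoded intensities in [0, 1] with noise added in linear space."""
    rng = random.Random(seed)
    degraded = []
    for value in pixels:
        linear = value ** gamma                    # undo display gamma (simplified)
        dimmed = linear * exposure_scale           # illumination drop
        shot_sigma = (shot_gain * dimmed) ** 0.5   # noise std grows with signal
        noisy = dimmed + rng.gauss(0.0, shot_sigma) + rng.gauss(0.0, read_sigma)
        clipped = min(max(noisy, 0.0), 1.0)        # keep values in the valid range
        degraded.append(clipped ** (1.0 / gamma))  # re-encode for display
    return degraded


# Hypothetical flat grey patch of 1000 pixels.
bright = [0.8] * 1000
dark = simulate_low_light(bright)
print(sum(dark) / len(dark))
```

Applying the scale and noise before re-encoding is what distinguishes this from simply multiplying the displayed pixel values by a constant.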
Interpretable Perturbation Modeling Through Biomedical Knowledge Graphs
Authors: Pascal Passigan, Kevin Zhu, Angelina Ning
First: 2025-12-24T04:42:25+00:00 · Latest: 2025-12-31T17:30:56+00:00
Abstract
Understanding how small molecules perturb gene expression is essential for uncovering drug mechanisms, predicting off-target effects, and identifying repurposing opportunities. While prior deep learning frameworks have integrated multimodal embeddings into biomedical knowledge graphs (BKGs) and further improved these representations through graph neural network message-passing paradigms, these models have been applied to tasks such as link prediction and binary drug-disease association, rather than the task of gene perturbation, which may unveil more about mechanistic transcriptomic effects. To address this gap, we construct a merged biomedical graph that integrates (i) PrimeKG++, an augmentation of PrimeKG containing semantically rich embeddings for nodes with (ii) LINCS L1000 drug and cell line nodes, initialized with multimodal embeddings from foundation models such as MolFormerXL and BioBERT. Using this heterogeneous graph, we train a graph attention network (GAT) with a downstream prediction head that learns the delta expression profile of over 978 landmark genes for a given drug-cell pair. Our results show that our framework outperforms MLP baselines for differentially expressed genes (DEG) -- which predict the delta expression given a concatenated embedding of drug features, target features, and baseline cell expression -- under the scaffold and random splits. Ablation experiments with edge shuffling and node feature randomization further demonstrate that the edges provided by biomedical KGs enhance perturbation-level prediction. More broadly, our framework provides a path toward mechanistic drug modeling: moving beyond binary drug-disease association tasks to granular transcriptional effects of therapeutic intervention.
中文标题/摘要
标题:通过生物医学知识图谱实现可解释的扰动建模
理解小分子如何扰动基因表达对于揭示药物机制、预测旁路效应以及识别再定位机会至关重要。尽管先前的深度学习框架已经将多模态嵌入整合到生物医学知识图谱(BKGs)中,并通过图神经网络消息传递范式进一步改进了这些表示,但这些模型的应用主要集中在链接预测和二元药物-疾病关联任务上,而不是基因扰动任务,后者可能揭示更多关于机制转录组效应的信息。为了解决这一差距,我们构建了一个集成图,该图结合了(i) PrimeKG++,这是PrimeKG的扩充版本,包含节点的语义丰富嵌入,以及(ii) LINCS L1000药物和细胞系节点,这些节点使用来自MolFormerXL和BioBERT等基础模型的多模态嵌入进行初始化。使用这个异构图,我们训练了一个带有下游预测头的图注意网络(GAT),该网络学习给定药物-细胞对的978个标志性基因的差异表达谱。我们的结果显示,在不同表达基因(DEG)的支架和随机分割下,我们的框架优于MLP基线——该基线预测差异表达给定药物特征、靶标特征和基线细胞表达的连接嵌入。通过边洗牌和节点特征随机化进行的消融实验进一步证明,生物医学KG提供的边增强了扰动级预测。更广泛地说,我们的框架为机制药物建模提供了一条途径:从二元药物-疾病关联任务转向治疗干预的粒度转录效应。
Summary / 总结
The research aims to understand how small molecules affect gene expression to uncover drug mechanisms and predict off-target effects. The authors construct a merged biomedical graph integrating semantically rich embeddings from PrimeKG++ and LINCS L1000 drug and cell line nodes, and train a graph attention network (GAT) to predict the delta expression profile of over 978 landmark genes for a given drug-cell pair. The results show that their framework outperforms MLP baselines for differentially expressed genes and that the edges from biomedical knowledge graphs enhance perturbation-level prediction.
该研究旨在通过整合来自PrimeKG++和LINCS L1000药物及细胞系节点的语义丰富嵌入,提高对小分子如何影响基因表达的理解。基于此图的图注意力网络(GAT)在预测药物-细胞对的差异基因表达方面优于MLP基线,并且消融实验进一步证实了生物医学知识图谱边在增强扰动级预测中的重要性。
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
Authors: Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig
First: 2025-12-19T04:09:24+00:00 · Latest: 2025-12-31T17:30:11+00:00
Abstract
While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
中文标题/摘要
标题:DAVE:一种用于文档理解和网络代理的VLM视觉编码器
尽管视觉语言模型(VLMs)在多模态任务中表现出色,但它们所选择的视觉编码器存在根本性弱点:其低级特征缺乏文档理解和网络代理所需的稳健的结构和空间信息。为弥补这一差距,我们引入了DAVE,一种专为VLMs设计并针对这些任务定制的视觉编码器。我们的训练管道旨在利用大量未标注数据,以绕过对文档和网络图像的大规模注释成本。我们首先在未标注图像上进行自我监督预训练阶段,然后在监督自回归预训练阶段,模型从有限的高质量数据中学习解析和定位等任务。在监督阶段内,我们采用了两种策略来提高编码器与通用视觉知识和多样化文档及网络代理任务的对齐:(i) 我们引入了一种新的模型合并方案,将使用不同文本解码器训练的编码器结合在一起,以确保与不同网络代理架构的广泛兼容性。(ii) 我们使用集成训练将预训练的通用编码器(如SigLIP2)的特征与我们自己的文档和网络特定表示融合在一起。在经典文档任务、VQAs、网络定位和基于代理的基准测试中的广泛实验验证了我们方法的有效性,确立了DAVE作为文档和网络应用的强大视觉编码器的地位。
Summary / 总结
DAVE is a vision encoder designed to improve vision-language models (VLMs) on document understanding and web agent tasks. Training starts with self-supervised pretraining on abundant unlabeled images, followed by supervised autoregressive pretraining on limited, high-quality data for tasks such as parsing and localization. Within the supervised stage, a model-merging scheme combines encoders trained with different text decoders, and ensemble training fuses features from pretrained generalist encoders (e.g., SigLIP2) with document- and web-specific representations. Experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the approach.
研究旨在通过改进视觉语言模型(VLMs)的视觉编码器,增强其在文档理解和网页代理方面的性能。提出了专门的视觉编码器DAVE,通过无标签数据的自监督预训练阶段和有限高质量数据的监督自回归预训练阶段进行训练。采用模型合并和集成训练等策略,以提高编码器对通用视觉知识和特定文档及网页任务的适应性。实验验证了DAVE在各种文档和网页应用中的有效性,使其成为这些领域的强大视觉编码器。
A Modal Logic for Possibilistic Reasoning with Fuzzy Formal Contexts
Authors: Prosenjit Howlader, Churn-Jung Liau
First: 2025-12-31T17:27:36+00:00 · Latest: 2025-12-31T17:27:36+00:00
Comments: 25 pages
Abstract
We introduce a two-sort weighted modal logic for possibilistic reasoning with fuzzy formal contexts. The syntax of the logic includes two types of weighted modal operators corresponding to classical necessity ($\Box$) and sufficiency ($\boxminus$) modalities and its formulas are interpreted in fuzzy formal contexts based on possibility theory. We present its axiomatization that is \emph{sound} with respect to the class of all fuzzy context models. In addition, both the necessity and sufficiency fragments of the logic are also individually complete with respect to the class of all fuzzy context models. We highlight the expressive power of the logic with some illustrative examples. As a formal context is the basic construct of formal concept analysis (FCA), we generalize three main notions in FCA, i.e., formal concepts, object oriented concepts, and property oriented concepts, to their corresponding $c$-cut concepts in fuzzy formal contexts. Then, we show that our logical language can represent all three of these generalized notions. Finally, we demonstrate the possibility of extending our logic to reasoning with multi-relational fuzzy contexts, in which the Boolean combinations of different fuzzy relations are allowed.
中文标题/摘要
标题:一种基于模糊形式背景的可能推理的模态逻辑
我们引入了一种双类加权模态逻辑,用于模糊形式背景下的可能性推理。该逻辑的语法包括对应于经典必然性($\Box$)和充分性($\boxminus$)模态的两类带权重的模态运算符,其公式基于可能性理论在模糊形式背景中进行解释。我们给出了该逻辑的公理化,该公理化相对于所有模糊背景模型类是\emph{可靠}的。此外,该逻辑的必然性片段和充分性片段也各自相对于所有模糊背景模型类是完备的。我们通过一些示例说明了该逻辑的表达能力。由于形式背景是形式概念分析(FCA)的基本构造,我们将FCA中的三个主要概念,即形式概念、面向对象的概念和面向属性的概念,推广到模糊形式背景中相应的$c$-切概念。然后,我们证明了我们的逻辑语言可以表示这三个推广的概念。最后,我们展示了将我们的逻辑扩展到多关系模糊背景推理的可能性,其中允许不同模糊关系的布尔组合。
Summary / 总结
This paper introduces a two-sort weighted modal logic for possibilistic reasoning with fuzzy formal contexts, with weighted necessity and sufficiency modal operators. The axiomatization is shown to be sound with respect to the class of all fuzzy context models, and the necessity and sufficiency fragments are each individually complete with respect to that class. The logic can represent the $c$-cut generalizations of formal concepts, object oriented concepts, and property oriented concepts in fuzzy formal contexts, and it can potentially be extended to multi-relational fuzzy contexts.
本文引入了一种用于模糊形式背景下可能性推理的双类加权模态逻辑,包含带权重的必然性和充分性模态运算符。该逻辑的公理化相对于所有模糊背景模型类是可靠的,其必然性片段和充分性片段各自相对于该模型类是完备的。此外,本文将形式概念分析中的三个主要概念推广到模糊形式背景中的$c$-切概念,并证明该逻辑语言能够表示这些概念。最后,概述了将该逻辑扩展到多关系模糊背景推理的可能性。
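To make the necessity and sufficiency modalities concrete, the sketch below evaluates their classical (crisp) versions on a cut of a fuzzy relation. It is only an illustration under stated assumptions: the function names are hypothetical, the threshold cut is one plausible reading of a $c$-cut, and the paper's weighted operators and possibilistic semantics are not reproduced here.

```python
def cut(fuzzy_relation, c):
    """Crisp relation holding the pairs whose membership degree is at least c."""
    return {pair for pair, degree in fuzzy_relation.items() if degree >= c}


def necessity(relation, objects, attribute_set):
    """Objects all of whose related attributes lie in attribute_set."""
    return {g for g in objects
            if all(m in attribute_set for (h, m) in relation if h == g)}


def sufficiency(relation, objects, attribute_set):
    """Objects related to every attribute in attribute_set."""
    return {g for g in objects
            if all((g, m) in relation for m in attribute_set)}


# Toy fuzzy context (assumed data): degrees to which objects have properties.
fuzzy = {("x", "p"): 0.9, ("x", "q"): 0.4,
         ("y", "p"): 0.7, ("y", "q"): 0.8}
objects = {"x", "y"}
crisp = cut(fuzzy, 0.5)
```

On this toy context, `necessity(crisp, objects, {"p"})` keeps only the objects with no related property outside `{"p"}`, while `sufficiency(crisp, objects, {"p", "q"})` keeps the objects that have both properties; the latter is the derivation operator used to build formal concepts in FCA.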
Kolmogorov-Arnold Energy Models: Fast and Interpretable Generative Modeling
Authors: Prithvi Raj
First: 2025-06-17T04:07:32+00:00 · Latest: 2025-12-31T17:07:22+00:00
Abstract
Learning an energy-based model (EBM) in the latent space of a top-down generative model offers a powerful framework for generation across many data modalities. However, it remains unclear how its interpretability can be used to guide model design, improve generative quality, and reduce training time. Moreover, the reliance on Langevin Monte Carlo (LMC) sampling presents challenges in efficiency and sampling multimodal latent distributions. We propose a novel adaptation of the Kolmogorov-Arnold representation theorem for generative modeling and introduce the Kolmogorov-Arnold Energy Model (KAEM) to take advantage of structural and inductive biases. By constraining the prior to univariate relationships, KAEM enables fast and exact inference via the inverse transform method. With the low dimensionality of the latent space and suitable inductive biases encoded, we demonstrate that importance sampling (IS) becomes a viable, unbiased, and highly efficient posterior sampler. For domains where IS fails, we introduce a strategy based on population-based LMC, decomposing the posterior into a sequence of annealed distributions to improve LMC mixing. KAEM balances common generative modeling trade-offs, offering fast inference, interpretability, and stable training, while being naturally suited to Zettascale Computing hardware.
中文标题/摘要
标题:柯尔莫哥洛夫-阿诺尔德能量模型:快速且可解释的生成建模
在自顶向下生成模型的潜在空间中学习基于能量的模型(EBM)为跨多种数据模态的生成提供了强大的框架。然而,其可解释性如何用于指导模型设计、提高生成质量并减少训练时间仍不清楚。此外,对朗之万蒙特卡洛(LMC)采样的依赖在效率和多模态潜在分布采样方面带来了挑战。我们提出了柯尔莫哥洛夫-阿诺尔德表示定理在生成建模中的一种新颖改编,并引入了柯尔莫哥洛夫-阿诺尔德能量模型(KAEM)以利用结构和归纳偏置。通过将先验约束为单变量关系,KAEM 借助逆变换法实现快速且精确的推理。凭借潜在空间的低维度和所编码的合适归纳偏置,我们展示了重要性采样(IS)成为一种可行、无偏且高度高效的后验采样器。对于 IS 失败的领域,我们引入了一种基于群体的 LMC 策略,将后验分解为一系列退火分布以改善 LMC 混合。KAEM 平衡了常见的生成建模权衡,提供快速推理、可解释性和稳定训练,同时自然适合泽塔级计算硬件。
Summary / 总结
The research aims to enhance the interpretability and efficiency of energy-based models (EBMs) learned in the latent space of a top-down generative model. The authors propose the Kolmogorov-Arnold Energy Model (KAEM), which adapts the Kolmogorov-Arnold representation theorem to constrain the prior to univariate relationships, enabling fast and exact inference via the inverse transform method. With a low-dimensional latent space and suitable inductive biases, importance sampling (IS) becomes a viable, unbiased, and efficient posterior sampler; for domains where IS fails, a population-based Langevin Monte Carlo (LMC) strategy decomposes the posterior into a sequence of annealed distributions to improve LMC mixing. The abstract reports fast inference, interpretability, and stable training, and describes the model as naturally suited to Zettascale Computing hardware.
研究旨在提高生成模型中能量模型(EBM)的可解释性和效率。作者提出了Kolmogorov-Arnold Energy Model(KAEM),利用Kolmogorov-Arnold表示定理将先验约束为单变量关系,从而实现快速且精确的推理。KAEM还使用重要性采样(IS)进行高效的后验采样,并为IS不可行的领域引入了一种基于群体的Langevin Monte Carlo(LMC)策略。实验结果表明,KAEM提供了快速推理、可解释性和稳定的训练,并且适合Zettascale Computing硬件。
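The inverse transform method named in the abstract can be illustrated on a single univariate density. The following is a minimal sketch, assuming an unnormalized log-density tabulated on a uniform grid; `build_cdf`, `sample`, and the grid discretization are illustrative choices and are not taken from the KAEM paper.

```python
import bisect
import math
import random


def build_cdf(log_density, lo, hi, n_bins=1000):
    """Tabulate a normalized CDF for exp(log_density(z)) on [lo, hi] (midpoint rule)."""
    width = (hi - lo) / n_bins
    mids = [lo + (i + 0.5) * width for i in range(n_bins)]
    weights = [math.exp(log_density(z)) for z in mids]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    cdf[-1] = 1.0  # guard against floating-point round-off
    return mids, cdf


def sample(mids, cdf, u):
    """Map a uniform draw u in [0, 1) to a grid point through the inverse CDF."""
    return mids[bisect.bisect_left(cdf, u)]


# Example: a standard normal energy E(z) = z^2 / 2, truncated to [-5, 5].
mids, cdf = build_cdf(lambda z: -0.5 * z * z, -5.0, 5.0)
random.seed(0)
draws = [sample(mids, cdf, random.random()) for _ in range(5)]
```

Because each draw needs only one uniform random number and one table lookup, no iterative sampler is involved, which is the property that makes univariate priors convenient in this setting.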
Large language models and the entropy of English
Authors: Colin Scheibner, Lindsay M. Smith, William Bialek
First: 2025-12-31T16:54:44+00:00 · Latest: 2025-12-31T16:54:44+00:00
Comments: 8 pages, 6 figures
Abstract
We use large language models (LLMs) to uncover long-ranged structure in English texts from a variety of sources. The conditional entropy or code length in many cases continues to decrease with context length at least to $N\sim 10^4$ characters, implying that there are direct dependencies or interactions across these distances. A corollary is that there are small but significant correlations between characters at these separations, as we show from the data independent of models. The distribution of code lengths reveals an emergent certainty about an increasing fraction of characters at large $N$. Over the course of model training, we observe different dynamics at long and short context lengths, suggesting that long-ranged structure is learned only gradually. Our results constrain efforts to build statistical physics models of LLMs or language itself.
Summary / 总结
This study uses large language models to explore long-ranged structure in English texts from a variety of sources. In many cases the conditional entropy (code length) continues to decrease with context length at least to around $10^4$ characters, implying direct dependencies across these distances and, as a corollary, small but significant correlations between characters at these separations. The distribution of code lengths shows an emergent certainty about an increasing fraction of characters at large context length. Dynamics during model training differ at long and short context lengths, suggesting that long-ranged structure is learned only gradually. These results constrain efforts to build statistical physics models of LLMs or of language itself.
研究使用大型语言模型探索英语文本中的长距离结构。研究发现,条件熵或编码长度随着上下文长度的增加而减少,表明这些距离之间存在直接依赖关系。编码长度的分布显示,随着上下文长度的增长,对字符的确定性也在增加。研究还表明,长距离结构在模型训练过程中逐渐形成,这限制了构建语言统计物理模型的努力。
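The quantity tracked in this abstract, the conditional entropy of the next character given the previous $N$ characters, can be defined concretely with a count-based estimate. This is a minimal sketch for illustration only: the paper estimates code lengths with LLMs, whereas the plug-in estimator below is an assumption of this example and becomes badly undersampled as the context length grows.

```python
import math
from collections import Counter


def conditional_entropy(text, n):
    """Plug-in estimate (bits per character) of H(next char | previous n chars)."""
    context_counts = Counter()
    joint_counts = Counter()
    for i in range(n, len(text)):
        context = text[i - n:i]
        context_counts[context] += 1
        joint_counts[(context, text[i])] += 1
    total = sum(joint_counts.values())
    entropy = 0.0
    for (context, _), count in joint_counts.items():
        p_joint = count / total                    # P(context, char)
        p_cond = count / context_counts[context]   # P(char | context)
        entropy -= p_joint * math.log2(p_cond)
    return entropy
```

For the string `"abababab"`, the estimate is 1 bit with no context (`n=0`) and 0 bits with one character of context (`n=1`), since each character is then fully determined by its predecessor.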
The Impact of LLMs on Online News Consumption and Production
Authors: Hangcheng Zhao, Ron Berman
First: 2025-12-31T16:54:29+00:00 · Latest: 2025-12-31T16:54:29+00:00
Abstract
Large language models (LLMs) change how consumers acquire information online; their bots also crawl news publishers' websites for training data and to answer consumer queries; and they provide tools that can lower the cost of content creation. These changes lead to predictions of adverse impact on news publishers in the form of lowered consumer demand, reduced demand for newsroom employees, and an increase in news "slop." Consequently, some publishers strategically responded by blocking LLM access to their websites using the robots.txt file standard.
Using high-frequency granular data, we document four effects related to the predicted shifts in news publishing following the introduction of generative AI (GenAI). First, we find a consistent and moderate decline in traffic to news publishers occurring after August 2024. Second, using a difference-in-differences approach, we find that blocking GenAI bots can have adverse effects on large publishers by reducing total website traffic by 23% and real consumer traffic by 14% compared to not blocking. Third, on the hiring side, we do not find evidence that LLMs are replacing editorial or content-production jobs yet. The share of new editorial and content-production job listings increases over time. Fourth, regarding content production, we find no evidence that large publishers increased text volume; instead, they significantly increased rich content and use more advertising and targeting technologies.
Together, these findings provide early evidence of some unforeseen impacts of the introduction of LLMs on news production and consumption.
中文标题/摘要
标题:大规模语言模型对在线新闻消费和生产的影响
大规模语言模型(LLMs)改变了消费者在线获取信息的方式;它们的机器人还会爬取新闻出版商的网站以获取训练数据并回答消费者的问题;并且它们提供了可以降低内容创作成本的工具。这些变化导致了对新闻出版商的负面影响预测,包括消费者需求降低、新闻室员工需求减少以及新闻质量下降。因此,一些出版商战略性地通过使用robots.txt文件标准阻止LLM访问其网站。利用高频细粒度数据,我们记录了生成式人工智能(GenAI)引入后新闻出版领域预测转变的四个效应。首先,我们发现新闻出版商的流量在2024年8月后持续且适度下降。其次,使用差分差异方法,我们发现阻止GenAI机器人会对大型出版商产生负面影响,导致总网站流量减少23%,实际消费者流量减少14%。第三,在招聘方面,我们没有发现证据表明LLMs正在取代编辑或内容生产岗位。新编辑和内容生产岗位的招聘比例随时间增加。第四,在内容生产方面,我们没有发现大型出版商增加了文本量;相反,它们显著增加了丰富内容的使用,并更多地使用广告和定向技术。这些发现共同提供了大规模语言模型引入对新闻生产和消费的一些未预见影响的早期证据。
Summary / 总结
The study examines the impact of large language models (LLMs) on online news consumption and production. It finds a consistent and moderate decline in traffic to news publishers after August 2024. A difference-in-differences analysis shows that, for large publishers, blocking GenAI bots reduces total website traffic by 23% and real consumer traffic by 14% compared to not blocking. There is no evidence yet that LLMs are replacing editorial or content-production jobs; the share of new listings for such jobs increases over time. There is also no evidence that large publishers increased text volume; instead, they significantly increased rich content and used more advertising and targeting technologies. The findings provide early evidence of some unforeseen impacts of LLMs on news production and consumption.
研究探讨了大型语言模型(LLMs)对在线新闻消费和生产的影响。研究发现,新闻出版商在2024年8月之后的流量出现适度下降,并且阻止LLM访问可以减少大型出版商的总网站流量23%,实际消费者流量减少14%。没有证据表明LLM正在取代编辑或内容生产岗位,但出版商增加了丰富内容的使用,并更多地使用了广告和定向技术。这些发现表明,LLM的引入对新闻生产和消费产生了某些未预见的影响。
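The difference-in-differences comparison behind the blocking result can be sketched in its simplest two-group, two-period form. The numbers below are invented for illustration and the function name is hypothetical; the paper's actual specification (controls, fixed effects, time windows) is not given in the abstract and is not reproduced here.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group's mean minus change in the control group's mean."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Invented traffic figures (thousands of visits), for illustration only.
blockers_before = [100.0, 120.0]
blockers_after = [80.0, 90.0]
non_blockers_before = [100.0, 100.0]
non_blockers_after = [95.0, 105.0]

effect = diff_in_diff(blockers_before, blockers_after,
                      non_blockers_before, non_blockers_after)
```

Subtracting the control group's change removes trends common to both groups, so the estimate attributes to blocking only the part of the traffic change that non-blocking publishers did not also experience. This interpretation relies on the usual parallel-trends assumption.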
ShowUI-$π$: Flow-based Generative Models as GUI Dexterous Hands
Authors: Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
First: 2025-12-31T16:51:14+00:00 · Latest: 2025-12-31T16:51:14+00:00
Comments: 17 pages, 15 figures
Abstract
Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-$π$, the first flow-based generative model as GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-$π$ achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in digital world. The code is available at https://github.com/showlab/showui-pi.
中文标题/摘要
标题:ShowUI-$π$: 基于流的生成模型作为GUI灵巧之手
在机器人技术和数字环境中实现类人自动化的关键在于能够进行灵巧操作的智能代理。然而,现有的GUI代理依赖于离散的点击预测(x,y),这限制了自由形式、闭环轨迹(例如拖动进度条)的实现,这些轨迹需要连续的、实时的感知和调整。在本研究中,我们开发了ShowUI-$π$,这是第一个基于流的生成模型作为GUI灵巧之手,其设计包括:(i) 统一的离散-连续动作,将离散点击和连续拖动整合到一个共享模型中,以适应多种交互模式;(ii) 基于流的动作生成,用于拖动建模,通过轻量级的动作专家从连续的视觉观察中预测增量光标调整,确保平滑和稳定的轨迹;(iii) 拖动训练数据和基准,我们手动收集并合成了跨越五个领域(例如PowerPoint,Adobe Premiere Pro)的20,000条拖动轨迹,并引入了ScreenDrag基准,该基准具有全面的在线和离线评估协议,用于评估GUI代理的拖动能力。我们的实验表明,专有的GUI代理在ScreenDrag上仍然存在困难(例如Operator得分为13.27,而最好的Gemini-2.5-CUA达到22.18)。相比之下,ShowUI-$π$仅使用4.5亿参数就达到了26.98的得分,这突显了任务的难度和我们方法的有效性。我们希望这项工作能够推动GUI代理向数字世界中类人的灵巧控制发展。代码可在https://github.com/showlab/showui-pi/获取。
Semi-overlapping Multi-bandit Best Arm Identification for Sequential Support Network Learning
Authors: András Antos, András Millinghoffer, Péter Antal
First: 2025-12-31T16:42:00+00:00 · Latest: 2025-12-31T16:42:00+00:00
Comments: 29 pages, 2 figures
Abstract
Many modern AI and ML problems require evaluating partners' contributions through shared yet asymmetric, computationally intensive processes and the simultaneous selection of the most beneficial candidates. Sequential approaches to these problems can be unified under a new framework, Sequential Support Network Learning (SSNL), in which the goal is to select the most beneficial candidate set of partners for all participants using trials; that is, to learn a directed graph that represents the highest-performing contributions. We demonstrate that a new pure-exploration model, the semi-overlapping multi-(multi-armed) bandit (SOMMAB), in which a single evaluation provides distinct feedback to multiple bandits due to structural overlap among their arms, can be used to learn a support network from sparse candidate lists efficiently.
We develop a generalized GapE algorithm for SOMMABs and derive new exponential error bounds that improve the best known constant in the exponent for multi-bandit best-arm identification. The bounds scale linearly with the degree of overlap, revealing significant sample-complexity gains arising from shared evaluations.
From an application point of view, this work provides a theoretical foundation and improved performance guarantees for sequential learning tools for identifying support networks from sparse candidates in multiple learning problems, such as in multi-task learning (MTL), auxiliary task learning (ATL), federated learning (FL), and in multi-agent systems (MAS).
中文标题/摘要
标题:半重叠多臂赌博机最佳臂识别在序列支持网络学习中的应用
许多现代AI和ML问题需要通过共享但不对称、计算密集的过程来评估合作伙伴的贡献,并同时选择最有利的候选人。这些问题的序列方法可以统一在一个新的框架下,即序列支持网络学习(SSNL),其目标是通过试验选择所有参与者中最有利的合作伙伴候选集;即学习一个代表最佳贡献的有向图。我们证明了一种新的纯探索模型,即半重叠多(多臂)赌博机(SOMMAB),其中单次评估由于其臂的结构重叠而为多个赌博机提供不同的反馈,可以用于从稀疏候选列表中高效地学习支持网络。
我们为SOMMAB开发了一种通用的GapE算法,并推导出新的指数误差界,改进了多赌博机最佳臂识别中已知的最佳常数。这些界线性地与重叠程度成比例,揭示了由于共享评估而产生的显著样本复杂度增益。
从应用角度来看,这项工作为从稀疏候选者中识别支持网络的序列学习工具提供了理论基础和改进的性能保证,适用于多任务学习(MTL)、辅助任务学习(ATL)、联邦学习(FL)和多智能体系统(MAS)等多种学习问题。
Summary / 总结
The research addresses problems in which partners' contributions must be evaluated through shared yet asymmetric, computationally intensive processes while the most beneficial candidates are selected. It introduces a framework called Sequential Support Network Learning (SSNL) and a pure-exploration model, the semi-overlapping multi-(multi-armed) bandit (SOMMAB), in which a single evaluation gives distinct feedback to multiple bandits, to learn a support network efficiently from sparse candidate lists. The study develops a generalized GapE algorithm for SOMMABs and derives new exponential error bounds that improve the best known constant in the exponent for multi-bandit best-arm identification and scale linearly with the degree of overlap, indicating significant sample-complexity gains from shared evaluations. This work provides theoretical foundations and improved performance guarantees for sequential learning tools in applications such as multi-task learning, auxiliary task learning, federated learning, and multi-agent systems.
研究旨在解决在计算密集型过程中评估合作伙伴贡献和选择有益候选人的挑战。引入了一种新的框架,称为Sequential Support Network Learning (SSNL),以及一种新的模型,称为semi-overlapping multi-(multi-armed) bandit (SOMMAB),以高效地从稀疏候选列表中学习支持网络。研究开发了一种SOMMAB的通用GapE算法,并推导出新的误差界,显示由于共享评估而产生了显著的样本复杂度增益。这项工作为多任务学习、辅助任务学习、联邦学习和多智能体系统等不同应用中的序列学习工具提供了理论基础和改进的性能保证。
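The gap-based exploration idea that GapE builds on can be sketched for ordinary, non-overlapping bandits that share one pull budget. This is a hedged illustration, not the generalized SOMMAB algorithm from the paper: shared feedback across overlapping arms is not modelled, and the `pull` callback, the exploration parameter `a`, and the unit reward range are assumptions of this example.

```python
import math


def gape(pull, n_bandits, n_arms, budget, a=1.0):
    """Gap-based exploration over several bandits sharing one pull budget.

    pull(m, k) returns a reward in [0, 1] for arm k of bandit m.
    Assumes n_arms >= 2 and budget >= n_bandits * n_arms.
    Returns the recommended (empirically best) arm of each bandit.
    """
    counts = [[0] * n_arms for _ in range(n_bandits)]
    sums = [[0.0] * n_arms for _ in range(n_bandits)]

    def play(m, k):
        sums[m][k] += pull(m, k)
        counts[m][k] += 1

    # Initialization: pull every arm of every bandit once.
    for m in range(n_bandits):
        for k in range(n_arms):
            play(m, k)

    for _ in range(budget - n_bandits * n_arms):
        best_index, best_pair = -math.inf, None
        for m in range(n_bandits):
            means = [sums[m][k] / counts[m][k] for k in range(n_arms)]
            for k in range(n_arms):
                # Estimated gap: distance to the best competing arm in the same bandit.
                rival = max(means[j] for j in range(n_arms) if j != k)
                gap = abs(rival - means[k])
                # Small gaps and rarely pulled arms both raise the index.
                index = -gap + math.sqrt(a / counts[m][k])
                if index > best_index:
                    best_index, best_pair = index, (m, k)
        play(*best_pair)

    return [max(range(n_arms), key=lambda k: sums[m][k] / counts[m][k])
            for m in range(n_bandits)]
```

The index sends pulls toward bandits whose leading arms are hard to tell apart, so easy bandits consume little of the shared budget. In the semi-overlapping setting described in the abstract, a single evaluation would update several bandits at once, which is where the reported sample-complexity gains come from.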