arXiv Paper Digest / arXiv 论文速递

2026-02-12 04:00
Snapshot: 20260212_0400
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Authors: Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu
Venue: ICML 2026 (under review)
First: 2026-02-10T18:59:56+00:00 · Latest: 2026-02-10T18:59:56+00:00
Comments: 10 pages, Under review at ICML 2026
Abstract
Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these *unverbalized biases*. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model's CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.
中文标题/摘要
标题:盲点中的偏差:检测LLMs未提及的内容
大型语言模型(LLMs)通常提供链式推理(CoT)推导痕迹,看似合理,但可能隐藏内部偏差。我们称这些为*未言明的偏差*。通过其陈述的推理来监控模型是不可靠的,现有的偏差评估通常需要预定义的类别和手工制作的数据集。在本研究中,我们引入了一个完全自动化的、黑盒的流程来检测任务特定的未言明偏差。给定一个任务数据集,该流程使用LLM自动评分器生成候选偏差概念。然后,通过生成正向和负向变化,在逐步增大的输入样本上测试每个概念,并应用多重测试和早期停止的统计技术。如果一个概念在统计上显著影响性能,但未在模型的CoTs中被提及作为理由,则该概念被标记为未言明的偏差。我们在六种LLM上对三个决策任务(招聘、贷款审批和大学入学)进行了评估。我们的技术自动发现了这些模型中以前未知的偏差(如西班牙语流利度、英语熟练度、写作正式度)。在同一运行中,该流程还验证了先前工作手动识别的偏差(性别、种族、宗教、族裔)。更广泛地说,我们提出的方法提供了一条实用、可扩展的自动任务特定偏差发现途径。
Summary / 总结
This study addresses the issue of unverbalized biases in Large Language Models (LLMs) by introducing an automated pipeline to detect task-specific biases. The method uses LLM autoraters to generate candidate bias concepts and tests them on progressively larger input samples. The pipeline flags concepts as unverbalized biases if they show statistically significant performance differences without being mentioned in the model's chain-of-thought reasoning. The technique was evaluated on six LLMs for three decision tasks, discovering previously unknown biases such as Spanish fluency and validating known biases like gender and race. This approach offers a practical and scalable method for automatic bias discovery.
该研究通过引入一个自动化管道来检测大型语言模型(LLMs)中未明确表述的偏见,该管道利用LLM自动评分器生成候选偏见概念,并通过输入样本进行测试,应用统计技术识别显著性能差异。研究在六种LLM上对三种决策任务进行了评估,发现了诸如西班牙语流利度和英语熟练度等未知偏见,同时也验证了性别、种族等已知偏见。该方法为LLM中的自动偏见检测提供了一种实用且可扩展的解决方案。
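The flagging rule described above (a statistically significant effect that is not cited in the CoT) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not name the specific tests, so the exact McNemar test and the Holm-Bonferroni correction are assumptions, as are the function names and the `cited_in_cot` field. Early stopping over progressively larger samples is not covered.

```python
from math import comb

def mcnemar_exact_p(pos, neg):
    """Two-sided exact McNemar p-value for paired binary decisions (assumed test)."""
    b = sum(1 for p, n in zip(pos, neg) if p == 1 and n == 0)  # accepted only with concept
    c = sum(1 for p, n in zip(pos, neg) if p == 0 and n == 1)  # accepted only without it
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction; returns one reject flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one hypothesis survives, all larger p-values survive too
    return reject

def flag_unverbalized(concepts, alpha=0.05):
    """Flag concepts that shift decisions significantly but are never cited in the CoT."""
    names = list(concepts)
    pvals = [mcnemar_exact_p(concepts[n]["pos"], concepts[n]["neg"]) for n in names]
    rejected = holm_reject(pvals, alpha)
    return [n for n, r in zip(names, rejected) if r and not concepts[n]["cited_in_cot"]]

# Hypothetical data: 1 = accept, 0 = reject, one pair per input sample.
example = {
    "writing_formality": {"pos": [1] * 12, "neg": [0] * 12, "cited_in_cot": False},
    "font_choice": {"pos": [1, 0] * 6, "neg": [0, 1] * 6, "cited_in_cot": False},
    "gender": {"pos": [1] * 12, "neg": [0] * 12, "cited_in_cot": True},
}
flagged = flag_unverbalized(example)
```

In this toy data, `gender` has a significant effect but is cited in the reasoning, so it counts as a verbalized factor and is not flagged.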
SAGE: Scalable Agentic 3D Scene Generation for Embodied AI
Authors: Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, Fangyin Wei
First: 2026-02-10T18:59:55+00:00 · Latest: 2026-02-10T18:59:55+00:00
Comments: Project Page: https://nvlabs.github.io/sage
Abstract
Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: https://nvlabs.github.io/sage.
中文标题/摘要
标题:SAGE:可扩展的代理3D场景生成用于具身AI
现实世界的数据收集对具身代理来说仍然代价高昂且不安全,因此需要可扩展、逼真且可模拟的3D环境。然而,现有的场景生成系统往往依赖于基于规则或特定任务的管道,导致生成不自然且物理上无效的场景。我们提出了SAGE,这是一种代理框架,给定用户指定的具身任务(例如,“拿起一个碗并把它放在桌子上”),它能够理解意图并自动大规模生成可模拟的环境。代理结合了用于布局和对象组合的多个生成器以及评估语义合理性、视觉逼真性和物理稳定性的批评者。通过迭代推理和适应性工具选择,它不断自我完善场景,直到满足用户意图和物理有效性。生成的环境是逼真、多样且可以直接部署在现代模拟器中进行策略训练的。仅基于此数据训练的策略表现出明显的扩展趋势,并且能够泛化到未见过的对象和布局,展示了基于模拟的扩展对于具身AI的潜力。项目代码、演示和SAGE-10k数据集可以在项目页面上找到:https://nvlabs.github.io/sage
Summary / 总结
SAGE is an agentic framework that, given a user-specified embodied task, generates realistic, simulator-ready 3D environments at scale for embodied AI. It couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability, and iteratively refines each scene until it meets user intent and physical validity. The resulting environments are diverse and directly deployable in modern simulators, and policies trained purely on this data show clear scaling trends and generalize to unseen objects and layouts.
SAGE 是一个自动框架,用于生成可扩展、逼真且可以直接部署在现代模拟器中的 3D 环境。给定用户指定的任务,SAGE 通过迭代细化场景,确保语义合理性、视觉逼真性和物理稳定性。生成的环境多样且可以直接部署在现代模拟器中,训练出的策略表现出明显的扩展趋势,并能很好地泛化到未见过的对象和布局中。
ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation
Authors: Mingyang Wu, Ashirbad Mishra, Soumik Dey, Shuo Xing, Naveen Ravipati, Hansi Wu, Binbin Li, Zhengzhong Tu
First: 2026-02-10T18:59:51+00:00 · Latest: 2026-02-10T18:59:51+00:00
Comments: Project page: https://myangwu.github.io/ConsID-Gen
Abstract
Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms in multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.
中文标题/摘要
标题:ConsID-Gen: 视图一致且身份保留的图像到视频生成
图像到视频生成(I2V)根据文本指令将静态图像动画化为时间上连贯的视频序列,同时在视角变化下保持细粒度对象身份,但这一挑战仍然存在。与文本到视频模型不同,现有的I2V管道经常遭受外观漂移和几何失真等缺陷,我们认为这些缺陷源于单视角2D观察的稀疏性和跨模态对齐的薄弱性。我们从数据和模型两个方面来解决这个问题。首先,我们构建了ConsIDVid,这是一个大规模的对象中心数据集,使用可扩展的管道构建高质量、时间对齐的视频,并建立了ConsIDVid-Bench,其中我们提出了一种新的多视图一致性基准测试和评估框架,使用对细微几何和外观偏差敏感的度量标准。我们进一步提出了ConsID-Gen,这是一种视图辅助的I2V生成框架,通过未摆姿势的辅助视图增强第一帧,并通过双流视觉-几何编码器以及文本-视觉连接器融合语义和结构线索,为扩散变换器骨干网络提供统一的条件。在ConsIDVid-Bench上的实验表明,ConsID-Gen在多个指标上表现一致优于其他模型,整体性能最佳,超越了如Wan2.1和HunyuanVideo等领先视频生成模型,在具有挑战性的现实场景中提供更好的身份保真度和时间连贯性。我们将在https://myangwu.github.io/ConsID-Gen/发布我们的模型和数据集。
Summary / 总结
The research addresses the challenge of preserving fine-grained object identity in image-to-video generation while maintaining temporal coherence. The method involves curating ConsIDVid, a large-scale dataset with temporally aligned videos, and proposing ConsID-Gen, a view-assisted generation framework. ConsID-Gen uses a dual-stream visual-geometric encoder and a text-visual connector to condition a Diffusion Transformer, effectively reducing appearance drift and geometric distortion. Experiments show that ConsID-Gen outperforms existing models like Wan2.1 and HunyuanVideo in multiple metrics, particularly in identity fidelity and temporal coherence.
研究旨在解决图像到视频生成中保持物体身份和几何一致性的挑战。提出了ConsID-Gen框架,该框架使用双流视觉几何编码器和文本视觉连接器来条件化扩散变换器。实验表明,ConsID-Gen在多个指标上优于现有模型如Wan2.1和HunyuanVideo,实现了更好的身份保真度和时间连贯性。研究还介绍了面向物体的大型视频数据集ConsIDVid和用于评估多视图一致性的ConsIDVid-Bench基准。
Olaf-World: Orienting Latent Actions for Video World Modeling
Authors: Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
First: 2026-02-10T18:58:41+00:00 · Latest: 2026-02-10T18:58:41+00:00
Comments: Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World
Abstract
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$Δ$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.
中文标题/摘要
标题:Olaf-World:为视频世界建模定向潜在动作
扩展可控制的世界模型受到动作标签稀缺性的限制。虽然潜在动作学习有望从未标记的视频中提取控制接口,但学习到的潜在动作往往无法在不同上下文中转移:它们会纠缠场景特定的线索,缺乏共享的坐标系统。这是因为标准目标仅在每个片段内操作,无法提供在不同上下文中对齐动作语义的机制。我们的关键见解是,尽管动作未被观察到,但它们的语义效果是可以观察到的,并可以作为共享参考。我们引入了Seq$Δ$-REPA,这是一种序列级控制效果对齐目标,将集成的潜在动作锚定到冻结的自监督视频编码器的时序特征差异上。在此基础上,我们提出了Olaf-World,一种从大规模被动视频中预训练动作条件化视频世界模型的流水线。广泛的实验表明,我们的方法学习到一个更结构化的潜在动作空间,导致更强的零样本动作转移和更高效的数据适应新的控制接口,优于最先进的基线。
Summary / 总结
The research addresses the challenge of scaling action-controllable world models by introducing Seq$Δ$-REPA, a sequence-level control-effect alignment objective. This method aligns latent actions to temporal feature differences from a frozen, self-supervised video encoder, enabling better transfer of actions across contexts. Experiments show that Olaf-World, a pipeline using this approach, learns a more structured latent action space, improving zero-shot action transfer and data efficiency compared to existing methods.
研究旨在通过利用未标注视频数据中的潜在动作学习来解决动作可控世界模型的扩展问题。方法引入了Seq$Δ$-REPA,这是一种序列级控制效果对齐目标,通过使用可观察的语义效果在不同上下文之间对齐潜在动作。实验表明,这种方法能够构建一个更结构化的潜在动作空间,从而实现更好的零样本动作转移和更高效的对新控制界面的适应,优于现有方法。
VideoWorld 2: Learning Transferable Knowledge from Real-world Videos
Authors: Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin
First: 2026-02-10T18:58:19+00:00 · Latest: 2026-02-10T18:58:19+00:00
Comments: Code and models are released at: https://maverickren.github.io/VideoWorld2.github.io/
Abstract
Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.
中文标题/摘要
标题:VideoWorld 2:从真实世界视频中学习可迁移的知识
从未标记的视频数据中学习可迁移的知识并在新环境中应用是智能代理的一项基本能力。本研究介绍了VideoWorld 2,它扩展了VideoWorld并首次直接从原始真实世界视频中学习可迁移的知识。核心上,VideoWorld 2引入了一种动态增强的潜在动力学模型(dLDM),将动作动力学与视觉外观分离:预训练的视频扩散模型处理视觉外观建模,使dLDM能够学习专注于紧凑且有意义的任务相关动力学的潜在代码。这些潜在代码随后被自回归建模以学习任务策略并支持长期推理。我们在具有挑战性的实际手工艺任务上评估了VideoWorld 2,其中先前的视频生成和潜在动力学模型难以可靠地运行。令人惊讶的是,VideoWorld 2在任务成功率上提高了高达70%,并生成了连贯的长执行视频。在机器人学中,我们展示了VideoWorld 2可以从Open-X数据集中获得有效的操作知识,这显著提高了CALVIN上的任务性能。本研究揭示了直接从原始视频中学习可迁移的世界知识的潜力,所有代码、数据和模型将开源供进一步研究。
Summary / 总结
This work introduces VideoWorld 2, which extends VideoWorld to learn transferable knowledge directly from raw real-world videos. It uses a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles appearance, so the latent codes can focus on compact task-related dynamics. On challenging real-world handcraft making tasks, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, manipulation knowledge acquired from the Open-X dataset substantially improves task performance on CALVIN.
VideoWorld 2旨在从原始的现实世界视频中学习可转移的知识并应用于新环境。它引入了一种动态增强的潜在动力模型(dLDM),将动作动力学与视觉外观分离,使模型能够专注于任务相关的动力学。该模型在复杂的现实世界手工艺任务和机器人学中显著提高了任务成功率,最高可达70%的提升,并生成了连贯的长执行视频。它还展示了从Open-X数据集中获取有效的操作知识的能力,提高了CALVIN上的任务性能。
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
Authors: Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen
First: 2026-02-10T18:58:01+00:00 · Latest: 2026-02-10T18:58:01+00:00
Abstract
Pretraining Vision-Language-Action (VLA) policies on internet-scale video is appealing, yet current latent-action objectives often learn the wrong thing: they remain anchored to pixel variation rather than action-relevant state transitions, making them vulnerable to appearance bias, nuisance motion, and information leakage. We introduce VLA-JEPA, a JEPA-style pretraining framework that sidesteps these pitfalls by design. The key idea is *leakage-free state prediction*: a target encoder produces latent representations from future frames, while the student pathway sees only the current observation -- future information is used solely as supervision targets, never as input. By predicting in latent space rather than pixel space, VLA-JEPA learns dynamics abstractions that are robust to camera motion and irrelevant background changes. This yields a simple two-stage recipe -- JEPA pretraining followed by action-head fine-tuning -- without the multi-stage complexity of prior latent-action pipelines. Experiments on LIBERO, LIBERO-Plus, SimplerEnv and real-world manipulation tasks show that VLA-JEPA achieves consistent gains in generalization and robustness over existing methods.
中文标题/摘要
标题:VLA-JEPA: 使用潜在世界模型增强视觉-语言-动作模型
在互联网规模的视频上预训练视觉-语言-动作(VLA)策略是诱人的,但当前的潜在动作目标往往学到错误的东西:它们仍然锚定在像素变化上,而不是动作相关的状态转换,这使它们容易受到外观偏差、无关运动和信息泄露的影响。我们引入了VLA-JEPA,这是一种JEPA风格的预训练框架,通过设计避免了这些陷阱。关键思想是“无泄漏状态预测”:目标编码器从未来帧中生成潜在表示,而学生路径仅看到当前观察——未来信息仅用作监督目标,从不作为输入。通过在潜在空间而不是像素空间中预测,VLA-JEPA学会了对摄像机运动和无关背景变化具有鲁棒性的动力学抽象。这提供了一个简单的两阶段食谱——JEPA预训练后跟动作头微调——而无需先前潜在动作管道的多阶段复杂性。在LIBERO、LIBERO-Plus、SimplerEnv和真实世界的操作任务上的实验表明,VLA-JEPA在现有方法上实现了一致的泛化和鲁棒性提升。
Summary / 总结
The research aims to improve VLA pretraining by addressing the limitations of current latent-action objectives, which stay anchored to pixel variation and are therefore prone to appearance bias, nuisance motion, and information leakage. VLA-JEPA introduces leakage-free state prediction: a target encoder produces latent representations from future frames, which serve only as supervision targets, while the student pathway sees only the current observation. Experiments on LIBERO, LIBERO-Plus, SimplerEnv, and real-world manipulation tasks show consistent gains in generalization and robustness over existing methods.
研究旨在通过解决当前潜动作目标的局限性,如外观偏差和干扰运动问题,来提升VLA策略。VLA-JEPA提出了一种无泄漏状态预测方法,目标编码器基于未来帧预测未来的潜空间表示,而学生路径仅观察当前帧。这种方法增强了鲁棒性和泛化能力,实验结果表明,在多个数据集和真实世界操作任务上,VLA-JEPA相比现有方法有持续的改进。
Step-resolved data attribution for looped transformers
Authors: Georgios Kaissis, David Mildenberger, Juan Felipe Gomez, Martin J. Menten, Eleni Triantafillou
First: 2026-02-10T18:57:53+00:00 · Latest: 2026-02-10T18:57:53+00:00
Abstract
We study how individual training examples shape the internal computation of looped transformers, where a shared block is applied for τ recurrent iterations to enable latent reasoning. Existing training-data influence estimators such as TracIn yield a single scalar score that aggregates over all loop iterations, obscuring when during the recurrent computation a training example matters. We introduce *Step-Decomposed Influence (SDI)*, which decomposes TracIn into a length-τ influence trajectory by unrolling the recurrent computation graph and attributing influence to specific loop iterations. To make SDI practical at transformer scale, we propose a TensorSketch implementation that never materialises per-example gradients. Experiments on looped GPT-style models and algorithmic reasoning tasks show that SDI scales excellently, matches full-gradient baselines with low error and supports a broad range of data attribution and interpretability tasks with per-step insights into the latent reasoning process.
中文标题/摘要
标题:循环变压器的分步数据归因
我们研究单个训练样本如何塑造循环变压器的内部计算,其中共享块经过τ次递归迭代应用以实现潜在推理。现有的训练数据影响估计器如TracIn提供一个单一的标量分数,该分数在整个递归迭代中聚合,模糊了训练样本在递归计算中的重要时间点。我们引入了“分步分解影响(SDI)”,通过展开递归计算图并将影响归因于特定的循环迭代来分解TracIn,从而生成长度为τ的影响轨迹。为了使SDI在变压器规模下实用,我们提出了一种TensorSketch实现,该实现从未实现单个样本的梯度。实验表明,SDI在循环GPT风格模型和算法推理任务中表现出色,与全梯度基线相比误差较低,并且能够支持广泛的基于数据的归因和可解释性任务,提供每个步骤对潜在推理过程的洞察。
Summary / 总结
The research aims to understand how individual training examples impact the internal computations of looped transformers, which use a shared block for recurrent iterations. The authors introduce Step-Decomposed Influence (SDI), which decomposes the influence of training examples into a trajectory over the recurrent iterations, providing insights into when during the computation a training example matters. Experiments show that SDI scales well, matches full-gradient baselines with low error, and supports various interpretability tasks by offering per-step insights into the latent reasoning process.
研究探讨了个体训练样本如何影响循环变压器的内部计算,循环变压器通过反复应用共享块来实现潜在推理。作者引入了步分解影响(SDI),将训练样本的影响分解为循环迭代过程中的轨迹,揭示了训练样本在循环计算过程中何时起作用。实验表明,SDI在循环GPT风格模型和算法推理任务中表现出色,与全梯度基线匹配,具有较低的误差,并支持各种解释性任务,提供了关于潜在推理过程的逐步见解。
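The idea of splitting a TracIn-style score across loop iterations can be illustrated on a toy scalar "looped" model. This is only a sketch under stated assumptions: the model `h <- tanh(w * h)`, the squared loss, the single-checkpoint TracIn form `lr * g_train * g_test`, and the choice to split the training-side gradient by iteration are all assumptions made here for illustration, and the TensorSketch approximation from the abstract is not shown.

```python
import math

def forward(w, x, tau):
    """Apply the shared block h <- tanh(w * h) for tau loop iterations."""
    hs = [x]
    for _ in range(tau):
        hs.append(math.tanh(w * hs[-1]))
    return hs

def loss(w, x, y, tau):
    """Squared error on the final loop state."""
    return 0.5 * (forward(w, x, tau)[-1] - y) ** 2

def per_step_grads(w, x, y, tau):
    """Gradient of the loss w.r.t. w, split by the loop iteration that used w."""
    hs = forward(w, x, tau)
    delta = hs[-1] - y                      # dL/dh_tau
    grads = [0.0] * tau
    for t in reversed(range(tau)):
        local = 1.0 - hs[t + 1] ** 2        # derivative of tanh at iteration t
        grads[t] = delta * local * hs[t]    # contribution of w as used at iteration t
        delta = delta * local * w           # backpropagate to dL/dh_t
    return grads

def step_decomposed_influence(w, train, test, tau, lr=0.1):
    """Length-tau trajectory whose entries sum to the scalar TracIn-style score."""
    g_test = sum(per_step_grads(w, *test, tau))
    return [lr * g_t * g_test for g_t in per_step_grads(w, *train, tau)]

# Hypothetical (input, target) pairs for one training and one test example.
trajectory = step_decomposed_influence(w=0.8, train=(0.5, 0.2), test=(0.4, 0.1), tau=4)
```

Because the shared weight appears once per iteration, the per-iteration terms add up exactly to the full gradient, so the trajectory sums to the aggregate score that a scalar estimator would report.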
Causality in Video Diffusers is Separable from Denoising
Authors: Xingjian Bai, Guande He, Zhengqi Li, Eli Shechtman, Xun Huang, Zongze Wu
First: 2026-02-10T18:57:21+00:00 · Latest: 2026-02-10T18:57:21+00:00
Abstract
Causality -- referring to temporal, uni-directional cause-effect relationships between components -- underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
中文标题/摘要
标题:视频去噪器中的因果性可与去噪分离
因果性——指时间上单向的原因-结果关系——构成了许多复杂生成过程的基础,包括视频、语言和机器人轨迹。当前的因果扩散模型将时间推理与迭代去噪交织在一起,在所有层、每次去噪步骤以及整个上下文中应用因果注意力。在本文中,我们展示了这些模型中的因果推理可以与多步去噪过程分离。通过对自回归视频去噪器的系统探查,我们发现了两个关键规律:(1) 早期层在去噪步骤中产生高度相似的特征,表明沿扩散轨迹的冗余计算;(2) 深层层表现出稀疏的跨帧注意力,并主要执行帧内渲染。受这些发现的启发,我们引入了可分离因果扩散(SCD),这是一种新的架构,通过因果变换器编码器明确地将每帧一次的时间推理与多步帧间渲染通过轻量级扩散解码器分离。在合成和真实基准的预训练和后训练任务上的广泛实验表明,SCD 在提高吞吐量和每帧延迟的同时,能够匹配或超越强因果扩散基线的生成质量。
Summary / 总结
This paper shows that causal reasoning in video diffusers is separable from multi-step denoising. Probing autoregressive video diffusers, the authors find that early layers produce highly similar features across denoising steps, indicating redundant computation, while deeper layers exhibit sparse cross-frame attention and mainly perform intra-frame rendering. Motivated by these findings, they propose Separable Causal Diffusion (SCD), which decouples once-per-frame temporal reasoning (a causal transformer encoder) from multi-step frame-wise rendering (a lightweight diffusion decoder). Experiments show that SCD improves throughput and reduces per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.
该论文研究因果扩散模型中因果推理与去噪的可分离性。通过对自回归视频扩散器进行探查,作者发现早期层执行冗余计算,而深层层主要进行帧内渲染。受此启发,他们提出了分离因果扩散(SCD),该模型将每帧的因果推理与多步帧间渲染分离。实验表明,SCD在提高吞吐量和每帧延迟的同时,能够保持或超越现有因果扩散基线的生成质量。
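A rough way to see why the separation can help throughput is to count layer evaluations per video. The sketch below is a back-of-the-envelope model, not a measurement from the paper: it assumes cost is proportional to the number of layer passes, ignores attention context length and caching, and all layer counts and step counts are hypothetical.

```python
def entangled_layer_passes(frames, steps, layers):
    """Every layer runs at every denoising step for every frame."""
    return frames * steps * layers

def separable_layer_passes(frames, steps, enc_layers, dec_layers):
    """Encoder runs once per frame; only the light decoder repeats per denoising step."""
    return frames * (enc_layers + steps * dec_layers)

# Hypothetical configuration: 16 frames, 4 denoising steps, 24 layers in total,
# split into a 20-layer causal encoder and a 4-layer diffusion decoder.
baseline = entangled_layer_passes(frames=16, steps=4, layers=24)
separable = separable_layer_passes(frames=16, steps=4, enc_layers=20, dec_layers=4)
ratio = baseline / separable
```

Under these assumed numbers the separable design needs fewer layer passes; the actual speedup depends on the real architecture and hardware.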
Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning
Authors: Hengwei Zhao, Zhengzhong Tu, Zhuo Zheng, Wei Wang, Junjue Wang, Rusty Feagin, Wenzhe Jiao
Venue: ICLR 2026
First: 2025-09-30T18:22:30+00:00 · Latest: 2026-02-10T18:56:09+00:00
Comments: Published at ICLR 2026
Abstract
Positive-Unlabeled (PU) learning aims to train a binary classifier (positive vs. negative) where only limited positive data and abundant unlabeled data are available. While widely applicable, state-of-the-art PU learning methods substantially underperform their supervised counterparts on complex datasets, especially without auxiliary negatives or pre-estimated parameters (e.g., a 14.26% gap on CIFAR-100 dataset). We identify the primary bottleneck as the challenge of learning discriminative representations under unreliable supervision. To tackle this challenge, we propose NcPU, a non-contrastive PU learning framework that requires no auxiliary information. NcPU combines a noisy-pair robust supervised non-contrastive loss (NoiSNCL), which aligns intra-class representations despite unreliable supervision, with a phantom label disambiguation (PLD) scheme that supplies conservative negative supervision via regret-based label updates. Theoretically, NoiSNCL and PLD can iteratively benefit each other from the perspective of the Expectation-Maximization framework. Empirically, extensive experiments demonstrate that: (1) NoiSNCL enables simple PU methods to achieve competitive performance; and (2) NcPU achieves substantial improvements over state-of-the-art PU methods across diverse datasets, including challenging datasets on post-disaster building damage mapping, highlighting its promise for real-world applications. Code: Code will be open-sourced after review.
中文标题/摘要
标题:噪声对的鲁棒表示对齐在正未标学习中的应用
正未标(PU)学习旨在训练二元分类器(正样本 vs. 负样本),其中仅可获得有限的正样本数据和大量的未标数据。尽管应用广泛,但最先进的PU学习方法在复杂数据集上的表现远逊于监督学习方法,尤其是在没有辅助负样本或预估参数的情况下(例如,在CIFAR-100数据集上存在14.26%的差距)。我们发现主要瓶颈在于在不可靠监督下学习判别性表示的挑战。为应对这一挑战,我们提出了NcPU,这是一种无需辅助信息的非对比PU学习框架。NcPU结合了噪声对鲁棒的非对比监督损失(NoiSNCL),该损失能够在不可靠监督下对类内表示进行对齐,以及一种幻影标签消歧(PLD)方案,该方案通过基于后悔的标签更新提供保守的负监督。理论上,NoiSNCL和PLD可以从期望最大化框架的角度相互受益。实验上,广泛的实验表明:(1)NoiSNCL使简单的PU方法能够达到竞争性性能;(2)NcPU在多种数据集上显著优于最先进的PU方法,包括在灾后建筑损坏映射等具有挑战性的数据集上,突显了其在实际应用中的潜力。代码:代码将在审查后开源。
Summary / 总结
The paper addresses Positive-Unlabeled (PU) learning, where only limited positive data and abundant unlabeled data are available, and where state-of-the-art PU methods still trail supervised training on complex datasets (e.g., a 14.26% gap on CIFAR-100). It proposes NcPU, a non-contrastive PU learning framework that requires no auxiliary information. NcPU pairs a noisy-pair robust supervised non-contrastive loss (NoiSNCL), which aligns intra-class representations, with a phantom label disambiguation (PLD) scheme that provides conservative negative supervision. Experiments show that NoiSNCL lets simple PU methods reach competitive performance and that NcPU substantially outperforms state-of-the-art PU methods across diverse datasets, including post-disaster building damage mapping.
论文针对只有有限正样本和大量未标记样本的Positive-Unlabeled (PU) 学习问题,提出了一种非对比的PU学习框架NcPU,该框架结合了噪声对的鲁棒监督非对比损失(NoiSNCL)和幻影标签消歧(PLD)方案,以对齐类内表示并提供保守的负样本监督。实验表明,NoiSNCL使简单的PU方法能够实现竞争力的表现,而NcPU在各种数据集上显著优于现有方法,包括挑战性的数据集如灾后建筑损坏评估。
Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing
Authors: Mohamed Afane, Kayla Laufer, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, Juntao Chen
First: 2026-02-10T18:56:04+00:00 · Latest: 2026-02-10T18:56:04+00:00
Comments: 18 pages
Abstract
Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, their understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.
中文标题/摘要
标题:量子审计:评估LLMs在量子计算推理能力
语言模型已成为量子计算教育和研究的实际工具,从总结技术论文到解释理论概念和回答领域内最新发展的问题。虽然现有的基准测试评估了量子代码生成和电路设计,但对量子计算概念的理解尚未系统性地进行测量。量子审计通过涵盖2700个问题来填补这一空白,这些问题覆盖了核心的量子计算主题。我们评估了来自领先组织的26个模型。我们的基准测试包括1000个专家编写的问题,1000个使用LLM从研究论文中提取并由专家验证的问题,以及额外的700个问题,包括350个开放性问题和350个包含错误前提的问题,以测试模型是否能纠正错误假设。人类参与者得分在23%到86%之间,专家平均得分为74%。表现最佳的模型超过了专家平均水平,Claude Opus 4.5达到84%的准确率,尽管顶级模型在专家编写的问题上的准确率平均比LLM生成的问题低12个百分点。在更高级的主题上,性能进一步下降,安全问题上的准确率仅为73%。此外,模型经常接受并强化了问题中嵌入的错误前提,而不是识别它们,在这些关键推理任务上的准确率低于66%。
Summary / 总结
Quantum-Audit evaluates 26 language models on quantum computing concepts using 2,700 questions: 1,000 expert-written, 1,000 extracted from research papers by LLMs and validated by experts, 350 open-ended, and 350 built on false premises. Human participants scored between 23% and 86%, with experts averaging 74%. Top models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, but top models lost an average of 12 points on expert-written questions relative to LLM-generated ones, and accuracy fell to 73% on security questions. Models also frequently accepted false premises rather than identifying them, with accuracy below 66% on those questions.
Quantum-Audit 使用涵盖核心主题的 2,700 个问题评估了 26 个语言模型在量子计算概念上的推理能力。基准包括专家编写的问题、LLM 提取并由专家验证的问题,以及包含错误前提的问题。模型在 LLM 提取的问题上表现优于专家编写的问题,顶级模型的准确率达到 84%,但在安全问题上的准确率下降到 73%。模型经常接受并强化了问题中的错误前提,这些关键推理任务的准确率低于 66%。
Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
Authors: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
First: 2026-02-10T18:55:41+00:00 · Latest: 2026-02-10T18:55:41+00:00
Comments: 41 pages
Abstract
Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
中文标题/摘要
标题:代理世界模型:无限合成环境下的自主强化学习
大型语言模型(LLM)的最新进展使自主代理能够执行需要与工具和环境进行多轮交互的复杂任务。然而,这种代理训练的扩展受到缺乏多样性和可靠环境的限制。在本文中,我们提出了代理世界模型(AWM),这是一种完全合成环境生成流水线。使用此流水线,我们扩展到涵盖日常场景的1,000个环境,在这些环境中,代理可以与丰富的工具集(每个环境平均35种工具)进行交互并获得高质量的观察结果。值得注意的是,这些环境是通过代码驱动并依托数据库的,提供了比LLM模拟环境更可靠和一致的状态转换。此外,它们相比从现实环境中收集轨迹,使代理交互更加高效。为了展示此资源的有效性,我们对多轮工具使用代理进行了大规模强化学习。得益于完全可执行的环境和可访问的数据库状态,我们还可以设计可靠的奖励函数。在三个基准上的实验表明,仅在合成环境中训练,而不是特定基准环境,可以实现更强的分布外泛化。代码可在https://github.com/Snowflake-Labs/agent-world-model/ 获取。
Summary / 总结
The research aims to address the challenge of scaling agent training by proposing Agent World Model (AWM), a synthetic environment generation pipeline that creates 1,000 diverse and reliable environments for autonomous agents to interact with rich toolsets. Key experimental findings show that training exclusively in these synthetic environments leads to strong out-of-distribution generalization on three benchmarks, demonstrating the effectiveness of AWM in enhancing agent performance. The code is available on GitHub.
研究旨在通过提出Agent World Model (AWM) 合成环境生成管道,解决自主代理训练规模化的挑战,该管道创建了1,000个多样且可靠的环境,供代理与丰富的工具集互动。关键实验结果表明,仅在这些合成环境中训练,能在三个基准测试上实现强大的跨分布泛化,证明了AWM在提升代理性能方面的有效性。代码可在GitHub上获取。
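The notion of a code-driven, database-backed environment with a reward computed from database state can be sketched with the standard-library `sqlite3` module. Everything below is a hypothetical toy (the `TodoEnv` class, its three tools, and the task checked by `reward` are not taken from the AWM codebase); it only illustrates why such state transitions are deterministic and why the reward can be verified by querying the final state.

```python
import sqlite3

class TodoEnv:
    """Toy environment whose entire state lives in an in-memory SQLite database."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE todos (id INTEGER PRIMARY KEY, title TEXT, done INTEGER)")
        self.db.execute("INSERT INTO todos (title, done) VALUES ('buy milk', 0)")
        self.db.commit()

    def list_todos(self):
        rows = self.db.execute("SELECT id, title, done FROM todos ORDER BY id").fetchall()
        return [{"id": r[0], "title": r[1], "done": bool(r[2])} for r in rows]

    def add_todo(self, title):
        cur = self.db.execute("INSERT INTO todos (title, done) VALUES (?, 0)", (title,))
        self.db.commit()
        return {"id": cur.lastrowid}

    def complete_todo(self, todo_id):
        cur = self.db.execute("UPDATE todos SET done = 1 WHERE id = ?", (todo_id,))
        self.db.commit()
        return {"updated": cur.rowcount}

    def step(self, tool, **args):
        """Dispatch one tool call and return its observation."""
        tools = {"list_todos": self.list_todos,
                 "add_todo": self.add_todo,
                 "complete_todo": self.complete_todo}
        if tool not in tools:
            return {"error": f"unknown tool: {tool}"}
        return tools[tool](**args)

def reward(env):
    """Hypothetical task: a 'call mom' todo exists and every todo is marked done."""
    total, done = env.db.execute(
        "SELECT COUNT(*), COALESCE(SUM(done), 0) FROM todos").fetchone()
    has_target = env.db.execute(
        "SELECT COUNT(*) FROM todos WHERE title = 'call mom'").fetchone()[0] > 0
    return 1.0 if has_target and total == done else 0.0
```

A multi-turn episode is then a sequence of `env.step(...)` calls followed by a single `reward(env)` check, which depends only on the database contents and not on how the agent described its actions.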
CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs
Authors: Richard Bornemann, Pierluigi Vito Amadori, Antoine Cully
First: 2026-02-10T18:51:39+00:00 · Latest: 2026-02-10T18:51:39+00:00
Comments: Preprint
Abstract
Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average. We will open-source our code and provide additional videos at https://sites.google.com/view/code-sharp/homepage.
中文标题/摘要
标题:CODE-SHARP:连续开放发现和演化技能作为层次奖励程序
开发能够在开放环境中不断发现和学习新技能的代理是人工智能领域的重大挑战。虽然强化学习为训练代理掌握复杂技能提供了强大的框架,但它通常依赖于手工设计的奖励函数。对于开放技能发现而言,由于有意义的技能集在先验未知,这种方法是不可行的。虽然最近的方法在自动设计奖励函数方面取得了令人鼓舞的结果,但它们仍然局限于细化预定义任务的奖励。为了解决这一局限,我们提出了连续开放发现和演化技能作为层次奖励程序(CODE-SHARP)的新框架,该框架利用基础模型(FM)来开放地扩展和细化层次技能档案,该档案以可执行的代码奖励函数有向图形式结构化。我们展示了仅在发现的SHARP技能生成的奖励上训练的目标条件代理学会了在Craftax环境中解决越来越长的时序目标。当由高级FM基计划组成时,发现的技能使单个目标条件代理能够解决复杂的、长时序任务,平均性能比预训练代理和任务特定专家策略高出134%以上。我们将开源我们的代码,并在此提供额外的视频:https://sites.google.com/view/code-sharp/homepage
Summary / 总结
The paper introduces CODE-SHARP, a framework that uses Foundation Models to continuously discover and evolve hierarchical reward programs for agents, enabling them to learn novel skills without predefined rewards. The goal-conditioned agent trained with rewards from discovered skills solves increasingly complex tasks in the Craftax environment, outperforming pretrained agents and task-specific policies by over 134% on average.
研究旨在开发无需预定义奖励函数即可连续发现和学习新技能的代理,解决传统强化学习的局限性。CODE-SHARP 使用基础模型扩展和细化层次技能档案,使代理能够解决越来越复杂的任务。该代理在 Craftax 环境中的表现优于预训练代理和任务特定专家策略,平均高出 134% 以上。
CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment
Authors: Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri
First: 2026-02-08T15:56:22+00:00 · Latest: 2026-02-10T18:48:10+00:00
Abstract
Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Existing LLM-based offensive agent evaluations rely on closed-world settings with predefined goals and binary success criteria. To address this gap, we introduce CyberExplorer, an evaluation suite with two core components: (1) an open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges, where agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations; and (2) a reactive multi-agent framework supporting dynamic exploration without predefined plans. CyberExplorer enables fine-grained evaluation beyond flag recovery, capturing interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, bridging the gap between benchmarks and realistic multi-target attack scenarios.
中文标题/摘要
标题:CyberExplorer:在实际攻击模拟环境中评估LLM的进攻性安全能力
实际的进攻性安全操作本质上是开放式的:攻击者探索未知的攻击面,在不确定性下修订假设,并且没有保证的成功。现有的基于LLM的进攻性代理评估依赖于封闭的世界设置,具有预定义的目标和二元的成功标准。为了解决这一差距,我们引入了CyberExplorer,这是一个评估套件,包含两个核心组件:(1) 一个基于虚拟机的开放环境基准,该虚拟机托管了40个源自真实CTF挑战的漏洞Web服务,代理在没有先验漏洞位置知识的情况下自主进行侦察、目标选择和利用;(2) 一个反应式多代理框架,支持动态探索而无需预定义计划。CyberExplorer使评估超越了旗帜恢复,捕捉交互动态、协调行为、失败模式和漏洞发现信号,填补了基准与现实多目标攻击场景之间的差距。
Anagent For Enhancing Scientific Table & Figure Analysis
Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
First: 2026-02-10T18:46:28+00:00 · Latest: 2026-02-10T18:46:28+00:00
Abstract
In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: https://xhguo7.github.io/Anagent/.
中文标题/摘要
标题:增强科学表格与图表分析的代理工具
在科学研究中,分析需要准确解读复杂的多模态知识,整合来自不同来源的证据,并基于特定领域的知识得出推论。然而,当前的人工智能(AI)系统在一致展示这些能力方面存在困难。科学表格和图表的复杂性和变异性,以及异构结构和长上下文需求,构成了科学表格与图表分析的基本障碍。为了量化这些挑战,我们引入了AnaBench,这是一个包含来自九个科学领域的63,178个实例的大规模基准,系统地按七个复杂维度分类。为应对这些挑战,我们提出了一种多代理框架Anagent,通过四个专门的代理来增强科学表格与图表分析:规划者将任务分解为可执行的子任务,专家通过有针对性的工具执行检索特定任务信息,解决者综合信息生成连贯的分析,评论家通过五维质量评估进行迭代优化。我们进一步开发了模块化的训练策略,利用监督微调和专门的强化学习来优化个体能力,同时保持有效的协作。在170个子领域中的全面评估表明,Anagent实现了显著的改进,在无微调设置中最高可达13.43%的提升,在微调设置中可达42.12%的提升,揭示了任务导向的推理和上下文感知的问题解决对于高质量的科学表格与图表分析至关重要。我们的项目页面:https://xhguo7.github.io/Anagent/.
Summary / 总结
The paper addresses the challenge of accurately analyzing complex scientific tables and figures, which current AI systems often struggle with. It introduces AnaBench, a benchmark with 63,178 instances, and proposes Anagent, a multi-agent framework comprising Planner, Expert, Solver, and Critic agents. Anagent shows significant improvements in analysis quality, up to 13.43% without fine-tuning and 42.12% with fine-tuning, highlighting the importance of task-oriented reasoning and context-aware problem-solving for high-quality analysis.
论文旨在解决当前AI系统在准确分析复杂科学表格和图表方面的挑战。作者引入了包含63,178个实例的AnaBench基准,这些实例来自九个科学领域。他们提出了一个由分解任务的Planner、检索信息的Expert、综合分析的Solver和迭代改进的Critic组成的多智能体框架。实验结果表明,Anagent在分析质量上取得了显著改进,无微调时可提高13.43%,有微调时可提高42.12%,强调了任务导向的推理和上下文感知的问题解决在科学分析中的重要性。
Designing Multi-Robot Ground Video Sensemaking with Public Safety Professionals
Authors: Puqi Zhou, Ali Asgarov, Aafiya Hussain, Wonjoon Park, Amit Paudyal, Sameep Shrestha, Chia-wei Tang, Michael F. Lighthiser, Michael R. Hieb, Xuesu Xiao, Chris Thomas, Sungsoo Ray Hong
First: 2026-02-09T16:43:37+00:00 · Latest: 2026-02-10T18:41:40+00:00
Abstract
Videos from fleets of ground robots can advance public safety by providing scalable situational awareness and reducing professionals' burden. Yet little is known about how to design and integrate multi-robot videos into public safety workflows. Collaborating with six police agencies, we examined how such videos could be made practical. In Study 1, we presented the first testbed for multi-robot ground video sensemaking. The testbed includes 38 events-of-interest (EoI) relevant to public safety, a dataset of 20 robot patrol videos (10 day/night pairs) covering EoI types, and 6 design requirements aimed at improving current video sensemaking practices. In Study 2, we built MRVS, a tool that augments multi-robot patrol video streams with a prompt-engineered video understanding model. Participants reported reduced manual workload and greater confidence with LLM-based explanations, while noting concerns about false alarms and privacy. We conclude with implications for designing future multi-robot video sensemaking tools.
中文标题/摘要
标题:多机器人地面视频智能分析设计与公共安全专业人员合作
来自地面机器人编队的视频可以通过提供可扩展的情境意识并减轻专业人员的负担来促进公共安全。然而,关于如何设计多机器人视频并将其整合到公共安全工作流程中,目前知之甚少。与六家警务机构合作,我们研究了如何使这些视频变得实用。在研究1中,我们提出了第一个多机器人地面视频智能分析测试床。该测试床包括38个与公共安全相关的关注事件(EoI)、一个涵盖各类EoI的20个机器人巡逻视频(10组昼/夜配对)的数据集,以及6项旨在改进当前视频智能分析实践的设计要求。在研究2中,我们构建了MRVS工具,该工具通过一个经过提示工程的视频理解模型增强了多机器人巡逻视频流。参与者报告称,借助基于LLM的解释,手动工作量减少且信心增强,但也提到了误报和隐私方面的担忧。最后,我们总结了对设计未来多机器人视频智能分析工具的启示。
Summary / 总结
The research aims to enhance public safety by designing and integrating multi-robot ground video sensemaking systems. In Study 1, a testbed was developed with 38 events-of-interest, 20 robot patrol videos, and 6 design requirements. In Study 2, MRVS, a tool that integrates a prompt-engineered video understanding model, was created. Participants found the tool reduced manual workload and improved confidence with LLM-based explanations, but raised concerns about false alarms and privacy. The study concludes with implications for future multi-robot video sensemaking tools.
研究旨在通过设计和集成多机器人地面视频感知系统来提升公共安全。研究1创建了一个包含38个关注事件(EoI)、20个机器人巡逻视频和6项设计要求的测试平台,以改进当前的视频感知实践。研究2开发了MRVS工具,该工具使用一个经过提示工程的视频理解模型来增强多机器人巡逻视频流。参与者表示,借助基于LLM的解释,手动工作量减少且信心增强,但也对误报和隐私表示担忧。
CAPID: Context-Aware PII Detection for Question-Answering Systems
Authors: Mariia Ponomarenko, Sepideh Abedini, Masoumeh Shafieinejad, D. B. Emerson, Shubhankar Mohapatra, Xi He
First: 2026-02-10T18:41:31+00:00 · Latest: 2026-02-10T18:41:31+00:00
Comments: Accepted to the Student Research Workshop at EACL 2026
Abstract
Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of them may be contextually relevant to the user's question, resulting in a degradation of response quality. Large language models (LLMs) might be able to help determine which PII are relevant, but due to their closed source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance and type accuracy while preserving significantly higher downstream utility under anonymization.
中文标题/摘要
标题:CAPID:面向问答系统的上下文感知PII检测
在问答系统中,检测用户查询中的个人可识别信息(PII)对于确保隐私至关重要。当前的方法主要对所有PII进行遮盖,忽略了某些PII可能与用户的问题相关,从而导致响应质量下降。大型语言模型(LLMs)可能有助于确定哪些PII是相关的,但由于它们是闭源的且缺乏隐私保证,因此不适合处理敏感数据。为了实现隐私保护的PII检测,我们提出CAPID,这是一种实用的方法,通过微调本地拥有的小型语言模型(SLM)来过滤敏感信息,然后再将这些信息传递给LLMs进行问答。然而,现有的数据集无法捕捉到训练此类模型所需的PII的上下文相关性。为了填补这一空白,我们提出了一种利用LLMs生成合成数据的管道,以生成多样且领域丰富的数据集,涵盖多种PII类型和相关性级别。使用此数据集,我们微调SLM以检测PII片段、分类其类型并估计上下文相关性。我们的实验表明,具有上下文感知的PII检测与微调后的SLM相比,在片段、相关性和类型准确性方面显著优于现有基线,同时在匿名化下保持了更高的下游实用性。
Summary / 总结
CAPID is a method for detecting personally identifiable information (PII) in user queries while preserving privacy. It fine-tunes a small language model (SLM) that filters sensitive information before passing it to large language models (LLMs) for question-answering. To train this model, a synthetic data generation pipeline was developed using LLMs to create a diverse dataset. Experiments show that CAPID outperforms existing methods in span, relevance, and type accuracy, maintaining higher downstream utility under anonymization.
CAPID 是一种用于在保护隐私的同时检测用户查询中的个人可识别信息 (PII) 的方法。它通过微调一个小语言模型 (SLM) 来过滤 PII,然后再将其传递给大型语言模型 (LLMs) 进行问答。该方法使用基于 LLM 的合成数据生成管道来创建一个多样化的数据集,该数据集能够捕捉 PII 的上下文相关性。实验表明,CAPID 在跨度、相关性和类型准确性方面优于现有方法,同时在匿名化下保持了较高的下游实用性。
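The relevance-aware filtering step described in the CAPID entry above can be illustrated with a minimal Python sketch. It assumes the detector has already produced PII spans with a type and a relevance flag; the `PiiSpan` record, the `redact` helper, and the `[TYPE]` placeholder format are hypothetical names chosen for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PiiSpan:
    # Hypothetical span record; the paper's actual output format is not given.
    start: int       # inclusive character offset
    end: int         # exclusive character offset
    pii_type: str    # e.g. "NAME", "AGE"
    relevant: bool   # True if the span is needed to answer the question


def redact(text, spans):
    """Replace only the PII spans marked as irrelevant with a type placeholder."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        pieces.append(text[cursor:span.start])
        if span.relevant:
            pieces.append(text[span.start:span.end])   # keep contextually needed PII
        else:
            pieces.append(f"[{span.pii_type}]")        # mask everything else
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


query = "I'm Ana, 34, what dose is safe at my age?"
name_at = query.index("Ana")
age_at = query.index("34")
spans = [
    PiiSpan(name_at, name_at + 3, "NAME", relevant=False),
    PiiSpan(age_at, age_at + 2, "AGE", relevant=True),
]
print(redact(query, spans))
```

In this toy query the age stays because the question depends on it, while the name is masked before the text would be forwarded to a downstream LLM.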
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Authors: Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou
Venue: ICLR 2026
First: 2025-10-20T11:26:45+00:00 · Latest: 2026-02-10T18:32:44+00:00
Comments: ICLR 2026, Project page: https://falcon-vla.github.io/
Abstract
Existing vision-language-action (VLA) models act in 3D real-world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth, or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.
中文标题/摘要
标题:从空间到行动:在空间基础先验中接地的视觉-语言-行动模型
现有的视觉-语言-行动(VLA)模型在三维真实世界中行动,但通常基于二维编码器,这导致了空间推理的缺口,限制了泛化能力和适应性。最近的三维整合技术要么需要专门的传感器并且在不同模态间迁移效果差,要么注入弱线索缺乏几何信息并降低视觉-语言对齐。在本工作中,我们引入了FALCON(从空间到行动),这是一种新颖的范式,将丰富的三维空间令牌注入到行动头中。FALCON 利用空间基础模型仅从RGB图像中提供强大的几何先验,并包含一个可选地融合深度或姿态的体感空间模型,以提高精度,而无需重新训练或架构更改。为了保持语言推理,空间令牌被空间增强的行动头消费而不是被连接到视觉-语言主干中。这些设计使FALCON能够解决空间表示、模态可迁移性和对齐的限制。在三个模拟基准和十一个真实世界任务的全面评估中,我们提出的FALCON达到了最先进的性能,始终超越了竞争性基线,并且在杂乱、空间提示条件、物体尺度和高度变化下保持稳健。
Summary / 总结
FALCON addresses the spatial reasoning gap in vision-language-action models by integrating rich 3D spatial tokens into the action head, leveraging spatial foundation models to provide strong geometric priors. It includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity. In evaluations across various benchmarks and real-world tasks, FALCON outperforms existing methods and maintains robust performance under clutter and object variations.
FALCON通过将丰富的3D空间令牌集成到动作头中,结合空间基础模型提供强大的几何先验,解决了视觉-语言-动作模型中的空间推理缺口。它包含一个体感空间模型,可以在不重新训练的情况下融合额外的感官数据以提高精度。全面的评估表明,FALCON在各种基准和真实世界任务中均优于现有方法,并且在杂乱环境、空间提示条件以及物体大小和高度变化下表现出稳健性。
Chain of Mindset: Reasoning with Adaptive Cognitive Modes
Authors: Tianyi Jiang, Arctanx An, Hengyi Feng, Naixin Zhai, Haodong Li, Xiaomin Yu, Jiahui Liu, Hanwen Du, Shuo Zhang, Zhi Yang, Jie Huang, Yuhua Li, Yongxin Ni, Huacan Wang, Ronghao Chen
First: 2026-02-10T18:31:47+00:00 · Latest: 2026-02-10T18:31:47+00:00
Abstract
Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets. This single-minded assumption prevents models from reaching the next level of intelligence. To address this limitation, we propose Chain of Mindset (CoM), a training-free agentic framework that enables step-level adaptive mindset orchestration. CoM decomposes reasoning into four functionally heterogeneous mindsets: Spatial, Convergent, Divergent, and Algorithmic. A Meta-Agent dynamically selects the optimal mindset based on the evolving reasoning state, while a bidirectional Context Gate filters cross-module information flow to maintain effectiveness and efficiency. Experiments across six challenging benchmarks spanning mathematics, code generation, scientific QA, and spatial reasoning demonstrate that CoM achieves state-of-the-art performance, outperforming the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, while balancing reasoning efficiency. Our code is publicly available at https://github.com/QuantaAlpha/chain-of-mindset.
中文标题/摘要
标题:思维链:适应性认知模式推理
人类解决问题绝非单一思维模式的重复应用,这意味着一种特定的认知处理模式。面对特定任务时,我们不会依赖单一思维模式,而是将多种思维模式整合到单一的解决方案过程中。然而,现有的大模型推理方法往往陷入一个常见陷阱:它们在所有步骤中应用相同的固定思维模式,忽视了解决同一问题的不同阶段需要根本不同的思维模式。这种单一思维模式的假设阻碍了模型达到更高层次的智能。为解决这一局限,我们提出了一种无需训练的代理框架——思维链(CoM),该框架能够实现步骤级别的适应性思维模式编排。CoM 将推理分解为四个功能异质的思维模式:空间思维、收敛思维、发散思维和算法思维。一个元代理根据推理状态的演变动态选择最优思维模式,而双向上下文门控则过滤模块间的信息流,以保持有效性和效率。跨六个涵盖数学、代码生成、科学问答和空间推理的挑战性基准实验表明,CoM 达到了最先进的性能,在 Qwen3-VL-32B-Instruct 和 Gemini-2.0-Flash 上的整体准确率分别比最强基线高出 4.96% 和 4.72%,同时平衡了推理效率。我们的代码已公开发布于 https://github.com/QuantaAlpha/chain-of-mindset。
Summary / 总结
The paper addresses a limitation of existing large language model (LLM) reasoning methods: they apply a single fixed mindset throughout the problem-solving process, even though different stages call for different mindsets. To overcome this, the authors propose Chain of Mindset (CoM), a training-free framework in which a Meta-Agent dynamically selects among four heterogeneous mindsets (Spatial, Convergent, Divergent, and Algorithmic) based on the evolving reasoning state. Across six challenging benchmarks in mathematics, code generation, scientific QA, and spatial reasoning, CoM outperforms the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, respectively, while balancing reasoning efficiency.
研究旨在通过解决大型语言模型(LLMs)倾向于使用单一固定思维模式的问题,提高其在解决问题时的适应性。提出的Chain of Mindset(CoM)框架引入了步骤级别的适应性思维模式编排,将推理分解为四种不同的思维模式:空间、收敛、发散和算法。实验结果显示,CoM在六个基准测试中分别在Qwen3-VL-32B-Instruct和Gemini-2.0-Flash上比强基线高出4.96%和4.72%的整体准确率,同时保持了效率。该框架设计为无需训练并已公开发布。
Evaluating Disentangled Representations for Controllable Music Generation
Authors: Laura Ibáñez-Martínez, Chukwuemeka Nkama, Andrea Poltronieri, Xavier Serra, Martín Rocamora
Venue: ICASSP 2026
First: 2026-02-10T18:25:04+00:00 · Latest: 2026-02-10T18:25:04+00:00
Comments: Accepted at ICASSP 2026
Abstract
Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
中文标题/摘要
标题:评估可分离表示在可控音乐生成中的应用
音乐生成的近期方法依赖于可分离表示,通常标记为结构和音色或局部和全局,以实现可控合成。然而,这些嵌入的基本属性尚未得到充分探索。在本文中,我们使用基于探针的方法评估了一组音乐音频模型中的可分离表示,该方法超越了标准下游任务。所选模型反映了多种不同的无监督分离策略,包括归纳偏置、数据增强、对抗目标和分阶段训练程序。我们进一步分离特定策略以分析其影响。我们的分析涵盖了四个关键维度:信息量、协变性、不变性和分离性,这些维度在不同数据集、任务和受控变换中进行了评估。我们的发现揭示了嵌入的预期和实际语义之间的一致性问题,表明当前策略未能产生真正可分离的表示,并促使重新审视音乐生成中可控性的实现方式。
Summary / 总结
This study evaluates disentangled representations in music generation models to enable controllable synthesis, using a probing-based framework that assesses informativeness, equivariance, invariance, and disentanglement across various models and datasets. The findings indicate that current strategies do not fully achieve the intended disentanglement, suggesting a need for re-evaluation of controllability approaches in music generation.
该研究评估了音乐生成模型中的解耦表示以提高可控合成能力。通过使用探针框架,研究考察了具有不同无监督解耦策略的模型,包括归纳偏置、数据增强、对抗目标和分阶段训练。研究发现当前策略未能完全实现解耦,实际嵌入的语义与预期不符,这促使对音乐生成中的可控性方法进行重新审视。
Universal computation is intrinsic to language model decoding
Authors: Alex Lewandowski, Marlos C. Machado, Dale Schuurmans
First: 2026-01-12T23:07:50+00:00 · Latest: 2026-02-10T18:21:22+00:00
Comments: Minor formatting corrections
Abstract
Language models now provide an interface to express and often solve general problems in natural language, yet their ultimate computational capabilities remain a major topic of scientific debate. Unlike a formal computer, a language model is trained to autoregressively predict successive elements in human-generated text. We prove that chaining a language model's autoregressive output is sufficient to perform universal computation. That is, a language model can simulate the execution of any algorithm on any input. The challenge of eliciting desired computational behaviour can thus be reframed in terms of programmability: the ease of finding a suitable prompt. Strikingly, we demonstrate that even randomly initialized language models are capable of universal computation before training. This implies that training does not give rise to computational expressiveness; rather, it improves programmability, enabling a natural language interface for accessing these intrinsic capabilities.
中文标题/摘要
标题:通用计算是语言模型解码固有的
语言模型现在提供了一个接口来表达和通常解决自然语言中的通用问题,但它们最终的计算能力仍然是科学界的重大争论话题。与正式计算机不同,语言模型是训练来自回归地预测人类生成文本中的后续元素。我们证明,链接语言模型的自回归输出足以执行通用计算。也就是说,语言模型可以模拟在任何输入上执行任何算法。因此,引发所需计算行为的挑战可以重新表述为可编程性:找到合适提示的容易程度。令人惊讶的是,我们证明即使随机初始化的语言模型在训练前也能够执行通用计算。这表明训练并不会赋予计算表达能力——而是提高可编程性,使自然语言接口能够访问这些固有的能力。
Summary / 总结
The research aims to explore the computational capabilities of language models, proving that they can perform universal computation through chaining their autoregressive output. The study shows that even randomly initialized models can achieve universal computation before training, suggesting that training enhances programmability rather than computational expressiveness. The key finding is that language models inherently possess the ability to simulate any algorithm, which can be accessed through appropriate prompts.
研究旨在探索语言模型的计算能力,这些模型经过训练可以预测文本。研究证明,通过串联模型的输出,可以实现通用计算,即模型可以模拟任何算法。值得注意的是,即使在训练前随机初始化的模型也能够执行通用计算,这表明训练增强了编程能力,而不是计算表达能力。关键发现是,语言模型的内在计算能力可以通过合适的提示来访问。
From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?
Authors: Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu
First: 2025-12-02T18:31:18+00:00 · Latest: 2026-02-10T18:19:05+00:00
Comments: Under review
Abstract
The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.
中文标题/摘要
标题:从调节到调解:大语言模型能否在在线争吵中担任调解人?
大型语言模型(LLMs)的迅速发展为AI向善的应用开辟了新的可能性。随着LLMs越来越多地调解在线交流,它们促进同理心和建设性对话的潜力成为负责任AI研究的重要前沿。本研究探讨LLMs是否不仅能作为检测有害内容的调节者,还能作为理解并缓解在线冲突的调解人。我们的框架将调解分解为两个子任务:判断,即LLM评估对话的公平性和情感动态;引导,即生成同理心、缓解冲突的消息,引导参与者走向解决。为了评估调解质量,我们构建了一个基于Reddit的大规模数据集,并提出了一种结合原则评分、用户模拟和人工对比的多阶段评估管道。实验表明,基于API的模型在调解时在推理和干预一致性方面优于开源版本。我们的研究结果突显了当前LLMs作为在线社会调解新兴代理的潜力和局限性。
Summary / 总结
This study investigates whether large language models (LLMs) can act as mediators in online conflicts, moving beyond their role as moderators. The research decomposes mediation into judgment and steering tasks, and evaluates LLMs using a multi-stage pipeline. Experiments indicate that API-based models outperform open-source models in both reasoning and intervention alignment during mediation, suggesting both the potential and current limitations of LLMs in this role.
研究探讨了大型语言模型(LLMs)是否可以在在线冲突中充当调解人,而不仅仅是作为内容审查员。研究将调解分解为判断和引导两个子任务,并使用多阶段评估管道进行评估。实验表明,基于API的模型在推理和干预一致性方面优于开源模型,这既显示了LLMs在这一角色中的潜力,也揭示了当前的局限性。
Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization
Authors: Xinchen Han, Hossam Afifi, Michel Marot, Xilu Wang, Lu Yin
Venue: ICASSP
First: 2026-02-10T18:15:58+00:00 · Latest: 2026-02-10T18:15:58+00:00
Comments: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2026
Abstract
Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational costs and latency without proportional performance gains. In this paper, we propose Fine-grained Group policy Optimization (FGO), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. Meanwhile, as an enhanced variant of Group Relative Policy Optimization (GRPO), FGO successfully addresses two major limitations of the GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance, and simultaneously resolves the key limitations of GRPO.
中文标题/摘要
标题:通过细粒度组策略优化实现长链思考压缩
大型语言模型(LLMs)经常生成不必要的冗长链思考(CoT)推理,这增加了计算成本和延迟,但没有相应的性能提升。本文提出了一种细粒度组策略优化(FGO)算法,这是一种强化学习(RL)算法,通过细分和分配基于长度和熵的适当权重来精炼组响应,从而实现有效的CoT压缩。同时,作为组相对策略优化(GRPO)的增强变体,FGO成功解决了GRPO的两个主要问题:数据利用效率低和熵坍塌。我们在多个推理LLMs和基准测试上评估了FGO,包括MATH500、AIME24、AMC23和Minerva。实验结果表明,FGO在不降低性能的情况下实现了有效的CoT压缩,并同时解决了GRPO的关键问题。
Summary / 总结
This paper addresses verbose Chain-of-Thought (CoT) generation in Large Language Models (LLMs), which increases computational cost and latency without proportional performance gains. The authors propose Fine-grained Group policy Optimization (FGO), a Reinforcement Learning algorithm that subdivides group responses and weights them by length and entropy to compress CoT. As an enhanced variant of Group Relative Policy Optimization (GRPO), FGO targets two of GRPO's limitations, inefficient data utilization and entropy collapse, and compresses CoT without degrading performance on benchmarks including MATH500, AIME24, AMC23, and Minerva.
本文针对大型语言模型(LLMs)中冗长的链式思考(CoT)推理问题,这种推理增加了计算成本但并未显著提升性能。提出的细粒度组策略优化(FGO)方法通过强化学习将组响应细分并根据长度和熵分配权重,有效压缩了CoT。FGO克服了组相对策略优化(GRPO)的数据利用不足和熵坍塌问题。实验结果表明,FGO可以在不牺牲性能的情况下有效压缩CoT。
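For context on the GRPO baseline named in the FGO entry above, here is a minimal Python sketch of the group-relative advantage that GRPO is commonly described as using: each sampled response's reward is normalized by the mean and standard deviation of its group. This sketches only the baseline under that assumption. The abstract does not give FGO's length- and entropy-based weights, so they are not reproduced here, and the function name and `eps` parameter are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group (GRPO-style baseline).

    rewards: scalar rewards for responses sampled from the same prompt.
    eps: small constant (an assumption here) that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed group: correct responses get positive advantage, wrong ones negative.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))

# Uniform group: every advantage is zero, so the group adds no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0]))
```

The second call shows one way a group can go unused under plain group normalization; whether this is the specific data-utilization issue the authors have in mind is not stated in the abstract.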
In-Context Learning Without Copying
Authors: Kerem Sahin, Sheridan Feucht, Adam Belfki, Jannik Brinkmann, Aaron Mueller, David Bau, Chris Wendler
First: 2025-11-07T22:11:11+00:00 · Latest: 2026-02-10T18:10:20+00:00
Abstract
Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may underlie a wide range of in-context learning (ICL) capabilities. In this work, we investigate whether induction heads are a necessary building block for learning abstractive ICL capabilities (i.e., tasks where the answer is not contained in the input context), or whether such capabilities can emerge independently. We propose Hapax, a training regime that omits the loss contribution of tokens predictable by induction heads. Despite a significant reduction in inductive copying, abstractive ICL capabilities are preserved, with the model achieving higher accuracy than the vanilla model on 13 out of 21 tasks, even though 31.7% of tokens are omitted from the loss. Furthermore, our model achieves lower loss values on token positions that induction heads cannot predict. Mechanistic analysis shows that models trained with Hapax develop fewer and weaker induction heads despite preserving abstractive ICL capabilities. Our findings suggest that the developmental link between induction heads and abstractive ICL capabilities is weaker than previously hypothesized.
中文标题/摘要
标题:上下文学习而不复制
归纳头是通过匹配早期上下文中的模式并原样复制其延续来进行归纳复制的注意力头。随着模型发展出归纳头,它们会经历训练损失的急剧下降,这一现象被用作证据,认为归纳头可能支撑着广泛范围的上下文学习(ICL)能力。在本研究中,我们探讨归纳头是否是学习抽象ICL能力(即答案不在输入上下文中的任务)的必要构建块,或者这些能力是否可以独立出现。我们提出了Hapax,一种训练方案,该方案排除了由归纳头可预测的令牌的损失贡献。尽管显著减少了归纳复制,但抽象ICL能力仍然得到保留,即使在21个任务中有13个任务,模型的准确率高于标准模型,尽管有31.7%的令牌被从损失中省略。此外,我们的模型在归纳头无法预测的令牌位置上实现了更低的损失值。机制分析表明,使用Hapax训练的模型发展出的归纳头更少且更弱,尽管保留了抽象ICL能力。我们的研究结果表明,归纳头与抽象ICL能力之间的发育联系比先前假设的要弱。
Summary / 总结
The study investigates whether induction heads are necessary for abstractive in-context learning (ICL) capabilities, proposing Hapax, a training regime that omits the loss contribution of tokens predictable by induction heads. Despite omitting 31.7% of tokens from the loss, the model retains abstractive ICL capabilities, achieving higher accuracy than the vanilla model on 13 out of 21 tasks and lower loss values on positions that induction heads cannot predict. Models trained with Hapax also develop fewer and weaker induction heads, which suggests that the link between induction heads and abstractive ICL is weaker than previously thought.
研究探讨了归纳头是否是学习抽象的上下文学习(ICL)能力的必要组成部分,还是这些能力可以独立出现。作者提出了Hapax,一种减少由归纳头预测的令牌损失贡献的训练方案。尽管从损失中去除了31.7%的令牌,但模型仍然保留了抽象的ICL能力,在21个任务中有13个任务的准确性更高,并且在归纳头无法预测的位置上实现了更低的损失值。这表明归纳头与抽象ICL之间的关系比之前假设的要弱。
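The idea of omitting copy-predictable tokens from the loss, as described in the Hapax entry above, can be sketched in plain Python. The criterion used here is an assumption made for illustration: a token counts as "predictable by induction" when the bigram formed with its predecessor has already occurred earlier in the context. The paper's actual criterion and implementation may differ, and both function names are hypothetical.

```python
def induction_predictable_mask(tokens):
    """mask[t] is True when tokens[t] could be copied by a prefix-matching rule.

    Assumed rule: the pair (tokens[t-1], tokens[t]) already appeared earlier,
    i.e. the pattern [A][B] ... [A] -> [B] that induction heads complete.
    """
    mask = [False] * len(tokens)
    seen_bigrams = set()
    for t in range(1, len(tokens)):
        bigram = (tokens[t - 1], tokens[t])
        if bigram in seen_bigrams:
            mask[t] = True
        seen_bigrams.add(bigram)
    return mask


def masked_nll(logprobs, mask):
    """Mean negative log-likelihood over positions that are not masked out.

    logprobs[t] is the model's log-probability of tokens[t]; position 0 has
    no preceding context in this toy setup and is skipped.
    """
    kept = [-lp for t, (lp, m) in enumerate(zip(logprobs, mask)) if t > 0 and not m]
    return sum(kept) / len(kept) if kept else 0.0


tokens = ["a", "b", "c", "a", "b"]
mask = induction_predictable_mask(tokens)
print(mask)                                            # only the final "b" is copyable
print(masked_nll([0.0, -1.0, -2.0, -3.0, -4.0], mask))
```

Only the repeated continuation is dropped; every other position still contributes to the loss, which is the property the entry above relies on.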
Simple Image Processing and Similarity Measures Can Link Data Samples across Databases through Brain MRI
Authors: Gaurang Sharma, Harri Polonen, Juha Pajula, Jutta Suksi, Jussi Tohka
First: 2026-02-10T18:10:12+00:00 · Latest: 2026-02-10T18:10:12+00:00
Abstract
Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. But, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on the assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual's skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.
Summary / 总结
The study addresses the privacy risk that individuals can be linked across databases through brain MRI even after potential identifiers are removed. It shows that standard preprocessing followed by image similarity computation is enough to match skull-stripped T1-weighted MRI samples from the same participant with nearly perfect accuracy, despite variations in time interval, scanner type, spatial resolution, and acquisition protocol. The authors present these results as input for forward-looking policies on medical data sharing.
研究旨在指出即使移除了标识符,共享脑MRI数据仍存在隐私风险。研究表明,使用简单的图像处理和相似性度量方法可以在数据库之间链接MRI样本,即使在时间间隔、扫描器类型和采集协议存在差异的情况下也能实现近乎完美的匹配准确率。这强调了需要制定严格的监管政策来应对这些风险。
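The linkage idea in this entry (preprocess, compute an image similarity, match each sample to its most similar counterpart) can be sketched in a few lines. This is a minimal illustration only: the abstract does not name the similarity measure, so Pearson correlation is an assumption here, and the "images" are hypothetical flattened intensity vectors rather than real MRI volumes.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length intensity vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant image carries no pattern to correlate
    return cov / math.sqrt(var_a * var_b)

def link(queries, gallery):
    """Map each query ID to the gallery ID with the highest similarity."""
    return {
        qid: max(gallery, key=lambda gid: pearson(qvec, gallery[gid]))
        for qid, qvec in queries.items()
    }

# Hypothetical database A: two subjects, flattened toy "images".
db_a = {"a1": [1, 5, 2, 8, 3, 9], "a2": [7, 1, 6, 2, 9, 1]}
# Hypothetical database B: the same subjects rescanned, with intensities
# rescaled and slightly perturbed (standing in for scanner differences).
db_b = {
    "b1": [2.1, 10.2, 4.0, 16.1, 6.2, 17.9],
    "b2": [14.3, 2.0, 12.1, 4.2, 17.8, 2.1],
}

matches = link(db_a, db_b)
print(matches)
```

Correlation is invariant to a global rescaling of intensities, which is one reason a measure of this kind can tolerate scanner differences in a toy setting; whether it holds up on real data is exactly what the paper evaluates.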
Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection
Authors: Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu
Venue: ICASSP 2026
First: 2026-02-10T18:10:08+00:00 · Latest: 2026-02-10T18:10:08+00:00
Comments: Accepted by ICASSP 2026
Abstract
Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
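Title/Abstract (translated from Chinese)
Title: Fake-HR1: Rethinking How Vision-Language Models Reason in Synthetic Image Detection
Recent studies show that introducing Chain-of-Thought (CoT) reasoning into the detection process can strengthen a model's ability to detect synthetic images. However, overly long reasoning brings significant resource overhead, including token consumption and latency, and is especially redundant when handling obviously forged images. To address this, we propose Fake-HR1, a large-scale hybrid-reasoning model that is, to our knowledge, the first able to decide adaptively whether reasoning is needed based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: first Hybrid Fine-Tuning (HFT) for cold-start initialization, then online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select the appropriate reasoning mode. Experimental results show that Fake-HR1 reasons adaptively across different types of queries, surpassing existing language models in both reasoning ability and generative detection performance while significantly improving response efficiency.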
Summary
The paper aims to improve the ability of vision-language models to detect synthetic images by incorporating adaptive reasoning. It introduces Fake-HR1, a hybrid-reasoning model that decides whether to use reasoning based on the task characteristics. The model uses a two-stage training framework: Hybrid Fine-Tuning for initialization and Hybrid-Reasoning Grouped Policy Optimization for learning when to use reasoning. Experimental results show that Fake-HR1 outperforms existing models in both reasoning ability and generative detection performance, while also improving response efficiency.
The study aims to improve the ability of vision-language models to detect synthetic images by introducing adaptive reasoning. The method is a two-stage training framework: Hybrid Fine-Tuning (HFT) for initialization and Hybrid-Reasoning Grouped Policy Optimization (HGRPO) for online learning. The key finding is that the proposed Fake-HR1 model outperforms existing models in both reasoning ability and generative detection performance while improving response efficiency.
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Authors: Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham
First: 2024-11-01T14:04:36+00:00 · Latest: 2026-02-10T18:09:04+00:00
Comments: 40 pages
Abstract
Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, which put large-scale studies out of reach of most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. GitHub: https://github.com/Fsoft-AIC/LibMoE.
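Title/Abstract (translated from Chinese)
Title: LIBMoE: A Comprehensive Benchmarking Library for Mixture-of-Experts Architectures in Large Language Models
Mixture-of-experts (MoE) architectures have become a key cornerstone of scaling and an important component of most large language models, such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, the high computational cost of training and evaluation severely limits systematic MoE research and puts large-scale studies out of reach of most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research covering both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analysis tools for probing routing and expert dynamics. On this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, showing how subtle changes in router initialization affect early expert utilization; and (iii) training-regime differences, revealing that sparse upcycling and full pretraining exhibit different routing patterns and stability characteristics. By lowering the barrier to entry, standardizing evaluation, and adding our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovation. GitHub: https://github.com/Fsoft-AIC/LibMoE.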
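Summary
The research introduces LibMoE, a unified framework for MoE (mixture of experts) research, addressing the computational challenges in training and evaluation. It supports pretraining and sparse-upcycling, providing tools to analyze routing dynamics, expert utilization, and training regime differences. Key findings include insights into expert selection patterns, routing stability, and the impact of router initialization on load balancing, which help guide future MoE innovations.
The study introduces LibMoE, a unified framework for MoE (mixture of experts) research that addresses the computational challenges of training and evaluation. It supports pretraining and sparse upcycling, and provides tools for analyzing routing dynamics, expert utilization, and training-regime differences. Key findings include insights into expert selection patterns, routing stability, and the effect of router initialization on load balancing, which help guide future MoE innovation.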
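To make the "routing entropy" notion in this entry concrete, here is a minimal sketch of one common way to quantify how evenly a router spreads tokens over experts: the normalized Shannon entropy of expert usage. It does not use LibMoE's API (which is not described in the abstract); the function name `routing_entropy` and the toy assignment lists are hypothetical.

```python
import math
from collections import Counter

def routing_entropy(assignments, num_experts):
    """Normalized Shannon entropy of expert usage, in [0, 1].

    assignments: one expert index per routed token.
    num_experts: total number of experts (must be at least 2).
    """
    counts = Counter(assignments)
    total = len(assignments)
    entropy = 0.0
    for expert in range(num_experts):
        p = counts.get(expert, 0) / total
        if p > 0:  # unused experts contribute nothing to the sum
            entropy -= p * math.log(p)
    return entropy / math.log(num_experts)

# Hypothetical routing decisions for 16 tokens over 4 experts.
balanced = [0, 1, 2, 3] * 4      # every expert used equally often
collapsed = [0] * 15 + [1]       # almost all tokens sent to expert 0

print(routing_entropy(balanced, 4))   # close to 1: even load
print(routing_entropy(collapsed, 4))  # much lower: routing collapse
```

A value near 1 indicates balanced load, while a value near 0 indicates that a few experts receive most tokens. Computing the same quantity per task or per layer is one plausible way to look for specialization.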
Effectiveness of Binary Autoencoders for QUBO-Based Optimization Problems
Authors: Tetsuro Abe, Masashi Yamashita, Shu Tanaka
First: 2026-02-10T17:59:29+00:00 · Latest: 2026-02-10T17:59:29+00:00
Comments: 14 pages, 5 figures
Abstract
In black-box combinatorial optimization, objective evaluations are often expensive, so high quality solutions must be found under a limited budget. Factorization machine with quantum annealing (FMQA) builds a quadratic surrogate model from evaluated samples and optimizes it on an Ising machine. However, FMQA requires binary decision variables, and for nonbinary structures such as integer permutations, the choice of binary encoding strongly affects search efficiency. If the encoding fails to reflect the original neighborhood structure, small Hamming moves may not correspond to meaningful modifications in the original solution space, and constrained problems can yield many infeasible candidates that waste evaluations. Recent work combines FMQA with a binary autoencoder (bAE) that learns a compact binary latent code from feasible solutions, yet the mechanism behind its performance gains is unclear. Using a small traveling salesman problem as an interpretable testbed, we show that the bAE reconstructs feasible tours accurately and, compared with manually designed encodings at similar compression, better aligns tour distances with latent Hamming distances, yields smoother neighborhoods under small bit flips, and produces fewer local optima. These geometric properties explain why bAE+FMQA improves the approximation ratio faster while maintaining feasibility throughout optimization, and they provide guidance for designing latent representations for black-box optimization.
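Title/Abstract (translated from Chinese)
Title: Effectiveness of Binary Autoencoders in QUBO-Based Optimization Problems
In black-box combinatorial optimization, objective evaluations are often expensive, so high-quality solutions must be found under a limited budget. Factorization machine with quantum annealing (FMQA) builds a quadratic surrogate model and optimizes it on an Ising machine. However, FMQA requires binary decision variables, and for non-binary structures such as integer permutations, the choice of binary encoding strongly affects search efficiency. If the encoding fails to reflect the original neighborhood structure, small Hamming moves may not correspond to meaningful modifications in the original solution space, and constrained problems produce many infeasible candidates that waste evaluations. Recent work combines FMQA with a binary autoencoder (bAE) that learns a compact binary latent code from feasible solutions, but the mechanism behind its performance gains remains unclear. Using a small traveling salesman problem as an interpretable testbed, we show that the bAE reconstructs feasible tours accurately and, compared with manually designed encodings at similar compression, better aligns tour distances with latent Hamming distances, yields smoother neighborhoods under small bit flips, and produces fewer local optima. These geometric properties explain why bAE+FMQA improves the approximation ratio faster while maintaining feasibility, and they provide guidance for designing latent representations for black-box optimization.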
Summary
The paper investigates the effectiveness of binary autoencoders (bAE) in improving the performance of factorization machine with quantum annealing (FMQA) for solving combinatorial optimization problems. The method involves using a bAE to learn a compact binary latent code from feasible solutions, which is then used in FMQA. The key experimental findings show that the bAE reconstructs feasible tours accurately and better aligns tour distances with latent Hamming distances, leading to smoother neighborhoods and fewer local optima, thus improving the approximation ratio faster while maintaining feasibility throughout optimization.
The study examines the effectiveness of binary autoencoders (bAE) when factorization machine with quantum annealing (FMQA) is used to solve combinatorial optimization problems. It focuses on how binary encoding affects search efficiency for non-binary structures. The key findings show that the bAE improves the alignment between tour distances and latent Hamming distances, which yields smoother neighborhoods under small bit flips and fewer local optima, thereby improving the approximation ratio while maintaining feasibility during optimization.
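The encoding problem described in this entry can be seen on a tiny example. The sketch below uses the standard one-hot (permutation-matrix) encoding of a tour as an assumed stand-in for a "manually designed encoding"; the abstract does not say which encodings the paper compares. It shows that every single-bit flip of a feasible tour is infeasible, and that the smallest tour change (swapping two cities) costs four bit flips.

```python
def encode(tour):
    """One-hot encode a tour: bits[t * n + c] = 1 if city c is visited at step t."""
    n = len(tour)
    bits = [0] * (n * n)
    for step, city in enumerate(tour):
        bits[step * n + city] = 1
    return bits

def is_feasible(bits, n):
    """Valid tour: each step holds exactly one city, each city appears exactly once."""
    rows_ok = all(sum(bits[t * n:(t + 1) * n]) == 1 for t in range(n))
    cols_ok = all(sum(bits[t * n + c] for t in range(n)) == 1 for c in range(n))
    return rows_ok and cols_ok

def hamming(a, b):
    """Number of positions where two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

n = 4
base = encode([0, 1, 2, 3])
swapped = encode([0, 2, 1, 3])  # same tour with two adjacent cities swapped

# Count how many single-bit flips of a feasible tour remain feasible.
feasible_neighbors = 0
for i in range(n * n):
    flipped = base.copy()
    flipped[i] ^= 1
    feasible_neighbors += is_feasible(flipped, n)

print(feasible_neighbors)      # 0
print(hamming(base, swapped))  # 4
```

Under this encoding a solver that explores Hamming-1 neighbors spends every evaluation on infeasible candidates, which is the kind of mismatch between bit-space and tour-space neighborhoods that a learned latent code is meant to reduce.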
Perception with Guarantees: Certified Pose Estimation via Reachability Analysis
Authors: Tobias Ladner, Yasser Shoukry, Matthias Althoff
First: 2026-02-10T17:55:49+00:00 · Latest: 2026-02-10T17:55:49+00:00
Abstract
Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring safety of these agents often requires localizing the pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., guaranteeing safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting a certified pose estimation in 3D solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world experiments.
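Title/Abstract (translated from Chinese)
Title: Perception with Guarantees: Certified Pose Estimation Based on Reachability Analysis
Agents in cyber-physical systems increasingly take on safety-critical tasks. Ensuring the safety of these agents usually requires localizing their pose for subsequent actions. Pose estimates can be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains a rough estimate is insufficient to formally establish safety, that is, to guarantee safety in the worst case, and external services may not be trustworthy. We address this problem with certified 3D pose estimation from only a camera image and a known target geometry. This is achieved by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments show that our approach localizes agents efficiently and accurately in both synthetic and real-world experiments.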
Summary
The paper addresses the challenge of ensuring safety in cyber-physical systems by proposing a certified pose estimation method using reachability analysis and formal neural network verification. The method computes the pose from a camera image and a known target geometry, providing guarantees even in worst-case scenarios. Experiments show that the approach accurately localizes agents in both synthetic and real-world settings.
The study aims to ensure the safety of agents in cyber-physical systems by providing a certified pose estimation method. The method uses reachability analysis and formal neural network verification to bound the pose estimate computed from a camera image and a known target geometry. Experiments show that the method localizes agents efficiently and accurately in both synthetic and real-world scenarios.
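As a rough illustration of what "formally bounding" a network output means, the sketch below propagates an input box through a tiny ReLU network with interval arithmetic, one of the simplest set-based techniques used in neural network verification. It is not the paper's method: the network weights, the input box, and the helper names are hypothetical, and the sketch ignores floating-point rounding, which a sound verifier would have to account for.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate an input box through y = W x + b using interval arithmetic."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:  # a negative weight swaps which end gives the extreme
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

def forward(x, layers):
    """Evaluate the network on a concrete input (ReLU between layers only)."""
    for i, (weights, bias) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

# Hypothetical 2-2-1 network.
W1, b1 = [[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0]
W2, b2 = [[1.0, 1.0]], [0.5]

# Hypothetical input uncertainty: each coordinate known to within +/- 0.1.
x_lo, x_hi = [0.9, 1.9], [1.1, 2.1]

h_lo, h_hi = relu_bounds(*affine_bounds(x_lo, x_hi, W1, b1))
y_lo, y_hi = affine_bounds(h_lo, h_hi, W2, b2)
print(y_lo, y_hi)
```

Every input inside the box is guaranteed (up to rounding) to produce an output inside `[y_lo, y_hi]`, which is the kind of worst-case statement a rough point estimate cannot give.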
ALIVE: Animate Your World with Lifelike Audio-Video Generation
Authors: Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu, Pan Xie, Dongjun Qian, Yu Zhang, Ruiqi Li, Yuqi Zhang, Ruibiao Lu, Xiaofeng Mei, Bo Han, Xiang Yin, Bingyue Peng, Zehuan Yuan
First: 2026-02-09T14:06:03+00:00 · Latest: 2026-02-10T17:53:52+00:00
Comments: Technical report for ALIVE. Bytedance ALIVE Team. Homepage: https://foundationvision.github.io/Alive/
Abstract
Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks the Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities compared to the T2V foundation models. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continued pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.
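Title/Abstract (translated from Chinese)
Title: ALIVE: Animate the World with Lifelike Audio-Video Generation
Video generation is rapidly moving toward unified audio-video generation. This paper presents ALIVE, a generation model adapted to Sora-style audio-video generation and animation. In particular, compared with T2V foundation models, the model unlocks Text-to-Video&Audio (T2VA) and reference animation capabilities. To support audio-video synchronization and reference animation, we extend the popular MMDiT architecture with a joint audio-video branch that includes TA-CrossAttn for temporally aligned cross-modal fusion and UniTemp-RoPE for precise audio-video alignment. We also carefully design a comprehensive data pipeline, including audio-video captioning and quality control, to collect high-quality finetuning data. In addition, we introduce a new benchmark for comprehensive model testing and comparison. After continued pretraining and finetuning on million-scale high-quality data, ALIVE shows outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.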
Summary
ALIVE is a generation model that extends a pretrained Text-to-Video model to support Text-to-Video&Audio and Reference-to-Video&Audio generation. It incorporates a joint audio-video branch with TA-CrossAttn and UniTemp-RoPE for precise synchronization and alignment. After extensive pretraining and finetuning on high-quality data, ALIVE shows superior performance, outperforming open-source models and matching state-of-the-art commercial solutions in audio-visual generation tasks.
ALIVE is an enhanced Text-to-Video model that supports audio-video generation and animation, particularly in the Sora style. It introduces a joint audio-video branch, including TA-CrossAttn and UniTemp-RoPE, to improve synchronization and alignment. After training on large amounts of high-quality data, ALIVE shows superior performance, outperforming open-source models and reaching the level of state-of-the-art commercial solutions. A new benchmark is provided for comprehensive model evaluation and comparison.
Overview of the TREC 2025 RAGTIME Track
Authors: Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Eugene Yang, Andrew Yates
First: 2026-02-10T17:47:20+00:00 · Latest: 2026-02-10T17:47:20+00:00
Comments: 10 pages, 3 figures, notebook version of the RAGTIME 2025 overview paper
Abstract
The principal goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track has created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (and as baselines by the track coordinators) across the three tasks. This overview describes these three tasks and presents the available results.
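Title/Abstract (translated from Chinese)
Title: Overview of the TREC 2025 RAGTIME Track
The main goal of the RAG TREC Instrument for Multilingual Evaluation (RAGTIME) track at TREC is to study report generation from multilingual source documents. The track created a document collection containing Arabic, Chinese, English, and Russian news stories. RAGTIME includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval (MLIR). A total of 125 runs were submitted by 13 participating teams (the track coordinators also submitted runs as baselines). This document describes the three tasks and presents the available results.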
Summary
The RAGTIME track at TREC aims to study report generation from multilingual source documents, creating a document collection in Arabic, Chinese, English, and Russian. It includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval. A total of 125 runs were submitted by 13 teams, providing results for these tasks.
The RAGTIME track at TREC aims to evaluate report generation from Arabic, Chinese, English, and Russian texts. It includes three task types: Multilingual Report Generation, English Report Generation, and Multilingual Information Retrieval. A total of 13 teams submitted 125 runs, and the results for these tasks are presented.