arXiv Paper Digest

Muses: Designing, Composing, Generating Nonexistent Fantasy 3D Creatures without Training

Authors: Hexiao Lu, Xiaokun Sun, Zeyu Cai, Hao Guo, Ying Tai, Jian Yang, Zhenyu Zhang

First: 2026-01-06T18:59:57+00:00 · Latest: 2026-01-06T18:59:57+00:00

Comments: Project page: https://luhexiao.github.io/Muses.github.io/

Abstract

We present Muses, the first training-free method for fantastic 3D creature generation in a feed-forward paradigm. Previous methods, which rely on part-aware optimization, manual assembly, or 2D image generation, often produce unrealistic or incoherent 3D assets due to the challenges of intricate part-level manipulation and limited out-of-domain generation. In contrast, Muses leverages the 3D skeleton, a fundamental representation of biological forms, to explicitly and rationally compose diverse elements. This skeletal foundation formalizes 3D content creation as a structure-aware pipeline of design, composition, and generation. Muses begins by constructing a creatively composed 3D skeleton with coherent layout and scale through graph-constrained reasoning. This skeleton then guides a voxel-based assembly process within a structured latent space, integrating regions from different objects. Finally, image-guided appearance modeling under skeletal conditions is applied to generate a style-consistent and harmonious texture for the assembled shape. Extensive experiments establish Muses' state-of-the-art performance in terms of visual fidelity and alignment with textual descriptions, and potential on flexible 3D object editing. Project page: https://luhexiao.github.io/Muses.github.io/.

中文标题/摘要

标题：缪斯：设计、编排和生成非存在幻想3D生物而不需训练

我们提出了缪斯，这是首个无需训练的前馈方法，用于生成幻想3D生物。以往依赖于部分感知优化、手动组装或2D图像生成的方法，由于精细部分操作的复杂性和跨域生成能力有限，往往会产生不现实或不连贯的3D资产。相比之下，缪斯利用了3D骨架，这是一种生物形态的基本表示，以明确和理性的方式编排多样元素。这种骨骼基础将3D内容创作形式化为一种结构感知的设计、编排和生成流水线。缪斯首先通过图约束推理构建一个创意编排的3D骨架，具有连贯的布局和比例。然后，该骨架指导在结构化潜在空间内的体素组装过程，整合来自不同对象的区域。最后，在骨骼条件下应用图像引导的外观建模，以生成与组装形状风格一致且和谐的纹理。大量实验表明，缪斯在视觉保真度和与文本描述的一致性方面达到了最先进的性能，并且在灵活的3D对象编辑方面具有潜力。项目页面：https://luhexiao.github.io/Muses.github.io/

Summary / 总结

Muses is a training-free method for generating fantastic 3D creatures in a feed-forward manner. It uses a 3D skeleton to compose and generate diverse elements, addressing the limitations of previous methods that often produce unrealistic 3D assets. Muses constructs a coherent 3D skeleton through graph-constrained reasoning, guides voxel-based assembly, and applies image-guided appearance modeling to generate style-consistent textures. Experiments show that Muses outperforms existing methods in visual fidelity and alignment with textual descriptions, demonstrating its potential for flexible 3D object editing.

Muses 是一种无需训练的方法，使用前馈方式生成 3D 幻想生物。它通过利用 3D 骨架来组合和生成元素，解决了先前方法中的限制。Muses 首先通过图约束推理创建一个连贯的 3D 骨架，然后在结构化的潜在空间内组装体素，最后应用图像引导的外观建模来生成和谐的纹理。实验表明，Muses 在视觉保真度和与文本描述的对齐方面优于现有方法，并展示了灵活的 3D 对象编辑潜力。

TTrace: Lightweight Error Checking and Diagnosis for Distributed Training

Authors: Haitian Jiang, Shaowei Zhu, Zhen Zhang, Zhenyu Song, Xinwei Fu, Zhen Jia, Yida Wang, Jinyang Li

First: 2025-06-10T22:39:14+00:00 · Latest: 2026-01-06T18:59:23+00:00

Abstract

Distributed training is essential for scaling the training of large neural network models, such as large language models (LLMs), across thousands of GPUs. However, the complexity of distributed training programs makes them particularly prone to silent bugs, which do not produce explicit error signals but lead to incorrect training outcomes. Effectively detecting and localizing such silent bugs in distributed training is challenging. Common debugging practices based on monitoring training loss or gradient norm curves are indirect, inefficient, and provide no way to localize bugs. To address those challenges, we design and implement TTrace, the first systematic differential testing system for detecting and localizing silent bugs in distributed training. TTrace aligns intermediate tensors from distributed training with those from a trusted reference implementation. To properly compare the floating-point values in the corresponding tensors, we propose a novel mathematical analysis that provides a guideline for setting tolerances, enabling TTrace to distinguish bug-induced errors from numerical errors. Experimental results demonstrate that TTrace effectively detects 11 existing bugs and 3 new bugs in the widely used Megatron-LM framework, while requiring fewer than 10 lines of code changes. TTrace is effective in various training recipes, including low-precision recipes involving BF16 and FP8. Notably, a popular open-source training framework has already adopted the method proposed by TTrace in its development workflow.
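The comparison step described above can be illustrated with a minimal sketch. This is not TTrace's API or its tolerance analysis: the function name `first_divergence`, the safety margin `factor`, and the use of flat Python lists as tensors are assumptions made for illustration. The sketch walks two traces in execution order and reports the first intermediate tensor whose relative error against the reference exceeds a tolerance scaled by the machine epsilon of the training precision.

```python
import math

# Spacing between 1.0 and the next representable value:
# bf16 has 7 explicit mantissa bits, fp32 has 23.
MACHINE_EPS = {"bf16": 2.0 ** -7, "fp32": 2.0 ** -23}


def relative_error(candidate, reference):
    """L2 norm of the difference divided by the L2 norm of the reference."""
    diff = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    norm = math.sqrt(sum(r * r for r in reference))
    return diff / norm if norm > 0 else diff


def first_divergence(candidate_trace, reference_trace, dtype="bf16", factor=4.0):
    """Return the name of the first tensor whose error exceeds the tolerance, else None.

    Both traces are lists of (name, flat list of floats) in execution order.
    `factor` is a hypothetical safety margin, not a value from the paper.
    """
    tolerance = factor * MACHINE_EPS[dtype]
    for (name, cand), (ref_name, ref) in zip(candidate_trace, reference_trace):
        assert name == ref_name, "traces must list the same tensors in the same order"
        if relative_error(cand, ref) > tolerance:
            return name
    return None


reference = [("layer0.out", [1.0, 2.0, 3.0]), ("layer1.out", [0.5, -0.5, 1.5])]
# layer0 differs only by rounding-scale noise; layer1 has a flipped sign (a "bug").
candidate = [("layer0.out", [1.0, 2.01, 3.0]), ("layer1.out", [0.5, 0.5, 1.5])]
print(first_divergence(candidate, reference, dtype="bf16"))
```

Under the BF16 tolerance the small perturbation in `layer0.out` is accepted as numerical noise and the sign flip in `layer1.out` is reported; under the much tighter FP32 tolerance the same perturbation in `layer0.out` would already be flagged, which is why the tolerance has to depend on the precision in use.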

中文标题/摘要

标题：TTrace：分布式训练的轻量级错误检查与诊断

分布式训练对于扩展大规模神经网络模型（如大型语言模型LLMs）的训练至关重要，可以在数千个GPU上进行。然而，分布式训练程序的复杂性使其特别容易出现隐性错误，这些错误不会产生明确的错误信号，但会导致训练结果错误。有效地检测和定位这些隐性错误在分布式训练中极具挑战性。基于监控训练损失或梯度范数曲线的常见调试实践是间接的、低效的，并且无法定位错误。为了解决这些挑战，我们设计并实现了TTrace，这是首个系统化的差异测试系统，用于检测和定位分布式训练中的隐性错误。TTrace将分布式训练中的中间张量与可信参考实现中的张量对齐。为了正确比较相应张量中的浮点值，我们提出了一种新颖的数学分析，提供了设置容差的指南，使TTrace能够区分由错误引起的错误和数值错误。实验结果表明，TTrace有效地检测了广泛使用的Megatron-LM框架中的11个已知错误和3个新错误，而代码更改少于10行。TTrace在各种训练配方中都有效，包括涉及BF16和FP8的低精度配方。值得注意的是，一个流行的开源训练框架已经在其开发流程中采用了TTrace提出的方法。

Summary / 总结

TTrace is a systematic differential testing system designed to detect and localize silent bugs in distributed training of large neural network models. By aligning intermediate tensors from distributed training with those of a trusted reference implementation and using a novel mathematical analysis to set tolerances, TTrace detects 11 existing bugs and 3 new bugs in the Megatron-LM framework with fewer than 10 lines of code changes, and it remains effective across training recipes, including low-precision BF16 and FP8.

TTrace 是一个系统性的差异测试系统，用于检测和定位分布式训练中大型神经网络模型中的隐式错误。它将分布式训练中的中间张量与可信的参考实现中的张量对齐，并使用新颖的数学分析来设置容差，区分由错误引起的误差和数值误差。实验表明，TTrace 能有效识别 Megatron-LM 框架中的 14 个错误，并且只需少量代码更改即可实现，且适用于各种训练配方，包括低精度配方。

Aligning Text, Images, and 3D Structure Token-by-Token

Authors: Aadarsh Sahoo, Vansh Tibrewal, Georgia Gkioxari

First: 2025-06-09T17:59:37+00:00 · Latest: 2026-01-06T18:58:50+00:00

Comments: Project webpage: https://glab-caltech.github.io/kyvo/

Abstract

Creating machines capable of understanding the world in 3D is essential in assisting designers who build and edit 3D environments and robots that navigate and interact within a three-dimensional space. Inspired by advances in language and image modeling, we investigate the potential of autoregressive models for a new modality: structured 3D scenes. To this end, we propose a unified LLM framework that aligns language, images, and 3D scenes and provide a detailed "cookbook" outlining critical design choices for achieving optimal training and performance, addressing key questions related to data representation, modality-specific objectives, and more. We show how to tokenize complex 3D objects to incorporate into our structured 3D scene modality. We evaluate performance across four core 3D tasks -- rendering, recognition, instruction-following, and question-answering -- and four 3D datasets, synthetic and real-world. We show our model's effectiveness on reconstructing complete 3D scenes consisting of complex objects from a single image and on real-world 3D object recognition tasks. Project webpage: https://glab-caltech.github.io/kyvo/

中文标题/摘要

标题：逐个对齐文本、图像和3D结构

在帮助设计师构建和编辑3D环境以及机器人在三维空间中导航和互动方面，理解3D世界的机器是必不可少的。受语言和图像建模进展的启发，我们研究了自回归模型在新模态中的潜力：结构化的3D场景。为此，我们提出了一种统一的LLM框架，将语言、图像和3D场景对齐，并详细阐述了实现最佳训练和性能的关键设计选择，包括数据表示、模态特定目标等。我们展示了如何对复杂3D对象进行分词，以纳入我们的结构化3D场景模态。我们在四个核心3D任务——渲染、识别、指令跟随和问答——以及四个3D数据集（合成和真实世界）上评估了性能。我们展示了我们的模型在从单张图像重建包含复杂对象的完整3D场景以及真实世界3D对象识别任务上的有效性。项目网页：https://glab-caltech.github.io/kyvo/

Summary / 总结

The research aims to develop machines capable of understanding 3D environments, essential for designers and robots. The authors propose a unified LLM framework that aligns text, images, and 3D structures, evaluating performance on four core 3D tasks using four datasets. Key findings include the model's effectiveness in reconstructing 3D scenes from images and recognizing real-world 3D objects.

研究旨在开发能够理解3D环境的机器，这对设计师和机器人至关重要。作者提出了一种统一的LLM框架，将文本、图像和3D结构对齐，并在四个3D任务和四个数据集上评估其性能。该模型能够从单张图像重建复杂的3D场景，并在实际的3D物体识别任务中表现出色。

Automated Semantic Rules Detection (ASRD) for Emergent Communication Interpretation

Authors: Bastien Vanderplaetse, Xavier Siebert, Stéphane Dupont

First: 2026-01-06T18:57:39+00:00 · Latest: 2026-01-06T18:57:39+00:00

Abstract

The field of emergent communication within multi-agent systems examines how autonomous agents can independently develop communication strategies, without explicit programming, and adapt them to varied environments. However, few studies have focused on the interpretability of emergent languages. The research presented in this paper proposes an Automated Semantic Rules Detection (ASRD) algorithm, which extracts relevant patterns in messages exchanged by agents trained with two different datasets on the Lewis Game, which is often studied in the context of emergent communication. ASRD helps interpret the emergent communication by relating the extracted patterns to specific attributes of the input data, thereby considerably simplifying subsequent analysis.

中文标题/摘要

标题：自动语义规则检测（ASRD）在新兴通信解释中的应用

多智能体系统中的新兴通信领域研究了自主智能体如何在没有显式编程的情况下独立开发通信策略，并适应不同的环境。然而，很少有研究关注新兴语言的可解释性。本文介绍的研究提出了一种自动语义规则检测（ASRD）算法，该算法通过在两个不同数据集上训练的智能体之间交换的消息提取相关模式，这些数据集是在新兴通信背景下经常研究的利斯游戏。ASRD通过将提取的模式与输入数据的特定属性相关联，有助于新兴通信的解释，从而大大简化了后续分析。

Summary / 总结

This paper addresses the interpretability of emergent languages in multi-agent systems by proposing an Automated Semantic Rules Detection (ASRD) algorithm. The algorithm extracts patterns from messages exchanged by agents trained on the Lewis Game datasets, linking these patterns to specific input attributes to facilitate interpretation. Key findings show that ASRD significantly simplifies the analysis of emergent communication strategies developed by autonomous agents.

本文通过提出自动语义规则检测（ASRD）算法，解决了多智能体系统中新兴语言的可解释性问题。该算法从训练于Lewis游戏数据集的智能体间交换的消息中提取模式，并将这些模式与输入数据的具体属性关联起来，以简化后续分析。主要发现表明，ASRD 显著简化了自主智能体开发的新兴通信策略的分析。

A Versatile Multimodal Agent for Multimedia Content Generation

Authors: Daoan Zhang, Wenlin Yao, Xiaoyang Wang, Yebowen Hu, Jiebo Luo, Dong Yu

First: 2026-01-06T18:49:47+00:00 · Latest: 2026-01-06T18:49:47+00:00

Abstract

With the advancement of AIGC (AI-generated content) technologies, an increasing number of generative models are revolutionizing fields such as video editing, music generation, and even film production. However, due to the limitations of current AIGC models, most models can only serve as individual components within specific application scenarios and are not capable of completing tasks end-to-end in real-world applications. In real-world applications, editing experts often work with a wide variety of image and video inputs, producing multimodal outputs -- a video typically includes audio, text, and other elements. This level of integration across multiple modalities is something current models are unable to achieve effectively. However, the rise of agent-based systems has made it possible to use AI tools to tackle complex content generation tasks. To handle these complex scenarios, we propose a MultiMedia-Agent designed to automate complex content creation. Our agent system includes a data generation pipeline, a tool library for content creation, and a set of metrics for evaluating preference alignment. Notably, we introduce the skill acquisition theory to model the training data curation and agent training. We designed a two-stage correlation strategy for plan optimization, including self-correlation and model preference correlation. Additionally, we utilized the generated plans to train the MultiMedia-Agent via a three-stage approach comprising base plan fine-tuning, success plan fine-tuning, and preference optimization. The comparison results demonstrate that our approaches are effective and that the MultiMedia-Agent can generate better multimedia content compared to novel models.

中文标题/摘要

标题：一种多功能多模态代理用于多媒体内容生成

随着AIGC（AI生成内容）技术的发展，越来越多的生成模型正在改变视频编辑、音乐生成乃至电影制作等领域。然而，由于当前AIGC模型的局限性，大多数模型只能在特定应用场景中作为单一组件发挥作用，无法在实际应用中端到端地完成任务。在实际应用中，编辑专家通常需要处理各种各样的图像和视频输入，产生多模态输出——视频通常包括音频、文本和其他元素。当前模型难以有效实现这种多模态的整合。然而，基于代理系统的兴起使得使用AI工具应对复杂的生成任务成为可能。为了应对复杂的场景，本文提出了一种多媒体代理，旨在自动化复杂内容的创作。我们的代理系统包括数据生成流水线、内容创作工具库以及一套用于评估偏好对齐的指标。值得注意的是，我们引入了技能获取理论来建模训练数据的收集和代理训练。我们设计了一种两阶段相关策略用于计划优化，包括自我相关和模型偏好相关。此外，我们通过三个阶段的方法利用生成的计划来训练多媒体代理，包括基础/成功计划微调和偏好优化。比较结果表明，我们的方法是有效的，多媒体代理能够生成比新型模型更好的多媒体内容。

Summary / 总结

This paper addresses the limitations of current AIGC models by proposing a MultiMedia-Agent designed to handle multimodal content generation tasks. The agent system includes a data generation pipeline, a tool library, and evaluation metrics. The authors use skill acquisition theory for training data curation and agent training, a two-stage correlation strategy for plan optimization, and a three-stage training approach. Experimental results show that the MultiMedia-Agent generates better multimedia content than the novel models it is compared with.

研究旨在解决当前AIGC模型在处理多模态内容生成任务方面的局限性。作者提出了一种MultiMedia-Agent，集成了数据生成管道、内容创作工具库和评估指标。该代理使用技能获取理论和两阶段相关策略进行计划优化，并通过三阶段方法进行训练。实验结果表明，MultiMedia-Agent在生成多媒体内容方面优于新型模型。

Characterizing the Robustness of Black-Box LLM Planners Under Perturbed Observations with Adaptive Stress Testing

Authors: Neeloy Chakraborty, John Pohovey, Melkior Ornik, Katherine Driggs-Campbell

First: 2025-05-08T21:50:43+00:00 · Latest: 2026-01-06T18:46:38+00:00

Comments: 30 pages, 24 figures, 6 tables

Abstract

Large language models (LLMs) have recently demonstrated success in decision-making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in environments where sensors are noisy or unreliable. Characterizing the response of LLM planners to varied observations is necessary to proactively avoid failures in safety-critical scenarios. We specifically investigate the response of LLMs along two different perturbation dimensions. Like prior works, one dimension generates semantically similar prompts with varied phrasing by randomizing the order of details, modifying access to few-shot examples, etc. Unique to our work, the second dimension simulates access to varied sensors and noise to mimic raw sensor or detection algorithm failures. An initial case study in which perturbations are manually applied shows that both dimensions lead LLMs to hallucinate in a multi-agent driving environment. However, manually covering the entire perturbation space for several scenarios is infeasible. As such, we propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasing that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime.
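The abstract does not spell out the search procedure. As a rough illustration of how a tree search can balance trying untested perturbations against revisiting damaging ones, the sketch below uses the standard UCB1 selection rule that is commonly paired with MCTS. The perturbation names, reward values, and exploration constant are hypothetical, and "reward" stands in for some measure of planner uncertainty; none of these come from the paper.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=1.4):
    """Standard UCB1 score: mean reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = total_reward / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus


def select_child(children, parent_visits, c=1.4):
    """Pick the child (a candidate prompt perturbation) with the highest UCB1 score."""
    return max(
        children,
        key=lambda name: ucb1(
            children[name]["reward"], children[name]["visits"], parent_visits, c
        ),
    )


# Hypothetical statistics gathered so far for three perturbations of one prompt.
children = {
    "shuffle_details": {"reward": 1.2, "visits": 4},
    "drop_few_shot": {"reward": 0.9, "visits": 2},
    "add_sensor_noise": {"reward": 0.0, "visits": 0},
}
print(select_child(children, parent_visits=6))
```

With `c=0` the rule degenerates to picking the perturbation with the highest mean reward so far; a larger `c` spends more of the budget on rarely tried perturbations.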

中文标题/摘要

标题：在扰动观测下黑盒大语言模型规划器鲁棒性表征的自适应压力测试

大型语言模型（LLMs）在决策任务中，包括规划、控制和预测方面已经显示出成功，但它们倾向于生成不安全和不希望的输出，这带来了风险。在传感器噪声或不可靠的环境中，这种不良行为进一步加剧。为了在关键安全场景中主动避免失败，有必要表征LLM规划器对不同观测的响应。我们特别研究了LLM在两种不同的扰动维度上的响应。与先前工作类似，一个维度通过随机化细节顺序、修改少量示例的访问等生成语义相似但措辞不同的提示。不同于先前工作，我们的工作中的第二个维度模拟了不同传感器和噪声的访问，以模拟原始传感器或检测算法的故障。初步案例研究显示，手动应用扰动会导致LLM在多智能体驾驶环境中产生幻觉。然而，手动覆盖整个扰动空间在多个场景中是不可行的。因此，我们提出了一种使用蒙特卡洛树搜索（MCTS）的自适应压力测试（AST）方法，以高效地搜索提示扰动的空间。我们的AST公式能够发现导致语言模型以高不确定性甚至崩溃的场景、传感器配置和提示措辞。通过在多种场景下生成MCTS提示扰动树，我们通过大量实验表明，离线分析可以被用来主动理解可能在运行时出现的潜在故障。

Summary / 总结

This study investigates the robustness of large language models (LLMs) in planning tasks under perturbed observations by proposing an adaptive stress testing (AST) method with Monte-Carlo tree search (MCTS). The research characterizes LLMs' behavior under two perturbation dimensions: semantically similar prompts with varied phrasing and varied sensors with noise. Key findings show that both dimensions cause LLMs to hallucinate in a multi-agent driving environment, and the AST method effectively identifies scenarios leading to high uncertainty or crashes, enabling proactive understanding of potential failures.

研究旨在理解大型语言模型（LLMs）在受到观测干扰时的行为，特别是在安全关键场景中的表现。研究考察了两种干扰维度：具有不同措辞的语义相似提示和具有噪声的传感器。为了高效地探索干扰空间，作者提出了使用蒙特卡洛树搜索（MCTS）的自适应压力测试（AST）方法。实验表明，这种方法可以发现导致LLMs行为高度不确定甚至崩溃的场景和提示措辞，从而在运行时环境中实现潜在故障的主动理解。

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

Authors: Di Wu, Yixin Wan, Kai-Wei Chang

First: 2025-05-26T17:59:33+00:00 · Latest: 2026-01-06T18:46:16+00:00

Abstract

Text-to-image retrieval (T2I retrieval) remains challenging because cross-modal embeddings often behave as bags of concepts, underrepresenting structured visual relationships such as pose and viewpoint. We propose Visualize-then-Retrieve (VisRet), a retrieval paradigm that mitigates this limitation of cross-modal similarity alignment. VisRet first projects textual queries into the image modality via T2I generation, then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Across four benchmarks (Visual-RAG, INQUIRE-Rerank, Microsoft COCO, and our new Visual-RAG-ME featuring multi-entity comparisons), VisRet substantially outperforms cross-modal similarity matching and baselines that recast T2I retrieval as text-to-text similarity matching, improving nDCG@30 by 0.125 on average with CLIP as the retriever and by 0.121 with E5-V. For downstream question answering, VisRet increases accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7% in top-1 retrieval, and by 3.9% and 11.1% in top-10 retrieval. Ablation studies show compatibility with different T2I instruction LLMs, T2I generation models, and downstream LLMs. VisRet provides a simple yet effective perspective for advancing text-image retrieval. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
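For readers unfamiliar with the nDCG@k metric quoted above, here is a minimal sketch of one common formulation (linear gain, log2 discount). It assumes `relevances` lists the graded relevance of every candidate in ranked order, so that the ideal ordering can be obtained by sorting; the benchmarks in the paper may use a different gain function or compute the ideal ranking over a larger pool.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (rank 1 has discount log2(2) = 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Two rankings of the same three images; 1 = relevant, 0 = not relevant.
print(ndcg_at_k([1, 0, 1], k=3))  # a relevant image is ranked first
print(ndcg_at_k([0, 1, 1], k=3))  # the top slot is wasted on an irrelevant image
```

Because the score is normalized to the range [0, 1], an average improvement of 0.125 corresponds to a sizeable shift of relevant images toward the top of the ranking.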

中文标题/摘要

标题：VisRet：可视化提高知识密集型文本到图像检索

文本到图像检索（T2I检索）仍然具有挑战性，因为跨模态嵌入往往表现为概念的集合，未能充分表示诸如姿态和视角等结构化的视觉关系。我们提出了可视化然后检索（VisRet）检索范式，以缓解跨模态相似性对齐的这一局限性。VisRet 首先通过T2I生成将文本查询投影到图像模态，然后在图像模态内进行检索，以绕过跨模态检索器在识别细微的视觉空间特征方面的弱点。在四个基准测试（Visual-RAG、INQUIRE-Rerank、Microsoft COCO以及我们的新基准Visual-RAG-ME，包含多实体比较）上，VisRet 显著优于跨模态相似性匹配和将T2I检索重新表述为文本到文本相似性匹配的基线，使用CLIP作为检索器时，平均nDCG@30提高了0.125，使用E5-V时提高了0.121。对于下游问答，VisRet 在Visual-RAG和Visual-RAG-ME上的top-1检索准确率分别提高了3.8%和15.7%，top-10检索准确率分别提高了3.9%和11.1%。消融研究显示，VisRet 与不同的T2I指令LLM、T2I生成模型和下游LLM兼容。VisRet 提供了一种简单而有效的视角，以推进文本图像检索。我们的代码和新基准已公开发布在https://github.com/xiaowu0162/Visualize-then-Retrieve。

Summary / 总结

The paper addresses the challenge of text-to-image retrieval by proposing VisRet, which first converts textual queries into images through text-to-image generation and then retrieves images within the image modality. VisRet outperforms cross-modal similarity matching and text-to-text similarity matching methods across four benchmarks, improving nDCG@30 by an average of 0.125 with CLIP and 0.121 with E5-V. It also improves question answering accuracy on Visual-RAG and Visual-RAG-ME by 3.8% and 15.7%, respectively, in top-1 retrieval and by 3.9% and 11.1% in top-10 retrieval.

论文提出VisRet，该方法首先通过文本到图像生成将文本查询转换为图像模态，然后在图像模态中进行检索。这种方法在CLIP和E5-V上分别将nDCG@30平均提高了0.125和0.121。它还在Visual-RAG和Visual-RAG-ME基准上的top-1检索中提高了3.8%到15.7%的问答准确性，在top-10检索中提高了3.9%到11.1%。消融研究证实了其与各种模型的兼容性。VisRet提供了一种简单而有效的文本到图像检索解决方案。

STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning

Authors: Juntong Ni, Shiyu Wang, Ming Jin, Qi He, Wei Jin

First: 2026-01-06T18:46:12+00:00 · Latest: 2026-01-06T18:46:12+00:00

Comments: preprint, we release our code publicly at https://github.com/LingFengGold/STReasoner

Abstract

Spatio-temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context. This capability is vital for high-stakes decision-making in systems such as traffic networks, power grids, and disease propagation. However, the field remains underdeveloped because most existing works prioritize predictive accuracy over reasoning. To address the gap, we introduce ST-Bench, a benchmark consisting of four core tasks (etiological reasoning, entity identification, correlation reasoning, and in-context forecasting), developed via a network SDE-based multi-agent data synthesis pipeline. We then propose STReasoner, which empowers LLMs to integrate time series, graph structure, and text for explicit reasoning. To promote spatially grounded logic, we introduce S-GRPO, a reinforcement learning algorithm that rewards performance gains specifically attributable to spatial information. Experiments show that STReasoner achieves average accuracy gains between 17% and 135% at only 0.004X the cost of proprietary models and generalizes robustly to real-world data.

中文标题/摘要

标题：STReasoner：通过空间感知强化学习赋能LLM在时间序列中的时空推理

时间序列中的时空推理涉及显式合成时间动态、空间依赖性和文本上下文。这种能力对于交通网络、电力网络和疾病传播等高风险决策系统至关重要。然而，该领域仍处于起步阶段，因为大多数现有工作更侧重于预测准确性而非推理。为解决这一差距，我们引入了ST-Bench，这是一个包含四个核心任务的基准，包括病因推理、实体识别、相关性推理和上下文内预测，这些任务是通过基于网络SDE的多智能体数据合成管道开发的。然后，我们提出了STReasoner，它使LLM能够整合时间序列、图结构和文本进行显式推理。为了促进空间地逻辑，我们引入了S-GRPO，这是一种强化学习算法，奖励那些特别归因于空间信息的表现提升。实验表明，STReasoner在仅0.004倍于专有模型成本的情况下实现了17%至135%的平均准确率提升，并且能够稳健地泛化到真实世界数据。

Summary / 总结

The paper introduces STReasoner, a method that enhances LLMs for spatio-temporal reasoning in time series through spatial-aware reinforcement learning. STReasoner addresses a gap in existing work by focusing on reasoning rather than predictive accuracy alone. Experiments show that STReasoner achieves average accuracy gains of 17% to 135% at only 0.004X the cost of proprietary models and generalizes robustly to real-world data.

论文提出了STReasoner方法，通过结合空间感知强化学习来增强LLMs在时间序列中的时空推理能力。STReasoner通过关注推理而非仅预测准确性来填补现有工作的空白。实验结果显示，STReasoner在与专有模型成本相比的情况下，实现了17%到135%的显著准确率提升，并且在实际数据上表现出良好的泛化能力。

ShareChat: A Dataset of Chatbot Conversations in the Wild

Authors: Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le

First: 2025-12-19T17:47:53+00:00 · Latest: 2026-01-06T18:45:37+00:00

Abstract

While academic research typically treats Large Language Models (LLMs) as generic text generators, they are distinct commercial products with unique interfaces and capabilities that fundamentally shape user behavior. Current datasets obscure this reality by collecting text-only data through uniform interfaces that fail to capture authentic chatbot usage. To address this limitation, we present ShareChat, a large-scale corpus of 142,808 conversations (660,293 turns) sourced directly from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat distinguishes itself by preserving native platform affordances, such as citations and thinking traces, across a diverse collection covering 101 languages and the period from April 2023 to October 2025. Furthermore, ShareChat offers substantially longer context windows and greater interaction depth than prior datasets. To illustrate the dataset's breadth, we present three case studies: a completeness analysis of intent satisfaction, a citation study of model grounding, and a temporal analysis of engagement rhythms. This work provides the community with a vital and timely resource for understanding authentic user-LLM chatbot interactions in the wild. The dataset will be publicly available.

中文标题/摘要

标题：ShareChat：野生聊天机器人对话数据集

尽管学术研究通常将大型语言模型（LLM）视为通用文本生成器，但它们是独特的商业产品，具有独特的界面和能力，从根本上影响用户行为。当前的数据集通过统一的界面收集文本数据，未能捕捉到真实的聊天机器人使用情况。为解决这一局限性，我们介绍了ShareChat，这是一个包含142,808场对话（660,293轮对话）的大规模语料库，直接来源于ChatGPT、Perplexity、Grok、Gemini和Claude等公开共享的URL。ShareChat通过保留来自101种语言的多样化集合中的原生平台功能（如引用和思考痕迹），并在2023年4月至2025年10月期间覆盖了这一时期。此外，ShareChat提供了比先前数据集更长的上下文窗口和更深入的交互。为了展示数据集的广度，我们提出了三个案例研究：意图满足的完整性分析、模型基础的引用研究以及参与节奏的时间分析。这项工作为社区提供了一个重要的及时资源，用于理解真实的用户-LLM聊天机器人交互。该数据集将公开可用。

Summary / 总结

The paper introduces ShareChat, a dataset of 142,808 conversations (660,293 turns) collected from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude, aimed at capturing authentic user interactions. It distinguishes itself by preserving native platform features such as citations and thinking traces, and it covers 101 languages over the period from April 2023 to October 2025. The dataset offers longer context windows and greater interaction depth than previous datasets, and its breadth is illustrated through case studies on intent satisfaction, model grounding, and engagement rhythms. The dataset offers a valuable resource for studying real-world chatbot interactions.

论文介绍了ShareChat数据集，包含来自多个聊天机器人的142,808次对话（660,293轮次），旨在捕捉真实的用户交互。该数据集保留了原生平台功能，并覆盖了101种语言，时间跨度为两年。主要发现包括比以往数据集更长的上下文窗口和更深入的交互。案例研究展示了数据集的广泛性和对用户-LLM聊天机器人交互研究的实用性。

Self-Routing RAG: Binding Selective Retrieval with Knowledge Verbalization

Authors: Di Wu, Jia-Chen Gu, Kai-Wei Chang, Nanyun Peng

First: 2025-04-01T17:59:30+00:00 · Latest: 2026-01-06T18:40:59+00:00

Abstract

Selective retrieval aims to make retrieval-augmented generation (RAG) more efficient and reliable by skipping retrieval when an LLM's parametric knowledge suffices. Despite promising results, existing methods are constrained by a binary design choice: either retrieve from a single external source or skip retrieval and let the LLM directly produce the final answer. We argue that this fallback underestimates the model's knowledge and obscures the more general multi-source decision problem that arises in practical systems. We propose Self-Routing RAG (SR-RAG), which casts selective retrieval as knowledge source selection and treats the LLM itself as a first-class knowledge source. SR-RAG learns to select an appropriate knowledge source, optionally verbalize parametric knowledge, and answer using the selected source, all within a single left-to-right generation pass. SR-RAG further augments source selection by combining LLM-based uncertainty with a flexible external policy datastore to improve decision calibration. Across four benchmarks and three 7B-class LLMs, SR-RAG outperforms a strong selective retrieval baseline by 8.5%/2.1%/4.7% while performing 26%/40%/21% fewer retrievals, and it achieves favorable accuracy-latency trade-offs without dataset-specific threshold tuning.

中文标题/摘要

标题：自我路由RAG：选择性检索与知识口头表达的结合

选择性检索旨在通过在LLM的参数化知识足以时跳过检索来使检索增强生成（RAG）更加高效和可靠。尽管取得了令人鼓舞的结果，但现有方法受限于二元设计选择：要么从单一外部来源检索，要么跳过检索让LLM直接生成最终答案。我们认为这种退化低估了模型的知识，并掩盖了在实际系统中出现的更一般的多源决策问题。我们提出了自我路由RAG（SR-RAG），将其选择性检索视为知识来源选择，并将LLM本身视为一级知识来源。SR-RAG学习在单个从左到右的生成过程中选择适当的知识来源，可选地口头表达参数化知识，并使用选定的来源作答。SR-RAG进一步通过结合LLM基础的不确定性与灵活的外部策略数据存储来增强来源选择，以提高决策校准。在四个基准和三个7B级LLM上，SR-RAG在执行26%/40%/21%更少的检索的同时，比强选择性检索基线高出8.5%/2.1%/4.7%，并且在无需针对特定数据集进行阈值调整的情况下实现了有利的准确率-延迟权衡。

Summary / 总结

The research aims to improve the efficiency and reliability of retrieval-augmented generation (RAG) by letting the model choose between retrieving from external sources and using its own parametric knowledge. The proposed Self-Routing RAG (SR-RAG) learns to select the appropriate knowledge source, optionally verbalize parametric knowledge, and answer the query in a single generation pass, using a combination of LLM-based uncertainty and a flexible external policy datastore. Across four benchmarks and three 7B-class LLMs, SR-RAG outperforms a strong selective retrieval baseline by 8.5%/2.1%/4.7% while performing 26%/40%/21% fewer retrievals. It also achieves favorable accuracy-latency trade-offs without requiring dataset-specific threshold tuning.

研究旨在通过让模型选择性地使用外部知识源或利用自身知识来提高检索增强生成（RAG）的效率和可靠性。方法是Self-Routing RAG（SR-RAG），它在一个生成过程中学习选择合适的知识源并生成答案，结合了LLM的不确定性与外部策略数据存储。关键发现表明，SR-RAG在不同基准测试中的表现优于强大的选择性检索基线8.5%到4.7%，同时减少了26%到40%的检索次数。此外，它在不需要特定数据集阈值调优的情况下实现了更好的准确性和延迟之间的权衡。

Kolmogorov-Arnold Energy Models: Fast and Interpretable Generative Modeling

Authors: Prithvi Raj

First: 2025-06-17T04:07:32+00:00 · Latest: 2026-01-06T18:32:43+00:00

Abstract

Learning an energy-based model (EBM) in the latent space of a top-down generative model offers a powerful framework for generation across many data modalities. However, it remains unclear how its interpretability can be used to guide model design, improve generative quality, and reduce training time. Moreover, the reliance on Langevin Monte Carlo (LMC) sampling presents challenges in efficiency and sampling multimodal latent distributions. We propose a novel adaptation of the Kolmogorov-Arnold representation theorem for generative modeling and introduce the Kolmogorov-Arnold Energy Model (KAEM) to take advantage of structural and inductive biases. By constraining the prior to univariate relationships, KAEM enables fast and exact inference via the inverse transform method. With the low dimensionality of the latent space and suitable inductive biases encoded, we demonstrate that importance sampling (IS) becomes a viable, unbiased, and highly efficient posterior sampler. For domains where IS fails, we introduce a strategy based on population-based LMC, decomposing the posterior into a sequence of annealed distributions to improve LMC mixing. KAEM balances common generative modeling trade-offs, offering fast inference, interpretability, and stable training, while being naturally suited to Zettascale Computing hardware.
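The inverse transform method mentioned above works for any univariate distribution: draw u uniformly from [0, 1) and return the x whose cumulative distribution value equals u. The sketch below is a numerical approximation on a grid, not KAEM's implementation; the bimodal `energy` function, the grid size, and the function names are hypothetical and chosen only to show that a multimodal one-dimensional density can be sampled without any Markov chain.

```python
import bisect
import math
import random


def build_cdf(density, lo, hi, n=1000):
    """Tabulate the normalized CDF of an unnormalized density on [lo, hi] (trapezoid rule)."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    cdf = [0.0]
    for a, b in zip(xs, xs[1:]):
        cdf.append(cdf[-1] + 0.5 * (density(a) + density(b)) * (b - a))
    total = cdf[-1]
    return xs, [c / total for c in cdf]


def sample(xs, cdf, u):
    """Map a uniform draw u in [0, 1) to x with CDF(x) = u, interpolating linearly."""
    i = bisect.bisect_right(cdf, u)
    i = min(max(i, 1), len(xs) - 1)
    c0, c1 = cdf[i - 1], cdf[i]
    t = 0.0 if c1 == c0 else (u - c0) / (c1 - c0)
    return xs[i - 1] + t * (xs[i] - xs[i - 1])


def energy(x):
    """Hypothetical double-well energy with modes near x = -1 and x = +1."""
    return 4.0 * (x * x - 1.0) ** 2


xs, cdf = build_cdf(lambda x: math.exp(-energy(x)), lo=-2.5, hi=2.5)
rng = random.Random(0)
draws = [sample(xs, cdf, rng.random()) for _ in range(5)]
print(draws)
```

Each draw costs one uniform random number and one binary search, and draws land in both modes in proportion to their probability mass, which is the property that iterative samplers such as Langevin Monte Carlo can struggle with on multimodal targets.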

中文标题/摘要

标题：柯尔莫哥洛夫-阿诺尔德能量模型：快速且可解释的生成建模

在自顶向下生成模型的潜在空间中学习基于能量的模型（EBM）提供了一种强大的框架，可用于多种数据模态的生成。然而，其可解释性如何用于指导模型设计、提高生成质量并减少训练时间仍不清楚。此外，对朗之万蒙特卡洛（LMC）采样的依赖性在效率和采样多模态潜在分布方面提出了挑战。我们提出了一种柯尔莫哥洛夫-阿诺尔德表示定理在生成建模中的新颖应用，并引入了柯尔莫哥洛夫-阿诺尔德能量模型（KAEM）以利用结构和归纳偏置。通过将先验约束为单变量关系，KAEM 通过反变换方法实现快速且精确的推理。凭借潜在空间的低维度和合适的归纳偏置编码，我们证明了重要性采样（IS）成为一种可行、无偏且高效的后验采样器。对于IS失败的领域，我们引入了一种基于群体的LMC策略，将后验分解为一系列退火分布以改善LMC混合。KAEM 平衡了常见的生成建模权衡，提供了快速推理、可解释性和稳定训练，同时自然适合Zettascale计算硬件。

Summary / 总结

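This work aims to improve the interpretability and efficiency of energy-based models (EBMs) in generative modeling. The author proposes the Kolmogorov-Arnold Energy Model (KAEM), which uses the Kolmogorov-Arnold representation theorem to constrain the prior to univariate relationships, enabling fast and exact inference via the inverse transform method. KAEM uses importance sampling (IS) as an efficient posterior sampler and, for domains where IS fails, a population-based Langevin Monte Carlo (LMC) strategy that decomposes the posterior into annealed distributions to improve mixing. The model offers fast inference, interpretability, and stable training, and is suited to Zettascale Computing hardware.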

研究旨在提高生成模型中能量基模型（EBM）的可解释性和效率。作者提出了Kolmogorov-Arnold Energy Model（KAEM），利用Kolmogorov-Arnold表示定理将先验约束为单变量关系，从而实现快速且精确的推理。KAEM 还引入了重要性采样（IS）作为高效的后验采样器，并在IS失败的情况下，提出了基于群体的Langevin Monte Carlo（LMC）策略以改善LMC的混合效果。该模型展示了快速推理、可解释性和稳定的训练性能，并且适合Zettascale Computing硬件。

MAGMA: A Multi-Graph based Agentic Memory Architecture for AI Agents

Authors: Dongming Jiang, Yi Li, Guanpeng Li, Bingzhe Li

First: 2026-01-06T18:29:43+00:00 · Latest: 2026-01-06T18:29:43+00:00

Abstract

Memory-Augmented Generation (MAG) extends Large Language Models with external memory to support long-context reasoning, but existing approaches largely rely on semantic similarity over monolithic memory stores, entangling temporal, causal, and entity information. This design limits interpretability and alignment between query intent and retrieved evidence, leading to suboptimal reasoning accuracy. In this paper, we propose MAGMA, a multi-graph agentic memory architecture that represents each memory item across orthogonal semantic, temporal, causal, and entity graphs. MAGMA formulates retrieval as policy-guided traversal over these relational views, enabling query-adaptive selection and structured context construction. By decoupling memory representation from retrieval logic, MAGMA provides transparent reasoning paths and fine-grained control over retrieval. Experiments on LoCoMo and LongMemEval demonstrate that MAGMA consistently outperforms state-of-the-art agentic memory systems in long-horizon reasoning tasks.
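To make the idea of separate relational views concrete, here is a toy sketch, not MAGMA's implementation. The class `MultiGraphMemory`, the keyword-based `choose_views` policy, and the example memories are all hypothetical. Each memory item is stored once, while its semantic, temporal, causal, and entity links live in separate adjacency maps, so retrieval can follow only the views that match the query's intent.

```python
from collections import deque


class MultiGraphMemory:
    """Memory items stored once, with one adjacency map per relational view."""

    def __init__(self):
        self.items = {}
        self.edges = {"semantic": {}, "temporal": {}, "causal": {}, "entity": {}}

    def add_item(self, item_id, text):
        self.items[item_id] = text

    def link(self, view, src, dst):
        self.edges[view].setdefault(src, []).append(dst)

    def traverse(self, start, views, max_hops=2):
        """Breadth-first walk from `start`, following only edges in the chosen views."""
        seen = [start]
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_hops:
                continue
            for view in views:
                for nxt in self.edges[view].get(node, []):
                    if nxt not in seen:
                        seen.append(nxt)
                        queue.append((nxt, depth + 1))
        return seen


def choose_views(query):
    """Toy stand-in for a retrieval policy: keyword rules pick which graphs to follow."""
    q = query.lower()
    if "why" in q:
        return ["causal"]
    if "when" in q or "before" in q or "after" in q:
        return ["temporal"]
    return ["semantic", "entity"]


memory = MultiGraphMemory()
memory.add_item("m1", "Alice missed the train.")
memory.add_item("m2", "Alice arrived late to the meeting.")
memory.add_item("m3", "Alice bought a car.")
memory.link("causal", "m2", "m1")    # m2 is explained by m1
memory.link("temporal", "m2", "m3")  # m3 happened after m2

for query in ["Why was Alice late?", "What happened after Alice arrived late?"]:
    path = memory.traverse("m2", choose_views(query))
    print(query, [memory.items[i] for i in path])
```

The returned path doubles as an explanation of why each item was retrieved, which is one way to read the abstract's claim about transparent reasoning paths.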

中文标题/摘要

标题：MAGMA：基于多图的智能体记忆架构

记忆增强生成（MAG）通过外部记忆扩展大型语言模型，以支持长上下文推理，但现有方法主要依赖于单一记忆存储的语义相似性，将时间、因果和实体信息交织在一起。这种设计限制了可解释性和查询意图与检索证据之间的对齐，导致推理准确性不足。在本文中，我们提出了一种MAGMA多图智能体记忆架构，该架构将每个记忆项表示为语义、时间、因果和实体图的正交表示。MAGMA将检索形式化为受策略引导的这些关系视图上的遍历，从而实现查询自适应选择和结构化上下文构建。通过将记忆表示与检索逻辑解耦，MAGMA提供了透明的推理路径和对检索的精细控制。在LoCoMo和LongMemEval上的实验表明，MAGMA在长时域推理任务中始终优于最先进的智能体记忆系统。

Summary / 总结

MAGMA is a multi-graph agentic memory architecture that addresses the limitations of existing Memory-Augmented Generation (MAG) systems by representing each memory item across semantic, temporal, causal, and entity graphs. Retrieval becomes policy-guided traversal over these graphs with query-adaptive selection, which the authors report improves reasoning accuracy on long-horizon tasks. In experiments on the LoCoMo and LongMemEval benchmarks, MAGMA outperforms state-of-the-art agentic memory systems.
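To make the retrieval idea in the MAGMA entry concrete, here is a minimal sketch, not the authors' implementation. It assumes a toy memory with one adjacency list per relational view and a keyword-based policy that decides which views to traverse. The names `GRAPHS`, `POLICY`, `select_views`, and `retrieve` are hypothetical and do not come from the paper.

```python
from collections import deque

# Hypothetical toy memory: one adjacency list per relational view.
GRAPHS = {
    "semantic": {"m1": ["m2"], "m2": ["m1"]},
    "temporal": {"m1": ["m3"], "m3": ["m4"]},
    "causal": {"m3": ["m2"]},
    "entity": {"m1": ["m4"]},
}

# Hypothetical policy: a cue word in the query selects the views to traverse.
POLICY = {"when": ["temporal"], "why": ["causal", "temporal"], "who": ["entity"]}


def select_views(query):
    """Pick graph views from the first matching cue; fall back to semantic."""
    words = query.lower().split()
    for cue, views in POLICY.items():
        if cue in words:
            return views
    return ["semantic"]


def retrieve(query, seed, max_hops=2):
    """Breadth-first traversal from a seed item over the selected views only."""
    views = select_views(query)
    seen, order = {seed}, [seed]
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop budget
        for view in views:
            for nxt in GRAPHS[view].get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    frontier.append((nxt, depth + 1))
    return views, order
```

In this toy setup, the same seed item yields different context depending on query intent: a "when" query follows only temporal edges, while a "why" query also follows causal edges. This is the kind of query-adaptive selection the abstract describes, in a heavily simplified form.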

LTX-2: Efficient Joint Audio-Visual Foundation Model

Authors: Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, Eitan Richardson, Guy Shiran, Itay Chachy, Jonathan Chetboun, Michael Finkelson, Michael Kupchick, Nir Zabari, Nitzan Guetta, Noa Kotler, Ofir Bibi, Ori Gordon, Poriya Panet, Roi Benita, Shahar Armon, Victor Kulikov, Yaron Inger, Yonatan Shiftan, Zeev Melumian, Zeev Farbman

First: 2026-01-06T18:24:41+00:00 · Latest: 2026-01-06T18:24:41+00:00

Abstract

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene -- complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.

中文标题/摘要

标题：LTX-2：高效联合音频-视觉基础模型

近期的文本到视频扩散模型可以生成引人入胜的视频序列，但它们仍然无声——缺少音频提供的语义、情感和氛围提示。我们引入了LTX-2，这是一个开源的基础模型，能够以统一的方式生成高质量、时间同步的音频-视觉内容。LTX-2 由一个不对称的双流变压器组成，包含一个140亿参数的视频流和一个50亿参数的音频流，通过双向音频-视频交叉注意力层和时间位置嵌入以及跨模态AdaLN进行耦合，以实现共享时间步长条件。这种架构使统一的音频-视觉模型的高效训练和推理成为可能，同时为视频生成分配了更多的容量，而音频生成则较少。我们使用多语言文本编码器以获得更广泛的提示理解，并引入了一种模态感知的无条件指导机制（模态-CFG），以提高音频-视觉对齐和可控性。除了生成语音，LTX-2 还生成丰富、连贯的音频轨道，跟随每个场景的角色、环境、风格和情感——包括自然的背景音和拟音元素。在我们的评估中，该模型在开源系统中实现了最先进的音频-视觉质量和提示一致性，同时以远低于专有模型的计算成本和推理时间提供类似的结果。所有模型权重和代码均已公开发布。

Summary / 总结

LTX-2 is an open-source foundation model that generates high-quality, temporally synchronized audiovisual content in a unified manner. It uses an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers, together with a modality-aware classifier-free guidance mechanism. The authors report state-of-the-art audiovisual quality and prompt adherence among open-source systems, with results comparable to proprietary models at a fraction of their computational cost and inference time. The model produces rich, coherent audio tracks that match the characters, environment, style, and emotion of each scene, including natural background and foley elements. All model weights and code are publicly released.

AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise

Authors: Tara Bogavelli, Roshnee Sharma, Hari Subramani

First: 2025-09-13T01:18:23+00:00 · Latest: 2026-01-06T18:18:48+00:00

Abstract

While individual components of agentic architectures have been studied in isolation, there remains limited empirical understanding of how different design dimensions interact within complex multi-agent systems. This study aims to address these gaps by providing a comprehensive enterprise-specific benchmark evaluating 18 distinct agentic configurations across state-of-the-art large language models. We examine four critical agentic system dimensions: orchestration strategy, agent prompt implementation (ReAct versus function calling), memory architecture, and thinking tool integration. Our benchmark reveals significant model-specific architectural preferences that challenge the prevalent one-size-fits-all paradigm in agentic AI systems. It also reveals significant weaknesses in overall agentic performance on enterprise tasks, with the highest-scoring models achieving a maximum of only 35.3% success on the more complex task and 70.8% on the simpler task. We hope these findings inform the design of future agentic systems by enabling more empirically backed decisions regarding architectural components and model selection.

中文标题/摘要

标题：AgentArch：评估企业中代理架构的全面基准

尽管代理架构的各个组件已被单独研究，但对不同设计维度在复杂多代理系统中的相互作用仍缺乏有限的实证理解。本研究旨在通过提供一个针对18种不同代理配置的全面企业特定基准来填补这些空白，这些配置涵盖了最先进的大型语言模型。我们考察了四个关键的代理系统维度：编排策略、代理提示实现（ReAct与函数调用）、记忆架构以及思维工具集成。我们的基准揭示了显著的模型特定架构偏好，挑战了代理AI系统中普遍适用的一刀切范式。它还揭示了代理整体性能在企业任务中的显著弱点，最高得分为35.3%的成功率在更复杂的任务中，而在更简单的任务中为70.8%。我们希望这些发现能够通过使关于架构组件和模型选择的决策更具实证支持来指导未来代理系统的开发。

Summary / 总结

This study provides an enterprise-specific benchmark for evaluating how different design dimensions interact in complex multi-agent systems. It examines 18 distinct agentic configurations across state-of-the-art large language models along four dimensions: orchestration strategy, agent prompt implementation (ReAct versus function calling), memory architecture, and thinking tool integration. The benchmark reveals significant model-specific architectural preferences and shows that even the highest-scoring models reach at most 35.3% success on the more complex enterprise task and 70.8% on the simpler one.

Multi-RADS Synthetic Radiology Report Dataset and Head-to-Head Benchmarking of 41 Open-Weight and Proprietary Language Models

Authors: Kartik Bose, Abhinandan Kumar, Raghuraman Soundararajan, Priya Mudgil, Samonee Ralmilay, Niharika Dutta, Manphool Singhal, Arun Kumar, Saugata Sen, Anurima Patra, Priya Ghosh, Abanti Das, Amit Gupta, Ashish Verma, Dipin Sudhakaran, Ekta Dhamija, Himangi Unde, Ishan Kumar, Krithika Rangarajan, Prerna Garg, Rachel Sequeira, Sudhin Shylendran, Taruna Yadav, Tej Pal, Pankaj Gupta

First: 2026-01-06T18:18:44+00:00 · Latest: 2026-01-06T18:18:44+00:00

Abstract

Background: Reporting and Data Systems (RADS) standardize radiology risk communication but automated RADS assignment from narrative reports is challenging because of guideline complexity, output-format constraints, and limited benchmarking across RADS frameworks and model sizes.

Purpose: To create RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and compare validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment.

Materials and Methods: RXL-RADSet contains 1,600 synthetic radiology reports across 10 RADS (BI-RADS, CAD-RADS, GB-RADS, LI-RADS, Lung-RADS, NI-RADS, O-RADS, PI-RADS, TI-RADS, VI-RADS) and multiple modalities. Reports were generated by LLMs using scenario plans and simulated radiologist styles and underwent two-stage radiologist verification. We evaluated 41 quantized SLMs (12 families, 0.135-32B parameters) and GPT-5.2 under a fixed guided prompt. Primary endpoints were validity and accuracy; a secondary analysis compared guided versus zero-shot prompting.

Results: Under guided prompting GPT-5.2 achieved 99.8% validity and 81.1% accuracy (1,600 predictions). Pooled SLMs (65,600 predictions) achieved 96.8% validity and 61.1% accuracy; top SLMs in the 20-32B range reached ~99% validity and mid-to-high 70% accuracy. Performance scaled with model size (inflection between <1B and >=10B) and declined with RADS complexity primarily due to classification difficulty rather than invalid outputs. Guided prompting improved validity (99.2% vs 96.7%) and accuracy (78.5% vs 69.6%) compared with zero-shot.

Conclusion: RXL-RADSet provides a radiologist-verified multi-RADS benchmark; large SLMs (20-32B) can approach proprietary-model performance under guided prompting, but gaps remain for higher-complexity schemes.

Summary / 总结

The study creates RXL-RADSet, a radiologist-verified synthetic multi-RADS benchmark, and uses it to compare the validity and accuracy of open-weight small language models (SLMs) with a proprietary model for RADS assignment. The dataset contains 1,600 synthetic radiology reports across 10 RADS frameworks and multiple modalities. Across 41 quantized SLMs and GPT-5.2 under guided prompting, GPT-5.2 achieved 99.8% validity and 81.1% accuracy, while the pooled SLMs reached 96.8% validity and 61.1% accuracy. Performance improved with model size, and guided prompting improved both validity and accuracy over zero-shot prompting, but gaps remained for higher-complexity RADS schemes.
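The prediction counts quoted in the Multi-RADS abstract can be checked with a few lines of arithmetic. The sketch below only re-derives figures from numbers stated in the abstract. The absolute counts are approximations, because the published percentages are rounded, and the helper name `approx_count` is hypothetical.

```python
REPORTS = 1_600   # synthetic reports in RXL-RADSet (from the abstract)
NUM_SLMS = 41     # quantized small language models evaluated (from the abstract)

# Each SLM labels every report once, which gives the pooled prediction count.
POOLED_PREDICTIONS = REPORTS * NUM_SLMS


def approx_count(rate_percent, total):
    """Approximate absolute count implied by a rounded percentage."""
    return round(rate_percent / 100 * total)


# Approximate counts of correct predictions implied by the reported accuracies.
GPT_CORRECT = approx_count(81.1, REPORTS)
SLM_CORRECT = approx_count(61.1, POOLED_PREDICTIONS)
```

The product of 41 models and 1,600 reports matches the 65,600 pooled predictions stated in the abstract.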

The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization

Authors: Ruixing Zhang, Zihan Liu, Leilei Sun, Tongyu Zhu, Weifeng Lv

First: 2026-01-06T18:13:24+00:00 · Latest: 2026-01-06T18:13:24+00:00

Abstract

Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric that quantifies the informativeness of each recording, yielding 1,444 curated audio clips. Evaluations of 16 ALMs show that audio geo-localization capability has emerged in ALMs. We find that closed-source models substantially outperform open-source models, and that linguistic clues often dominate as a scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may advance ALMs with better geospatial reasoning capability.

中文标题/摘要

标题：声纳时刻：音频语言模型在音频地理定位中的基准测试

地理定位旨在推断给定信号的地理来源。在计算机视觉中，地理定位已成为对组合推理能力的严苛基准测试，并与公共安全相关。相比之下，由于缺乏高质量的音频-位置配对，音频地理定位的进步受到限制。为了解决这一差距，我们引入了AGL1K，这是第一个面向音频语言模型（ALMs）的音频地理定位基准，覆盖了72个国家和地区。为了从众包平台中提取可靠可地理定位的样本，我们提出了音频地理定位度量，该度量量化了每个录音的信息量，生成了1,444个精选音频片段。对16个ALMs的评估显示，ALMs已经具备了音频地理定位的能力。我们发现，闭源模型显著优于开源模型，语言线索往往成为预测的主要支撑。我们进一步分析了ALMs的推理轨迹、区域偏见、错误原因以及地理定位度量的可解释性。总体而言，AGL1K为音频地理定位建立了基准，并可能促进具有更好地理空间推理能力的ALMs的发展。

Summary / 总结

The paper addresses the lack of high-quality audio-location pairs by introducing AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), comprising 1,444 curated audio clips from 72 countries and territories. An evaluation of 16 ALMs finds that closed-source models substantially outperform open-source ones and that linguistic clues often dominate predictions. The paper also analyzes reasoning traces, regional bias, and error causes, which may help develop ALMs with better geospatial reasoning.

The Fake Friend Dilemma: Trust and the Political Economy of Conversational AI

Authors: Jacob Erickson

First: 2026-01-06T18:07:52+00:00 · Latest: 2026-01-06T18:07:52+00:00

Comments: Manuscript under review

Abstract

As conversational AI systems become increasingly integrated into everyday life, they raise pressing concerns about user autonomy, trust, and the commercial interests that influence their behavior. To address these concerns, this paper develops the Fake Friend Dilemma (FFD), a sociotechnical condition in which users place trust in AI agents that appear supportive while pursuing goals that are misaligned with the user's own. The FFD provides a critical framework for examining how anthropomorphic AI systems facilitate subtle forms of manipulation and exploitation. Drawing on literature in trust, AI alignment, and surveillance capitalism, we construct a typology of harms, including covert advertising, political propaganda, behavioral nudging, and surveillance. We then assess possible mitigation strategies, including both structural and technical interventions. By focusing on trust as a vector of asymmetrical power, the FFD offers a lens for understanding how AI systems may undermine user autonomy while maintaining the appearance of helpfulness.

中文标题/摘要

标题：假朋友困境：信任与对话型AI的政治经济

随着对话型AI系统越来越多地融入日常生活，它们引发了关于用户自主权、信任以及影响其行为的商业利益的紧迫关切。为应对这些关切，本文提出了假朋友困境（FFD）这一社会技术条件，即用户对看似支持自己的AI代理产生信任，而这些代理的目标与用户自身的目标不一致。FFD提供了一个批判性的框架，用于考察拟人化AI系统如何促进微妙形式的操控和剥削。基于信任、AI对齐和监视资本主义的相关文献，我们构建了包括隐蔽广告、政治宣传、行为引导和监视在内的危害类型。然后评估了可能的缓解策略，包括结构和技术干预。通过将信任视为不对称权力的向量，FFD提供了一个理解AI系统如何在保持乐于助人的表象的同时削弱用户自主权的视角。

Summary / 总结

This paper addresses concerns about user trust and autonomy in conversational AI by introducing the Fake Friend Dilemma (FFD), a sociotechnical condition in which users trust AI agents that appear supportive while pursuing goals misaligned with the user's own. The FFD framework examines how anthropomorphic AI can subtly manipulate and exploit users. The paper constructs a typology of harms, including covert advertising, political propaganda, behavioral nudging, and surveillance, and assesses structural and technical mitigation strategies, treating trust as a vector of asymmetrical power.

MalruleLib: Large-Scale Executable Misconception Reasoning with Step Traces for Modeling Student Thinking in Mathematics

Authors: Xinghe Chen, Naiming Liu, Shashank Sonkar

First: 2026-01-06T17:59:37+00:00 · Latest: 2026-01-06T17:59:37+00:00

Abstract

Student mistakes in mathematics are often systematic: a learner applies a coherent but wrong procedure and repeats it across contexts. We introduce MalruleLib, a learning-science-grounded framework that translates documented misconceptions into executable procedures, drawing on 67 learning-science and mathematics education sources, and generates step-by-step traces of malrule-consistent student work. We formalize a core student-modeling problem as Malrule Reasoning Accuracy (MRA): infer a misconception from one worked mistake and predict the student's next answer under cross-template rephrasing. Across nine language models (4B-120B), accuracy drops from 66% on direct problem solving to 40% on cross-template misconception prediction. MalruleLib encodes 101 malrules over 498 parameterized problem templates and produces paired dual-path traces for both correct reasoning and malrule-consistent student reasoning. Because malrules are executable and templates are parameterizable, MalruleLib can generate over one million instances, enabling scalable supervision and controlled evaluation. Using MalruleLib, we observe cross-template degradations of 10-21%, while providing student step traces improves prediction by 3-15%. We release MalruleLib as infrastructure for educational AI that models student procedures across contexts, enabling diagnosis and feedback that targets the underlying misconception.

中文标题/摘要

标题：MalruleLib：大规模可执行误解推理框架及其在数学中建模学生思维的应用

数学中的学生错误往往是系统性的：学习者应用一种连贯但错误的程序，并在不同情境中重复使用。我们引入了MalruleLib，这是一种基于学习科学的框架，将记录的误解转化为可执行的程序，借鉴了67份学习科学和数学教育资料，并生成了与误解一致的学生工作的逐步步骤。我们将核心的学生建模问题形式化为误解推理准确性（MRA）：从一个已解决的错误中推断出误解，并预测在跨模板重述下学生的下一个答案。在九种语言模型（4B-120B）中，直接问题解决的准确性从66%下降到跨模板误解预测的40%。MalruleLib 编码了101种误解，覆盖了498个参数化问题模板，并为正确推理和误解一致的学生推理生成了配对的双路径轨迹。由于误解是可执行的，模板是可参数化的，MalruleLib 可以生成超过一百万实例，从而实现大规模监督和可控评估。使用MalruleLib，我们观察到跨模板的降级幅度为10-21%，而提供学生步骤轨迹可以提高预测3-15%。我们以MalruleLib 作为教育AI的基础架构，使其能够跨情境建模学生程序，从而实现针对潜在误解的诊断和反馈。

Summary / 总结

MalruleLib is a framework that translates documented misconceptions into executable procedures to model student thinking in mathematics. It generates step-by-step traces of malrule-consistent student work and formalizes a core student-modeling problem as Malrule Reasoning Accuracy (MRA). Across nine language models, accuracy drops from 66% on direct problem solving to 40% on cross-template misconception prediction. MalruleLib encodes 101 malrules over 498 parameterized problem templates and can generate over one million instances, enabling scalable supervision and controlled evaluation. The authors observe cross-template degradations of 10-21%, while providing student step traces improves prediction by 3-15%.
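As an illustration of what an executable misconception with paired traces might look like, here is a minimal sketch. It is not taken from MalruleLib: the chosen malrule (adding numerators and denominators separately when adding fractions) is a commonly documented error picked for illustration, and the function names `correct_add`, `malrule_add`, and `dual_path` are hypothetical.

```python
from fractions import Fraction


def correct_add(a, b, c, d):
    """Correct path: a/b + c/d via a common denominator, with a step trace."""
    num, den = a * d + c * b, b * d
    steps = [
        f"common denominator: {b}*{d} = {den}",
        f"numerator: {a}*{d} + {c}*{b} = {num}",
    ]
    return Fraction(num, den), steps


def malrule_add(a, b, c, d):
    """Malrule path: add numerators and denominators separately."""
    num, den = a + c, b + d
    steps = [f"numerators: {a}+{c} = {num}", f"denominators: {b}+{d} = {den}"]
    # Fraction reduces automatically; a real student answer might stay unreduced.
    return Fraction(num, den), steps


def dual_path(a, b, c, d):
    """Return paired traces for the same parameterized problem a/b + c/d."""
    right, right_steps = correct_add(a, b, c, d)
    wrong, wrong_steps = malrule_add(a, b, c, d)
    return {"correct": (right, right_steps), "malrule": (wrong, wrong_steps)}
```

Because the malrule is a function of the problem parameters, the same wrong procedure can be replayed on any new instance, which is what makes predicting a student's next answer a well-defined task.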

Adapting Web Agents with Synthetic Supervision

Authors: Zhaoyang Wang, Yiming Liang, Xuchao Zhang, Qianhui Wu, Siwei Han, Anson Bastos, Rujia Wang, Chetan Bansal, Baolin Peng, Jianfeng Gao, Saravan Rajmohan, Huaxiu Yao

First: 2025-11-08T18:45:33+00:00 · Latest: 2026-01-06T17:55:17+00:00

Comments: 21 pages, 6 figures

Abstract

Web agents struggle to adapt to new websites due to the scarcity of environment-specific tasks and demonstrations. Recent works have explored synthetic data generation to address this challenge; however, they suffer from data quality issues where synthesized tasks contain hallucinations that cannot be executed, and collected trajectories are noisy with redundant or misaligned actions. In this paper, we propose SynthAgent, a fully synthetic supervision framework that aims to improve synthetic data quality via dual refinement of both tasks and trajectories. Our approach begins by synthesizing diverse tasks through categorized exploration of web elements, ensuring efficient coverage of the target environment. During trajectory collection, tasks are refined only when conflicts with observations are detected, which mitigates hallucinations while preserving task consistency. After collection, we conduct trajectory refinement with global context to mitigate potential noise or misalignments. Finally, we fine-tune open-source web agents on the refined synthetic data to adapt them to the target environment. Experimental results demonstrate that SynthAgent outperforms existing synthetic data methods, validating the importance of high-quality synthetic supervision. The code is publicly available at https://github.com/aiming-lab/SynthAgent.

中文标题/摘要

标题：利用合成监督适应网络代理

网络代理因环境特定任务和示范稀缺而难以适应新网站。近期研究探索了合成数据生成以应对这一挑战，但这些方法存在数据质量问题，合成任务中包含无法执行的幻觉，收集的轨迹数据也存在冗余或对齐错误。本文提出了一种名为SynthAgent的完全合成监督框架，旨在通过任务和轨迹的双重精炼来提高合成数据质量。该方法首先通过分类探索网页元素来合成多样化的任务，确保高效覆盖目标环境。在轨迹收集过程中，仅在与观察结果发生冲突时才对任务进行精炼，这可以减轻幻觉现象并保持任务一致性。收集后，我们使用全局上下文对轨迹进行精炼，以减轻潜在的噪声或对齐错误。最后，我们对精炼后的合成数据进行微调，以使网络代理适应目标环境。实验结果表明，SynthAgent优于现有的合成数据方法，验证了高质量合成监督的重要性。代码可在https://github.com/aiming-lab/SynthAgent公开获取。

Summary / 总结

The paper addresses the difficulty web agents have in adapting to new websites by proposing SynthAgent, a fully synthetic supervision framework that improves data quality through dual refinement of tasks and trajectories. It synthesizes diverse tasks through categorized exploration of web elements and, during trajectory collection, refines a task only when it conflicts with observations, which mitigates hallucinations while preserving task consistency. After collection, trajectories are refined with global context to reduce noise and misalignment, and open-source web agents are fine-tuned on the refined data. Experiments show that SynthAgent outperforms existing synthetic data methods.

Critic-Guided Reinforcement Unlearning in Text-to-Image Diffusion

Authors: Mykola Vysotskyi, Zahar Kohut, Mariia Shpir, Taras Rumezhak, Volodymyr Karpiv

Venue: ICLR 2026 (under review)

First: 2026-01-06T17:52:02+00:00 · Latest: 2026-01-06T17:52:02+00:00

Comments: Preprint. Under review at ICLR 2026

Abstract

Machine unlearning in text-to-image diffusion models aims to remove targeted concepts while preserving overall utility. Prior diffusion unlearning methods typically rely on supervised weight edits or global penalties; reinforcement-learning (RL) approaches, while flexible, often optimize sparse end-of-trajectory rewards, yielding high-variance updates and weak credit assignment. We present a general RL framework for diffusion unlearning that treats denoising as a sequential decision process and introduces a timestep-aware critic with noisy-step rewards. Concretely, we train a CLIP-based reward predictor on noisy latents and use its per-step signal to compute advantage estimates for policy-gradient updates of the reverse diffusion kernel. Our algorithm is simple to implement, supports off-policy reuse, and plugs into standard text-to-image backbones. Across multiple concepts, the method achieves better or comparable forgetting to strong baselines while maintaining image quality and benign prompt fidelity; ablations show that (i) per-step critics and (ii) noisy-conditioned rewards are key to stability and effectiveness. We release code and evaluation scripts to facilitate reproducibility and future research on RL-based diffusion unlearning.

中文标题/摘要

标题：文本到图像扩散中的评论者引导强化遗忘

文本到图像扩散模型中的机器遗忘旨在移除目标概念同时保持整体实用性。先前的扩散遗忘方法通常依赖于监督权重编辑或全局惩罚；强化学习（RL）方法虽然灵活，但通常优化稀疏的轨迹末尾奖励，导致高方差更新和弱的信用分配。我们提出了一种通用的RL框架，将去噪视为一个顺序决策过程，并引入了具有噪声步奖励的时间步感知评论者。具体地，我们使用CLIP基线奖励预测器在噪声潜变量上进行训练，并使用其每步信号来计算策略梯度更新逆向扩散核的优势估计。我们的算法易于实现，支持离策重用，并可插入标准文本到图像骨干。在多个概念上，该方法在遗忘效果上优于或与强大的基线相当，同时保持图像质量和良性提示保真度；消融实验表明，（i）每步评论者和（ii）噪声条件奖励是稳定性和有效性的重要因素。我们发布了代码和评估脚本来促进可重复性和基于RL的扩散遗忘的未来研究。

Summary / 总结

The paper targets machine unlearning in text-to-image diffusion models: removing targeted concepts while preserving overall utility. It introduces a reinforcement-learning framework that treats denoising as a sequential decision process and uses a timestep-aware critic with noisy-step rewards, supplied by a CLIP-based reward predictor trained on noisy latents. Across multiple concepts, the method achieves forgetting better than or comparable to strong baselines while maintaining image quality and benign-prompt fidelity. Ablations indicate that per-step critics and noisy-conditioned rewards are key to stability and effectiveness.
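The contrast between sparse end-of-trajectory rewards and per-step rewards can be seen in a small advantage computation. The sketch below uses a generic discounted-return-minus-baseline estimator, which is a standard policy-gradient construction and not necessarily the estimator used in the paper. The reward and value numbers are made up for illustration, and the function names are hypothetical.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return-to-go for each step: G_t = r_t + gamma * G_{t+1}."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    return returns[::-1]


def advantages(rewards, values, gamma=0.99):
    """Advantage per step: return-to-go minus the critic's value estimate."""
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    returns = discounted_returns(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]


# Sparse case: only the final denoising step is rewarded (illustrative numbers).
SPARSE = advantages([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0)

# Per-step case: a reward signal is available at every noisy step.
DENSE = advantages([0.2, 0.3, 0.5], [0.0, 0.0, 0.0], gamma=1.0)
```

With a zero baseline and no discounting, the sparse trajectory assigns the same advantage to every step, so the update cannot tell which denoising step mattered. The per-step trajectory produces a different advantage at each step, which is the credit-assignment benefit the abstract attributes to noisy-step rewards.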

LVLM-Aware Multimodal Retrieval for RAG-Based Medical Diagnosis with General-Purpose Models

Authors: Nir Mazor, Tom Hope

First: 2025-08-24T15:06:20+00:00 · Latest: 2026-01-06T17:49:44+00:00

Abstract

Retrieving visual and textual information from medical literature and hospital records can enhance diagnostic accuracy for clinical image interpretation. However, multimodal retrieval-augmented diagnosis is highly challenging. We explore a lightweight mechanism for enhancing diagnostic performance of retrieval-augmented LVLMs. We train a lightweight LVLM-aware multimodal retriever, such that the retriever learns to return images and texts that guide the LVLM toward correct predictions. In our low-resource setting, we perform only lightweight fine-tuning with small amounts of data, and use only general-purpose backbone models, achieving competitive results in clinical classification and VQA tasks compared to medically pre-trained models with extensive training. In a novel analysis, we highlight a previously unexplored class of errors that we term inconsistent retrieval predictions: cases where different top-retrieved images yield different predictions for the same target. We find that these cases are challenging for all models, even for non-retrieval models, and that our retrieval optimization mechanism significantly improves these cases over standard RAG. However, our analysis also sheds light on gaps in the ability of LVLMs to utilize retrieved information for clinical predictions. Code and models available at: https://github.com/Nirmaz/JOMED.

中文标题/摘要

标题：LVLM-感知多模态检索在基于RAG的医疗诊断中的应用与通用模型

从医学文献和医院记录中检索视觉和文本信息可以提高临床图像解释的诊断准确性。然而，多模态检索增强的诊断极具挑战性。我们探索了一种轻量级机制，以增强检索增强LVLM的诊断性能。我们训练了一个轻量级的LVLM感知多模态检索器，使得检索器学会返回能够引导LVLM做出正确预测的图像和文本。在我们的低资源设置中，我们仅使用少量数据进行轻量级微调，并仅使用通用基础模型，与广泛训练的医学预训练模型相比，在临床分类和VQA任务中取得了竞争力的结果。在一项新的分析中，我们强调了一类以前未被探索的错误，我们称之为不一致的检索预测：不同检索出的图像对同一目标产生不同预测的情况。我们发现，这些情况对所有模型来说都是具有挑战性的，即使是非检索模型，而我们的检索优化机制在这些情况下显著优于标准的RAG。然而，我们的分析也揭示了LVLM在利用检索信息进行临床预测方面的能力缺口。代码和模型可在：https://github.com/Nirmaz/JOMED/ 获取。

Summary / 总结

The paper aims to improve diagnostic accuracy in clinical image interpretation by retrieving visual and textual information from medical literature and hospital records. It trains a lightweight LVLM-aware multimodal retriever that learns to return images and texts guiding the LVLM toward correct predictions. With only lightweight fine-tuning on small amounts of data and general-purpose backbones, the method achieves competitive results in clinical classification and VQA tasks, and it significantly improves on standard RAG in cases of inconsistent retrieval predictions. The analysis also reveals gaps in how LVLMs use retrieved information for clinical predictions.

Fine-tuning Small Language Models as Efficient Enterprise Search Relevance Labelers

Authors: Yue Kang, Zhuoyi Huang, Benji Schussheim, Diana Licon, Dina Atia, Shixing Cao, Jacob Danovitch, Kunho Kim, Billy Norcilien, Jonah Karpman, Mahmound Sayed, Mike Taylor, Tao Sun, Pavel Metrikov, Vipul Agarwal, Chris Quirk, Ye-Yi Wang, Nick Craswell, Irene Shaffer, Tianwei Chen, Sulaiman Vesal, Soundar Srinivasan

First: 2026-01-06T17:48:40+00:00 · Latest: 2026-01-06T17:48:40+00:00

Abstract

In enterprise search, building high-quality datasets at scale remains a central challenge due to the difficulty of acquiring labeled data. To resolve this challenge, we propose an efficient approach to fine-tune small language models (SLMs) for accurate relevance labeling, enabling high-throughput, domain-specific labeling whose quality is comparable to, or even better than, that of state-of-the-art large language models (LLMs). To overcome the lack of high-quality and accessible datasets in the enterprise domain, our method leverages synthetic data generation. Specifically, we employ an LLM to synthesize realistic enterprise queries from a seed document, apply BM25 to retrieve hard negatives, and use a teacher LLM to assign relevance scores. The resulting dataset is then distilled into an SLM, producing a compact relevance labeler. We evaluate our approach on a high-quality benchmark consisting of 923 enterprise query-document pairs annotated by trained human annotators, and show that the distilled SLM achieves agreement with human judgments on par with or better than the teacher LLM. Furthermore, our fine-tuned labeler substantially improves throughput, achieving a 17-fold increase while also being 19 times more cost-effective. This approach enables scalable and cost-effective relevance labeling for enterprise-scale retrieval applications, supporting rapid offline evaluation and iteration in real-world settings.

中文标题/摘要

标题：微调小型语言模型作为高效的企业搜索相关性标注器

在企业搜索中，由于难以获取高质量的标注数据，大规模构建高质量数据集仍然是一个核心挑战。为解决这一挑战，我们提出了一种有效的方法，通过微调小型语言模型（SLMs）来进行准确的相关性标注，从而实现高通量、领域特定的标注，其质量和最先进的大型语言模型（LLMs）相当甚至更好。为了克服企业领域中高质量和可访问数据集的缺乏，我们的方法利用合成数据生成。具体来说，我们使用LLM从种子文档生成现实的企业查询，使用BM25检索困难的负样本，并使用教师LLM分配相关性评分。生成的数据集随后被提炼成SLM，产生一个紧凑的相关性标注器。我们在一个由923个企业查询-文档对组成、由训练有素的人标注的高质量基准上评估了我们的方法，并展示了微调后的SLM在与人类判断的一致性上与教师LLM相当或更好。此外，我们的标注器显著提高了吞吐量，实现了17倍的提升，同时成本效益提高了19倍。这种方法使大规模检索应用中的相关性标注变得可扩展且成本效益高，支持在实际场景中的快速离线评估和迭代。

Summary / 总结

The paper addresses the challenge of building high-quality datasets for enterprise search by fine-tuning small language models (SLMs) for accurate relevance labeling. The method relies on synthetic data generation: an LLM synthesizes realistic enterprise queries from a seed document, BM25 retrieves hard negatives, and a teacher LLM assigns relevance scores. The resulting dataset is distilled into an SLM that serves as a compact relevance labeler. The distilled SLM agrees with human judgments on par with or better than the teacher LLM, with a 17-fold increase in throughput and 19 times better cost-effectiveness, enabling scalable relevance labeling for enterprise-scale retrieval applications.
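The hard-negative step in the pipeline above can be sketched with a small, self-contained Okapi BM25 scorer. This is an illustrative sketch and not the authors' code: the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the function names `bm25_scores` and `hard_negatives` are all assumptions made here.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the logarithm).
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def hard_negatives(query, docs, seed_index, top_k=2):
    """Top-scoring documents other than the seed the query was generated from."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != seed_index and scores[i] > 0][:top_k]
```

Documents that share query terms with the seed document but were not the source of the query score highly under BM25, which makes them lexically plausible yet potentially irrelevant candidates. In the described pipeline, a teacher LLM would then assign the actual relevance scores.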

UltraLogic: Enhancing LLM Reasoning through Large-Scale Data Synthesis and Bipolar Float Reward

Authors: Yile Liu, Yixian Liu, Zongwei Li, Yufei Huang, Xinhua Feng, Zhichao Hu, Jinglu Hu, Jianfeng Yan, Fengzong Lian, Yuhong Liu

First: 2026-01-06T17:41:32+00:00 · Latest: 2026-01-06T17:41:32+00:00

Comments: 19 pages, 6 figures, 7 tables

Abstract

While Large Language Models (LLMs) have demonstrated significant potential in natural language processing, complex general-purpose reasoning requiring multi-step logic, planning, and verification remains a critical bottleneck. Although Reinforcement Learning with Verifiable Rewards (RLVR) has succeeded in specific domains, the field lacks large-scale, high-quality, and difficulty-calibrated data for general reasoning. To address this, we propose UltraLogic, a framework that decouples the logical core of a problem from its natural language expression through a Code-based Solving methodology to automate high-quality data production. The framework comprises hundreds of unique task types and an automated calibration pipeline across ten difficulty levels. Furthermore, to mitigate binary reward sparsity and the Non-negative Reward Trap, we introduce the Bipolar Float Reward (BFR) mechanism, utilizing graded penalties to effectively distinguish perfect responses from those with logical flaws. Our experiments demonstrate that task diversity is the primary driver for reasoning enhancement, and that BFR, combined with a difficulty matching strategy, significantly improves training efficiency, guiding models toward global logical optima.

中文标题/摘要

标题：UltraLogic：通过大规模数据合成和双极浮点奖励提升LLM推理能力

尽管大型语言模型（LLMs）在自然语言处理方面展现了显著潜力，但在多步逻辑、规划和验证等复杂通用推理方面仍存在关键瓶颈。尽管可验证奖励强化学习（RLVR）在特定领域取得了成功，但该领域缺乏大规模、高质量且难度校准的数据以支持通用推理。为解决这一问题，我们提出了UltraLogic框架，该框架通过基于代码的求解方法将问题的逻辑核心与其自然语言表达分离开来，以自动化生产高质量数据。该框架包括数百种独特的任务类型，并且在十个难度级别上具有自动校准流水线。此外，为缓解二元奖励稀疏性和非负奖励陷阱，我们引入了双极浮点奖励（BFR）机制，利用分级惩罚来有效区分完美响应与逻辑错误的响应。我们的实验表明，任务多样性是推理提升的主要驱动力，而BFR与难度匹配策略相结合，显著提高了训练效率，引导模型向全局逻辑最优解发展。

Summary / 总结

UltraLogic aims to enhance LLM reasoning through large-scale, high-quality data synthesis and a Bipolar Float Reward (BFR) mechanism. The framework uses a Code-based Solving methodology to automate data production across hundreds of task types and ten difficulty levels. Experiments indicate that task diversity is the primary driver of reasoning improvement, and that BFR combined with a difficulty matching strategy significantly improves training efficiency and guides models toward global logical optima.
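The abstract does not give the BFR formula, so the sketch below is only one possible reading of "graded penalties". It assumes a correctness score in [0, 1], gives +1 to a perfect response, and maps every imperfect response to a negative value that grows less severe as the score rises. The mapping `score - 1` and all function names are hypothetical choices for illustration.

```python
def binary_reward(score):
    """Sparse baseline: only a perfect response earns any reward."""
    return 1.0 if score >= 1.0 else 0.0


def nonnegative_float_reward(score):
    """Graded but non-negative: a flawed response is still rewarded."""
    return score


def bipolar_float_reward(score):
    """Hypothetical bipolar mapping: +1 if perfect, else a penalty in [-1, 0)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    return 1.0 if score == 1.0 else score - 1.0
```

Under this reading, a nearly correct answer with score 0.9 receives a small penalty rather than a reward close to that of a perfect answer. That is one way a reward could separate perfect responses from those with logical flaws while still providing a graded signal.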

InfiAgent: An Infinite-Horizon Framework for General-Purpose Autonomous Agents

Authors: Chenglin Yu, Yuchen Wang, Songmiao Wang, Hongxia Yang, Ming Li

First: 2026-01-06T17:35:57+00:00 · Latest: 2026-01-06T17:35:57+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

LLM agents can reason and use tools, but they often break down on long-horizon tasks due to unbounded context growth and accumulated errors. Common remedies such as context compression or retrieval-augmented prompting introduce trade-offs between information fidelity and reasoning stability. We present InfiAgent, a general-purpose framework that keeps the agent's reasoning context strictly bounded regardless of task duration by externalizing persistent state into a file-centric state abstraction. At each step, the agent reconstructs context from a workspace state snapshot plus a fixed window of recent actions. Experiments on DeepResearch and an 80-paper literature review task show that, without task-specific fine-tuning, InfiAgent with a 20B open-source model is competitive with larger proprietary systems and maintains substantially higher long-horizon coverage than context-centric baselines. These results support explicit state externalization as a practical foundation for stable long-horizon agents. GitHub repo: https://github.com/ChenglinPoly/infiAgent

中文标题/摘要

标题：InfiAgent：通用自主代理的无限时域框架

LLM代理可以推理和使用工具，但在长时域任务中由于上下文增长无界和累积错误常常会失效。常见的补救措施如上下文压缩或检索增强提示会在信息保真度和推理稳定性之间引入权衡。我们提出了InfiAgent，这是一种通用框架，通过将持久状态外部化到文件中心的状态抽象，使代理的推理上下文严格保持在有限范围内，无论任务持续时间如何。在每一步，代理从工作区状态快照和最近的固定窗口动作重建上下文。在DeepResearch和80篇论文的文献回顾任务上的实验表明，无需针对特定任务进行微调，使用20B开源模型的InfiAgent与更大规模的专有系统具有竞争力，并且在长时域覆盖方面显著优于以上下文为中心的基线。这些结果支持显式状态外部化作为稳定长时域代理的实用基础。Github仓库：https://github.com/ChenglinPoly/infiAgent

Summary / 总结

InfiAgent is a framework designed to address the challenges of long-horizon tasks for LLM agents by keeping the reasoning context bounded through external state storage. It reconstructs context from a snapshot of the workspace state and a fixed window of recent actions. Experiments show that InfiAgent, using a 20B open-source model, performs competitively with larger proprietary systems and outperforms context-centric baselines in maintaining long-horizon coverage without task-specific fine-tuning. This suggests that explicit state externalization is a practical approach for stable long-horizon agents.

InfiAgent 是一个框架，旨在通过外部状态存储来解决 LLM 代理在长期任务中的挑战，保持推理上下文的边界。它通过工作空间状态的快照和最近动作的固定窗口来重构上下文。实验表明，使用 20B 开源模型的 InfiAgent 在长期任务覆盖方面与更大规模的专有系统竞争，并且在不需要任务特定微调的情况下，优于基于上下文的基线。这表明显式状态外部化是稳定长期任务代理的一个实用基础。
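
The context-reconstruction step described above (a workspace state snapshot plus a fixed window of recent actions) can be sketched roughly as follows. This is not the InfiAgent code: the class `BoundedContextAgent`, its methods, and the choice to represent the snapshot as a list of file names are hypothetical, chosen only to show why the prompt stops growing with the number of steps.

```python
from collections import deque
from pathlib import Path


class BoundedContextAgent:
    """Toy agent whose prompt is rebuilt from files plus recent actions.

    Hypothetical illustration: persistent state lives in a workspace
    directory, and only the last `window` actions are kept in memory.
    """

    def __init__(self, workspace: Path, window: int = 3) -> None:
        self.workspace = workspace
        # deque(maxlen=...) silently drops the oldest action when full.
        self.recent: deque[str] = deque(maxlen=window)

    def record(self, action: str, filename: str, content: str) -> None:
        # Externalize the result of the action into the workspace.
        (self.workspace / filename).write_text(content, encoding="utf-8")
        self.recent.append(action)

    def build_context(self) -> str:
        # Snapshot here is just the sorted file names (an assumption);
        # a real system would presumably use a richer, size-capped summary.
        files = sorted(p.name for p in self.workspace.iterdir() if p.is_file())
        snapshot = "\n".join(f"- {name}" for name in files)
        actions = "\n".join(self.recent)
        return f"Workspace files:\n{snapshot}\nRecent actions:\n{actions}"
```

After five recorded steps with `window=3`, the rebuilt context still lists every file written to the workspace but mentions only the last three actions; earlier actions survive only through the files they produced.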

The Journal of Prompt-Engineered Philosophy Or: How I Started to Track AI Assistance and Stopped Worrying About Slop

Authors: Michele Loi

First: 2025-11-10T08:56:21+00:00 · Latest: 2026-01-06T17:29:26+00:00

Comments: 44 pages (30 Article + 14 Appendix); 2 figures. Transparency material documenting LLM usage available at: https://github.com/MicheleLoi/JPEP/tree/main/transparency/Canonical_MD

Abs · PDF · Code1 · Code2 · Code3

Abstract

Academic publishing increasingly requires authors to disclose AI assistance, yet imposes reputational costs for doing so--especially when such assistance is substantial. This article analyzes that structural contradiction, showing how incentives discourage transparency in precisely the work where it matters most. Traditional venues cannot resolve this tension through policy tweaks alone, as the underlying prestige economy rewards opacity. To address this, the article proposes an alternative publishing infrastructure: a venue outside prestige systems that enforces mandatory disclosure, enables reproduction-based review, and supports ecological validity through detailed documentation. As a demonstration of this approach, the article itself is presented as an example of AI-assisted scholarship under reasonably detailed disclosure, with representative prompt logs and modification records included. Rather than taking a position for or against AI-assisted scholarship, the article outlines conditions under which such work can be evaluated on its own terms: through transparent documentation, verification-oriented review, and participation by methodologically committed scholars. While focused on AI, the framework speaks to broader questions about how academic systems handle methodological innovation.

中文标题/摘要

标题：《指令工程化哲学学报》或：我如何开始追踪AI辅助并停止担心杂乱

学术出版越来越多地要求作者披露AI辅助，但同时又对这样做施加声誉成本——尤其是在辅助作用显著的情况下。本文分析了这种结构性矛盾，展示了激励措施如何在最需要透明度的工作中抑制透明度。传统出版渠道仅通过政策调整无法解决这一紧张关系，因为潜在的声望经济奖励不透明。为解决这一问题，本文提出了一种替代的出版基础设施：一种位于声望系统之外的场所，强制披露，支持基于再现的审查，并通过详细的文档支持生态有效性。作为这一方法的示范，本文本身被呈现为合理详细披露下的AI辅助研究示例，附有代表性的LLM使用记录和修改记录。本文并未站在支持或反对AI辅助研究的立场上，而是概述了在这种研究可以独立评估的条件：通过透明的文档、验证导向的审查和方法论承诺学者的参与。虽然重点是AI，但该框架也涉及更广泛的问题，即学术系统如何处理方法论创新。

Summary / 总结

This paper explores the structural contradiction in academic publishing where there is a growing requirement for authors to disclose AI assistance, but doing so incurs reputational costs. The author proposes an alternative publishing infrastructure that enforces mandatory disclosure, supports reproduction-based review, and promotes ecological validity through detailed documentation. The article itself serves as a demonstration of AI-assisted scholarship with detailed disclosure, including prompt logs and modification records. The framework aims to evaluate AI-assisted work transparently and methodologically, addressing broader questions about academic systems handling methodological innovation.

本文探讨了学术出版中披露AI辅助要求与相关声誉成本之间的矛盾。提出了一种新的出版基础设施，强制披露、支持基于再现的审查，并通过详细记录促进生态有效性。文章本身作为AI辅助研究的示范，包括详细的披露记录、提示日志和修改记录，并概述了透明和方法论上评估此类工作的条件。

DIP: Dynamic In-Context Planner For Diffusion Language Models

Authors: Yang Li, Han Meng, Chenan Wang, Haipeng Chen

First: 2026-01-06T17:24:16+00:00 · Latest: 2026-01-06T17:24:16+00:00

Comments: 4 pages

Abs · PDF · Code1 · Code2

Abstract

Diffusion language models (DLMs) have shown strong potential for general natural language tasks with in-context examples. However, due to the bidirectional attention mechanism, DLMs incur substantial computational cost as context length increases. This work addresses this issue with a key discovery: unlike the sequential generation in autoregressive language models (ARLMs), the diffusion generation paradigm in DLMs allows efficient dynamic adjustment of the context during generation. Building on this insight, we propose Dynamic In-Context Planner (DIP), a context-optimization method that dynamically selects and inserts in-context examples during generation, rather than providing all examples in the prompt upfront. Results show DIP maintains generation quality while achieving up to 12.9× inference speedup over standard inference and 1.17× over KV cache-enhanced inference.

中文标题/摘要

标题：DIP：动态上下文规划器用于扩散语言模型

扩散语言模型（DLMs）在使用上下文示例的情况下展示了强大的通用自然语言处理潜力。然而，由于双向注意力机制，DLMs在上下文长度增加时会带来巨大的计算成本。这项工作通过一个关键发现解决了这一问题：与自回归语言模型（ARLMs）的顺序生成不同，DLMs的扩散生成范式允许在生成过程中进行高效的动态上下文调整。基于这一洞察，我们提出了动态上下文规划器（Dynamic In-Context Planner，DIP），这是一种上下文优化方法，在生成过程中动态选择和插入上下文示例，而不是在提示中一次性提供所有示例。结果显示，DIP在保持生成质量的同时，相对于标准推理实现了高达12.9×的推理加速，相对于KV缓存增强的推理实现了1.17×的加速。

Summary / 总结

This paper addresses the computational cost issue in diffusion language models (DLMs) as context length increases. It proposes DIP, a dynamic in-context planner that allows efficient adjustment of context during generation, unlike autoregressive language models. DIP dynamically selects and inserts in-context examples, leading to up to 12.9 times faster inference while maintaining generation quality. Compared to KV cache-enhanced inference, DIP shows a 1.17 times speedup.

该研究解决了扩散语言模型（DLMs）随上下文长度增加而产生的计算成本问题。提出了动态上下文规划器（DIP），允许在生成过程中高效调整上下文，不同于自回归语言模型。DIP动态选择并插入上下文示例，相比标准推理速度提升高达12.9倍，比带有KV缓存的增强推理快1.17倍，同时保持了生成质量。

Empowering Reliable Visual-Centric Instruction Following in MLLMs

Authors: Weilei He, Feng Ju, Zhiyuan Fan, Rui Min, Minhao Cheng, Yi R. Fung

First: 2026-01-06T17:23:33+00:00 · Latest: 2026-01-06T17:23:33+00:00

Comments: Submitted to ARR Jan

Abs · PDF · Code1 · Code2

Abstract

Evaluating the instruction-following (IF) capabilities of Multimodal Large Language Models (MLLMs) is essential for rigorously assessing how faithfully model outputs adhere to user-specified intentions. Nevertheless, existing benchmarks for evaluating MLLMs' instruction-following capability primarily focus on verbal instructions in the textual modality. These limitations hinder a thorough analysis of instruction-following capabilities, as they overlook the implicit constraints embedded in the semantically rich visual modality. To address this gap, we introduce VC-IFEval, a new benchmark accompanied by a systematically constructed dataset that evaluates MLLMs' instruction-following ability under multimodal settings. Our benchmark systematically incorporates vision-dependent constraints into instruction design, enabling a more rigorous and fine-grained assessment of how well MLLMs align their outputs with both visual input and textual instructions. Furthermore, by fine-tuning MLLMs on our dataset, we achieve substantial gains in visual instruction-following accuracy and adherence. Through extensive evaluation across representative MLLMs, we provide new insights into the strengths and limitations of current models.

中文标题/摘要

标题：赋能MLLMs中可靠的以视觉为中心的指令遵循

评估多模态大型语言模型（MLLMs）的指令遵循（IF）能力对于严格评估模型输出与用户指定意图的一致性至关重要。然而，现有的MLLMs指令遵循能力评估基准主要集中在文本模态中的口头指令。这些限制阻碍了对指令遵循能力的全面分析，因为它们忽略了嵌入在语义丰富的视觉模态中的隐式约束。为解决这一问题，我们引入了VC-IFEval，这是一个新的基准，附带一个系统构建的数据集，用于评估MLLMs在多模态设置下的指令遵循能力。我们的基准系统地将视觉依赖性约束纳入指令设计中，使我们能够更严格和细致地评估MLLMs如何与其视觉输入和文本指令对齐。此外，通过在我们的数据集上微调MLLMs，我们在视觉指令遵循的准确性和一致性方面取得了显著的提升。通过在代表性MLLMs上的广泛评估，我们提供了关于当前模型优势和局限性的新见解。

Summary / 总结

The research aims to evaluate the instruction-following capabilities of MLLMs by introducing VC-IFEval, a new benchmark that includes a systematically constructed dataset focusing on multimodal settings. This method addresses the limitations of existing benchmarks that primarily use textual instructions, by incorporating vision-dependent constraints. Key findings show that fine-tuning MLLMs on this dataset improves visual instruction-following accuracy and adherence, providing new insights into the strengths and limitations of current models.

研究旨在通过解决现有基准主要关注文本指令的局限性，更全面地评估多模态大型语言模型（MLLMs）的指令遵循能力。研究引入了VC-IFEval，一个包含系统构建数据集的新基准，用于在多模态设置下评估MLLMs的能力，同时融入视觉依赖性约束。实验结果显示，通过对该数据集进行微调，MLLMs在视觉指令遵循的准确性和一致性方面取得了显著提升，提供了关于模型优缺点的新见解。

X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

Authors: Mohammad Zia Ur Rehman, Sai Kartheek Reddy Kasu, Shashivardhan Reddy Koppula, Sai Rithwik Reddy Chirra, Shwetank Shekhar Singh, Nagendra Kumar

Venue: AAAI 2026

First: 2026-01-06T17:16:45+00:00 · Latest: 2026-01-06T17:16:45+00:00

Comments: Accepted in the proceedings of AAAI 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Hate speech detection on social media faces challenges in both accuracy and explainability, especially for underexplored Indic languages. We propose a novel explainability-guided training framework, X-MuTeST (eXplainable Multilingual haTe Speech deTection), for hate speech detection that combines high-level semantic reasoning from large language models (LLMs) with traditional attention-enhancing techniques. We extend this research to Hindi and Telugu alongside English by providing benchmark human-annotated rationales for each word to justify the assigned class label. The X-MuTeST explainability method computes the difference between the prediction probabilities of the original text and those of unigrams, bigrams, and trigrams. Final explanations are computed as the union between LLM explanations and X-MuTeST explanations. We show that leveraging human rationales during training enhances both classification performance and explainability. Moreover, combining human rationales with our explainability method to refine the model attention yields further improvements. We evaluate explainability using Plausibility metrics such as Token-F1 and IOU-F1 and Faithfulness metrics such as Comprehensiveness and Sufficiency. By focusing on under-resourced languages, our work advances hate speech detection across diverse linguistic contexts. Our dataset includes token-level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Data and code are available on https://github.com/ziarehman30/X-MuTeST

中文标题/摘要

标题：X-MuTeST：一种可解释的多语言仇恨言论检测基准及新型LLM咨询解释框架

社交媒体上的仇恨言论检测面临着准确性和可解释性的双重挑战，尤其是在未充分探索的印度语言方面。我们提出了一种新的可解释性指导训练框架X-MuTeST（可解释的多语言仇恨言论检测），该框架结合了大型语言模型（LLM）的高层次语义推理与传统的注意力增强技术。我们通过为每个单词提供基准的人工标注理由，将这项研究扩展到印地语、泰卢固语和英语。X-MuTeST的可解释性方法计算原始文本的预测概率与一元、二元和三元词组（unigram、bigram、trigram）的预测概率之间的差异。最终解释是LLM解释与X-MuTeST解释的并集。我们展示了在训练过程中利用人工理由可以提高分类性能和可解释性。此外，将人工理由与我们的可解释性方法结合以细化模型注意力可以进一步提高性能。我们使用合理性（Plausibility）度量（如Token-F1和IOU-F1）和忠实度（Faithfulness）度量（如全面性和充分性）来评估可解释性。通过关注资源不足的语言，我们的工作促进了仇恨言论检测在多种语言环境中的发展。我们的数据集包括6,004个印地语、4,492个泰卢固语和6,334个英语样本的标记级别理由注释。数据和代码可在 https://github.com/ziarehman30/X-MuTeST 获取。

Summary / 总结

The research aims to improve the accuracy and explainability of hate speech detection, particularly for underexplored Indic languages like Hindi and Telugu. It introduces X-MuTeST, an explainability-guided training framework that combines LLMs and traditional attention-enhancing techniques. The study shows that using human-annotated rationales enhances both classification performance and explainability, and combining these with the X-MuTeST method further improves the model's attention. The dataset includes 6,004 Hindi, 4,492 Telugu, and 6,334 English samples with token-level rationale annotations, evaluated using Plausibility and Faithfulness metrics.

研究旨在提高仇恨言论检测的准确性和可解释性，特别是对于尚未充分探索的印度语言（如印地语和泰卢固语）。研究引入了X-MuTeST，这是一种结合了大型语言模型和传统注意力增强技术的新型可解释性指导训练框架。研究表明，在训练过程中使用人类标注的理由可以提高分类性能和可解释性。将这些理由与可解释性方法结合使用可以进一步提高模型的注意力。数据集包括6,004个印地语、4,492个泰卢固语和6,334个英语样本的标记级别理由注释，评估使用合理性（Plausibility）和忠实度（Faithfulness）指标。
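
Two steps named in the abstract are easy to make concrete: the final explanation is the union of the LLM explanation and the X-MuTeST explanation, and plausibility is scored with Token-F1 against human rationales. The sketch below assumes explanations are represented as sets of token indices, and the example indices are invented for illustration. `token_f1` is the standard set-overlap F1, not code from the authors' repository.

```python
def token_f1(predicted: set[int], gold: set[int]) -> float:
    """Token-level F1 between predicted and human rationale token indices."""
    overlap = len(predicted & gold)
    if overlap == 0:
        # Covers empty inputs and disjoint sets; treated as 0 here by convention.
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative token indices (made up for this example).
llm_explanation = {1, 2}      # tokens highlighted by the LLM
ngram_explanation = {2, 4}    # tokens highlighted by the n-gram method
human_rationale = {2, 4, 5}   # tokens marked by annotators

# Final explanation: union of the two sources, as stated in the abstract.
final_explanation = llm_explanation | ngram_explanation  # {1, 2, 4}

score = token_f1(final_explanation, human_rationale)  # precision = recall = 2/3
```

Here the union recovers token 4, which the LLM explanation alone misses, at the cost of keeping token 1, which annotators did not mark.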

D^3ETOR: Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing for Weakly-Supervised Camouflaged Object Detection with Scribble Annotations

Authors: Jiawei Ge, Jiuxin Cao, Xinyi Li, Xuelin Zhu, Chang Liu, Bo Liu, Chen Feng, Ioannis Patras

First: 2025-12-23T11:16:16+00:00 · Latest: 2026-01-06T17:16:18+00:00

Abs · PDF · Code1 · Code2

Abstract

Weakly-Supervised Camouflaged Object Detection (WSCOD) aims to locate and segment objects that are visually concealed within their surrounding scenes, relying solely on sparse supervision such as scribble annotations. Despite recent progress, existing WSCOD methods still lag far behind fully supervised ones due to two major limitations: (1) the pseudo masks generated by general-purpose segmentation models (e.g., SAM) and filtered via rules are often unreliable, as these models lack the task-specific semantic understanding required for effective pseudo labeling in COD; and (2) the neglect of inherent annotation bias in scribbles, which hinders the model from capturing the global structure of camouflaged objects. To overcome these challenges, we propose D^3ETOR, a two-stage WSCOD framework consisting of Debate-Enhanced Pseudo Labeling and Frequency-Aware Progressive Debiasing. In the first stage, we introduce an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to enhance the capability of SAM for COD, improving the interpretability and precision of pseudo masks. In the second stage, we design FADeNet, which progressively fuses multi-level frequency-aware features to balance global semantic understanding with local detail modeling, while dynamically reweighting supervision strength across regions to alleviate scribble bias. By jointly exploiting the supervision signals from both the pseudo masks and scribble semantics, D^3ETOR significantly narrows the gap between weakly and fully supervised COD, achieving state-of-the-art performance on multiple benchmarks.

中文标题/摘要

标题：D^3ETOR：辩论增强的伪标签生成和频率感知渐进去偏见方法在带有涂鸦注释的弱监督伪装目标检测中的应用

弱监督伪装目标检测（WSCOD）旨在定位并分割其周围场景中视觉上隐蔽的目标，仅依赖稀疏监督，如涂鸦注释。尽管取得了进展，但现有WSCOD方法仍远落后于完全监督方法，主要由于两个限制：（1）由通用分割模型（如SAM）生成的伪掩码，通过规则过滤后往往不可靠，因为这些模型缺乏COD所需的特定语义理解；（2）忽视涂鸦注释中的固有偏差，阻碍模型捕捉伪装目标的全局结构。为克服这些挑战，我们提出D^3ETOR，一种两阶段WSCOD框架，包括辩论增强的伪标签生成和频率感知渐进去偏见。在第一阶段，我们引入自适应熵驱动的点采样方法和多智能体辩论机制，增强SAM的伪装目标检测能力，提高伪掩码的可解释性和精度。在第二阶段，我们设计FADeNet，通过逐步融合多级频率感知特征，平衡全局语义理解与局部细节建模，同时动态调整监督强度以缓解涂鸦偏差。通过同时利用伪掩码和涂鸦语义的监督信号，D^3ETOR显著缩小了弱监督与完全监督目标检测之间的差距，实现了多个基准上的最佳性能。

Summary / 总结

The paper addresses the challenges in Weakly-Supervised Camouflaged Object Detection (WSCOD) by proposing D^3ETOR, a two-stage framework. The first stage enhances pseudo labeling using an adaptive entropy-driven point sampling method and a multi-agent debate mechanism to improve the interpretability and precision of pseudo masks. The second stage introduces FADeNet, which fuses multi-level frequency-aware features to balance global and local details while dynamically reweighting supervision strength to reduce annotation bias. Experimental results show that D^3ETOR outperforms existing methods and achieves state-of-the-art performance on multiple benchmarks.

论文提出了一种两阶段框架D^3ETOR来解决弱监督伪装目标检测（WSCOD）的挑战。第一阶段通过自适应熵驱动的点采样方法和多智能体辩论机制增强伪标签，提高伪掩码的可靠性和可解释性。第二阶段引入了FADeNet，通过融合多级频率感知特征来平衡全局语义理解和局部细节建模，并动态调整监督强度以减轻注释偏差。实验结果表明，D^3ETOR在多个基准上超过了现有方法，达到了最先进的性能。
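
The abstract names an entropy-driven point sampling step but does not describe it, so the sketch below is one plausible reading rather than the paper's method. It assumes a per-pixel foreground probability map is available and picks the pixels with the highest binary entropy as candidate point prompts. The helper names and the top-k rule are hypothetical, and the adaptive part of the authors' scheme is not modeled.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) prediction; 0 at p = 0 or p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def sample_uncertain_points(
    prob_map: list[list[float]], k: int
) -> list[tuple[int, int]]:
    """Return (row, col) of the k most uncertain pixels in a probability map.

    Assumption for illustration: uncertainty is measured by binary entropy
    and the top-k pixels are selected deterministically.
    """
    scored = [
        (binary_entropy(p), r, c)
        for r, row in enumerate(prob_map)
        for c, p in enumerate(row)
    ]
    # Highest entropy first; ties keep row-major order (sort is stable).
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(r, c) for _, r, c in scored[:k]]
```

For a 2×2 map `[[0.99, 0.5], [0.1, 0.6]]` and `k=2`, the selected points are the pixels with probabilities 0.5 and 0.6, which are the ones closest to the decision boundary.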

UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

Authors: Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao

First: 2026-01-06T17:15:50+00:00 · Latest: 2026-01-06T17:15:50+00:00

Abs · PDF · Code1 · Code2

Abstract

While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles: Proposer, Solver, and Judge, UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text to Image to Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF(73.8), DPG(86.8), CompBench(88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.

中文标题/摘要

标题：UniCorn：通过自我生成的监督实现自我提升的统一多模态模型

虽然统一多模态模型（UMMs）在跨模态理解方面取得了显著成功，但在利用这种内部知识进行高质量生成方面仍存在显著差距。我们将这种差异形式化为传导失语症，这是一种现象，即模型能够准确解释多模态输入，但在将其理解转化为忠实且可控的合成方面却存在困难。为了解决这一问题，我们提出了UniCorn，这是一种简单而优雅的自我提升框架，无需外部数据或教师监督。通过将单一UMM划分为三个协作角色：提案者、解决者和裁判，UniCorn通过自我对弈生成高质量的交互，并利用认知模式重构将潜在理解提炼为明确的生成信号。为了验证多模态一致性的恢复，我们引入了基于文本到图像再到文本重建循环的UniCycle循环一致性基准。广泛的实验表明，UniCorn在六个通用图像生成基准上实现了全面和显著的改进。值得注意的是，它在TIIF（73.8）、DPG（86.8）、CompBench（88.5）和UniCycle上达到了SOTA性能，并进一步在WISE上实现了+5.0的显著提升，在OneIG上实现了+6.5的显著提升。这些结果表明，我们的方法显著增强了T2I生成能力，同时保持了稳健的理解能力，展示了统一多模态智能完全自我监督改进的可扩展性。

Summary / 总结

UniCorn addresses the gap between comprehension and generation in Unified Multimodal Models (UMMs) with a self-improvement framework that needs no external data or teacher supervision. It partitions a single UMM into Proposer, Solver, and Judge roles, generates training interactions through self-play, and uses cognitive pattern reconstruction to turn latent understanding into explicit generative signals. Experiments show improvements over the base model across six general image generation benchmarks, with state-of-the-art results on TIIF, DPG, CompBench, and UniCycle and gains of +5.0 on WISE and +6.5 on OneIG, while comprehension remains robust.

UniCorn通过提出一个自我改进框架，使用自我生成的监督来解决统一多模态模型（UMMs）的问题，无需外部数据。它将UMM划分为提案者、解决者和裁判者角色，生成高质量的互动，并通过认知模式重构进行提炼。实验表明，UniCorn在六个基准测试中提高了性能，实现了在TIIF、DPG和CompBench上的SOTA结果，并在WISE和OneIG上取得了显著的提升，证明了完全自我监督改进的可扩展性。
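
The Text to Image to Text loop behind UniCycle can be illustrated with a small harness. Everything here is a placeholder: `text_to_image` and `image_to_text` stand in for the model's generation and captioning calls, and word-level Jaccard overlap is used only as a simple, assumed similarity. The abstract does not say how UniCycle scores the reconstructed text.

```python
from typing import Any, Callable


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings (case-insensitive)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def cycle_score(
    prompt: str,
    text_to_image: Callable[[str], Any],
    image_to_text: Callable[[Any], str],
    similarity: Callable[[str, str], float] = jaccard,
) -> float:
    """Run prompt -> image -> text and compare the result with the prompt."""
    image = text_to_image(prompt)
    recovered = image_to_text(image)
    return similarity(prompt, recovered)
```

With a lossless stub pair the score is 1.0. A captioner that drops part of the prompt lowers it, which is the kind of degradation a cycle-consistency benchmark is meant to surface.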