arXiv Paper Digest

Snapshot: 20260129_0331

Evaluation of Oncotimia: An LLM based system for supporting tumour boards

Authors: Luis Lorenzo, Marcos Montana-Mendez, Sergio Figueiras, Miguel Boubeta, Cristobal Bernardo-Castineira

First: 2026-01-27T18:59:38+00:00 · Latest: 2026-01-27T18:59:38+00:00

Comments: 9 pages, 2 figures

Abstract

Multidisciplinary tumour boards (MDTBs) play a central role in oncology decision-making but require manual processing and structuring of large volumes of heterogeneous clinical information, resulting in a substantial documentation burden. In this work, we present ONCOTIMIA, a modular and secure clinical tool designed to integrate generative artificial intelligence (GenAI) into oncology workflows, and evaluate its application to the automatic completion of lung cancer tumour board forms using large language models (LLMs). The system combines a multi-layer data lake, hybrid relational and vector storage, retrieval-augmented generation (RAG) and a rule-driven adaptive form model to transform unstructured clinical documentation into structured and standardised tumour board records. We assess the performance of six LLMs deployed through AWS Bedrock on ten lung cancer cases, measuring both form completion accuracy and end-to-end latency. The results demonstrate high performance across models, with the best-performing configuration achieving 80% correct field completion and most LLMs showing clinically acceptable response times. Larger and more recent models exhibit the best accuracy without incurring prohibitive latency. These findings provide empirical evidence that LLM-assisted form autocompletion is technically feasible and operationally viable in multidisciplinary lung cancer workflows, and support its potential to significantly reduce documentation burden while preserving data quality.

中文标题/摘要

标题：Oncotimia的评估：基于LLM的肿瘤委员会支持系统

多学科肿瘤委员会（MDTBs）在肿瘤学决策中发挥着核心作用，但需要人工处理并结构化大量异质临床信息，导致大量的文档负担。在本工作中，我们介绍了ONCOTIMIA，一个模块化和安全的临床工具，旨在将生成型人工智能（GenAI）集成到肿瘤学工作流程中，并评估其在使用大型语言模型（LLMs）自动完成肺癌肿瘤委员会表格中的应用。该系统结合了多层数据湖、混合关系和向量存储、检索增强生成（RAG）和基于规则的自适应表单模型，将非结构化的临床文档转换为结构化和标准化的肿瘤委员会记录。我们通过AWS Bedrock部署了六种LLM，并在十例肺癌病例上评估了其性能，测量了表格完成的准确性和端到端的延迟。结果表明，模型性能很高，最佳配置实现了80%的正确字段完成，并且大多数LLM具有临床可接受的响应时间。较大的、更近期的模型在不增加不可接受的延迟的情况下表现出最高的准确性。这些发现提供了实证证据，证明LLM辅助的自动完成表单在多学科肺癌工作流程中是技术上可行和操作上可行的，并支持其在显著减少文档负担的同时保持数据质量的潜力。

Summary / 总结

The study evaluates ONCOTIMIA, a system using large language models (LLMs) to support multidisciplinary tumour boards in oncology, focusing on automatic completion of lung cancer tumour board forms. The system integrates a multi-layer data lake, hybrid storage, and a rule-driven form model to transform unstructured clinical data into structured records. Six LLMs were tested on ten lung cancer cases, achieving up to 80% correct field completion with clinically acceptable latency, suggesting LLM-assisted autocompletion is feasible and viable for reducing documentation burden while maintaining data quality.

本研究评估了使用大型语言模型自动完成肺癌肿瘤会议表单的ONCOTIMIA系统。该系统结合了多层数据湖、RAG和基于规则的表单模型，将非结构化的临床文档转换为结构化的记录。六个LLM在十个肺癌病例上进行了测试，最高正确字段完成率达到80%，且具有临床可接受的响应时间，尤其是对于更大和更新的模型。这表明LLM辅助的自动完成表单在肿瘤学工作流程中是技术上可行和操作上可行的，有可能减少文档负担同时保持数据质量。

DuwatBench: Bridging Language and Visual Heritage through an Arabic Calligraphy Benchmark for Multimodal Understanding

Authors: Shubham Patle, Sara Ghaboura, Hania Tariq, Mohammad Usman Khan, Omkar Thawakar, Rao Muhammad Anwer, Salman Khan

First: 2026-01-27T18:59:19+00:00 · Latest: 2026-01-27T18:59:19+00:00

Comments: Accepted to EACL-2026 (Main Track)

Abstract

Arabic calligraphy represents one of the richest visual traditions of the Arabic language, blending linguistic meaning with artistic form. Although multimodal models have advanced across languages, their ability to process Arabic script, especially in artistic and stylized calligraphic forms, remains largely unexplored. To address this gap, we present DuwatBench, a benchmark of 1,272 curated samples containing about 1,475 unique words across six classical and modern calligraphic styles, each paired with sentence-level detection annotations. The dataset reflects real-world challenges in Arabic writing, such as complex stroke patterns, dense ligatures, and stylistic variations that often challenge standard text recognition systems. Using DuwatBench, we evaluated 13 leading Arabic and multilingual multimodal models and showed that while they perform well on clean text, they struggle with calligraphic variation, artistic distortions, and precise visual-text alignment. By publicly releasing DuwatBench and its annotations, we aim to advance culturally grounded multimodal research, foster fair inclusion of the Arabic language and visual heritage in AI systems, and support continued progress in this area. Our dataset (https://huggingface.co/datasets/MBZUAI/DuwatBench) and evaluation suite (https://github.com/mbzuai-oryx/DuwatBench) are publicly available.

中文标题/摘要

标题：DuwatBench：通过阿拉伯书法基准促进语言与视觉遗产的融合以实现跨模态理解

阿拉伯书法是阿拉伯语言最丰富的视觉传统之一，将语言意义与艺术形式融为一体。尽管跨语言的多模态模型已经取得了进展，但它们处理阿拉伯书法的能力，尤其是艺术性和风格化的书法形式，仍然鲜有探索。为了解决这一差距，我们提出了DuwatBench，这是一个包含1,272个精心挑选的样本的数据集，这些样本涵盖了大约1,475个不同词汇，跨越了六种古典和现代书法风格，每个样本都配有关于句子级别的检测注释。该数据集反映了阿拉伯书写中的现实挑战，如复杂的笔画模式、密集的连字以及风格上的变化，这些往往给标准的文本识别系统带来了挑战。使用DuwatBench，我们评估了13个领先的阿拉伯语和多语言多模态模型，并展示了它们在处理书法变化、艺术变形和精确的视觉-文本对齐方面存在困难。通过公开发布DuwatBench及其注释，我们旨在推动文化背景下的多模态研究，促进阿拉伯语言和视觉遗产在人工智能系统中的公平包容，并支持该领域的持续进步。我们的数据集（https://huggingface.co/datasets/MBZUAI/DuwatBench）和评估工具（https://github.com/mbzuai-oryx/DuwatBench）已公开发布。

Summary / 总结

DuwatBench is a benchmark dataset for Arabic calligraphy that bridges language and visual heritage, containing 1,272 samples with sentence-level annotations across six calligraphic styles. It evaluates 13 leading Arabic and multilingual multimodal models, revealing their limitations in handling calligraphic variation and artistic distortions. The dataset aims to advance culturally grounded multimodal research and support the inclusion of Arabic visual heritage in AI systems.

DuwatBench 是一个包含 1,272 个样本的阿拉伯书法基准数据集，每个样本配有句子级别的注释，涵盖了六种书法风格。该数据集旨在解决多模态模型在处理阿拉伯书法，尤其是艺术性和风格化形式时的性能差距。对 13 个领先模型的评估显示，它们在书法变体和精确的视觉-文本对齐方面表现不佳，突显了需要改进对阿拉伯书法的多模态理解。通过发布此数据集和注释，作者旨在推动文化导向的多模态研究，并支持阿拉伯视觉遗产在人工智能系统中的包容性。

Self-Distillation Enables Continual Learning

Authors: Idan Shenfeld, Mehul Damani, Jonas Hübotter, Pulkit Agrawal

First: 2026-01-27T18:59:08+00:00 · Latest: 2026-01-27T18:59:08+00:00

Abstract

Continual learning, enabling models to acquire new skills and knowledge without degrading existing capabilities, remains a fundamental challenge for foundation models. While on-policy reinforcement learning can reduce forgetting, it requires explicit reward functions that are often unavailable. Learning from expert demonstrations, the primary alternative, is dominated by supervised fine-tuning (SFT), which is inherently off-policy. We introduce Self-Distillation Fine-Tuning (SDFT), a simple method that enables on-policy learning directly from demonstrations. SDFT leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills. Across skill learning and knowledge acquisition tasks, SDFT consistently outperforms SFT, achieving higher new-task accuracy while substantially reducing catastrophic forgetting. In sequential learning experiments, SDFT enables a single model to accumulate multiple skills over time without performance regression, establishing on-policy distillation as a practical path to continual learning from demonstrations.

中文标题/摘要

标题：自我蒸馏使连续学习成为可能

连续学习，使模型能够获取新技能和知识而不损害现有能力，仍然是基础模型面临的基本挑战。尽管在线策略强化学习可以减少遗忘，但它需要明确的奖励函数，这些函数往往不可用。从专家演示学习的主要替代方法是监督微调（SFT），这是固有的离策略方法。我们引入了自我蒸馏微调（SDFT），这是一种简单的方法，可以直接从演示中进行在线策略学习。SDFT 利用上下文学习，通过使用演示条件下的模型作为自己的教师，生成保留先前能力的同时获取新技能的在线策略训练信号。在技能学习和知识获取任务中，SDFT 一致优于 SFT，实现更高的新任务准确率，同时显著减少灾难性遗忘。在顺序学习实验中，SDFT 使单个模型能够在不出现性能退化的情况下随着时间的推移积累多种技能，确立了在线策略蒸馏作为从演示中实现连续学习的实用途径。

Summary / 总结

The paper addresses the challenge of continual learning in foundation models, where models need to acquire new skills without forgetting existing ones. It introduces Self-Distillation Fine-Tuning (SDFT), a method that uses a demonstration-conditioned model as its own teacher to generate on-policy training signals. SDFT outperforms supervised fine-tuning in both skill learning and knowledge acquisition tasks, achieving better new-task accuracy and reducing catastrophic forgetting. In sequential learning, SDFT allows a single model to learn multiple skills over time without performance degradation.

论文解决了基础模型在不断学习新技能时不忘记已有能力的挑战。它引入了自我蒸馏微调（SDFT）方法，通过使用演示条件下的模型作为自己的教师来生成基于策略的训练信号。该方法在技能学习和知识获取任务中均优于监督微调，显示出更高的新任务准确率和更低的灾难性遗忘。在顺序学习实验中，SDFT使单个模型能够在不降低性能的情况下逐步学习多个技能。

Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

Authors: Chen Chen, Lai Wei

First: 2026-01-27T18:58:46+00:00 · Latest: 2026-01-27T18:58:46+00:00

Abstract

Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.
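The contrast between the two connection types can be sketched in a few lines. The example below compares a ResNet-style Post-LN update, LN(x + F(x)), with a generic Highway-style gated update, LN(t * F(x) + (1 - t) * x). This is the textbook Highway form and may differ from Keel's actual formulation, which the abstract does not spell out. The `sublayer` function and the fixed scalar `gate` are hypothetical stand-ins chosen for illustration.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer(x):
    """Hypothetical stand-in for an attention or MLP sublayer F(x)."""
    return [math.tanh(0.5 * v) for v in x]

def post_ln_residual(x):
    """ResNet-style Post-LN step: LN(x + F(x))."""
    fx = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, fx)])

def post_ln_highway(x, gate=0.5):
    """Generic Highway-style Post-LN step: LN(t * F(x) + (1 - t) * x).

    `gate` (t) is a fixed scalar here purely for illustration; Highway
    networks normally learn it as a function of x.
    """
    fx = sublayer(x)
    return layer_norm([gate * b + (1.0 - gate) * a for a, b in zip(x, fx)])

x = [1.0, -2.0, 0.5, 3.0]
y_res = post_ln_residual(x)
y_hwy = post_ln_highway(x, gate=0.5)
```

In this sketch, setting `gate=0.0` reduces the Highway step to LN(x), so the input passes through unchanged apart from normalization, while `gate=1.0` reduces it to LN(F(x)).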

中文标题/摘要

标题：后层归一化又回来了：稳定、表达能力强且深度大

大型语言模型（LLM）的扩展遇到了瓶颈。增加模型宽度带来的收益递减，扩展上下文长度也无法从根本上提高表达能力。相比之下，深度扩展理论上能提供更好的表达能力，但当前的Transformer架构在极端深度下难以可靠地进行训练。我们重新审视了后层归一化（Post-LN）的表述，这种表述在大规模下不稳定，导致现代LLM中被前层归一化（Pre-LN）取代。我们展示了Keel，这是一种Post-LN Transformer，用Highway风格的连接替代了残差路径，从而保持了残差分支中的梯度流动，防止信号从顶层消失到底层。与之前的方法不同，Keel能够在不依赖特殊初始化或复杂优化技巧的情况下，在超过1000层的深度下稳定训练，并且在困惑度和深度扩展特性上优于Pre-LN。这些发现表明，当与Highway风格的连接结合时，Post-LN提供了一个简单而有效的基础，用于构建深度可扩展的LLM，开启了未来无限深度架构的可能性。

Summary / 总结

The paper addresses the challenge of training deep Transformers reliably, which is crucial for scaling large language models. It revisits the Post-LayerNorm (Post-LN) formulation, which was abandoned due to instability issues at scale. The authors propose Keel, a Post-LN Transformer that uses a Highway-style connection instead of the ResNet-style residual pathway, ensuring stable gradient flow and enabling reliable training at extreme depths. Keel outperforms Pre-LN models in terms of perplexity and depth-scaling characteristics, demonstrating the effectiveness of this approach for building deeply scalable language models.

论文通过重新审视Post-LayerNorm（Post-LN）公式来解决大规模语言模型（LLM）的扩展问题。它引入了Keel，这是一种使用Highway风格连接而非ResNet风格残差路径的Post-LN Transformer，以防止梯度消失，从而在极端深度下实现可靠的训练。Keel在困惑度和深度扩展特性上持续优于Pre-LN，表明Post-LN与Highway风格连接结合时，可以为构建深度可扩展的LLM提供简单有效的基础，开启了未来无限深度架构的可能性。

"Not in My Backyard": LLMs Uncover Online and Offline Social Biases Against Homelessness

Authors: Jonathan A. Karr, Benjamin F. Herbst, Matthew L. Sisk, Xueyun Li, Ting Hua, Matthew Hauenstein, Georgina Curto, Nitesh V. Chawla

First: 2025-08-14T17:58:34+00:00 · Latest: 2026-01-27T18:56:57+00:00

Abstract

Homelessness is a persistent social challenge, impacting millions worldwide. Over 876,000 people experienced homelessness (PEH) in the U.S. in 2025. Social bias is a significant barrier to alleviation, shaping public perception and influencing policymaking. Because online textual media and offline city council discourse both reflect and influence part of public opinion, they provide valuable insights for identifying and tracking social biases against PEH. We present a new, manually-annotated multi-domain dataset compiled from Reddit, X (formerly Twitter), news articles, and city council meeting minutes across ten U.S. cities. Our 16-category multi-label taxonomy creates a challenging long-tail classification problem: some categories appear in less than 1% of samples, while others exceed 70%. We find that small human-annotated datasets (1,702 samples) are insufficient for training effective classifiers, whether used to fine-tune encoder models or as few-shot examples for LLMs. To address this, we use GPT-4.1 to generate pseudo-labels on a larger unlabeled corpus. Training on this expanded dataset enables even small encoder models (ModernBERT, 150M parameters) to achieve 35.23 macro-F1, approaching GPT-4.1's 41.57. This demonstrates that data quantity matters more than model size, enabling low-cost, privacy-preserving deployment without relying on commercial APIs. Our results reveal that negative bias against PEH is prevalent both offline and online (especially on Reddit), with "not in my backyard" narratives showing the highest engagement. These findings uncover a type of ostracism that directly impacts poverty-reduction policymaking and provide actionable insights for practitioners addressing homelessness.
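Macro-F1 averages the per-category F1 scores with equal weight, so a category that appears in under 1% of samples counts as much as one that appears in over 70%. The sketch below illustrates this for multi-label predictions. The category names and toy data are hypothetical and are not taken from the paper's taxonomy or results.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one category, treating each sample's labels as a set."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
    if tp == 0:
        # Convention assumed here: F1 is 0 when there are no true positives.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-category F1 scores."""
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)

# Hypothetical categories: one frequent, one rare.
labels = ["common", "rare"]
y_true = [{"common"}, {"common"}, {"common"}, {"common", "rare"}]
y_pred = [{"common"}, {"common"}, {"common"}, {"common"}]

# The classifier is perfect on "common" but misses the single "rare" case.
score = macro_f1(y_true, y_pred, labels)
```

Here the frequent category scores an F1 of 1.0 and the rare one 0.0, so the macro average is 0.5 even though only one label out of five was missed.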

中文标题/摘要

标题："不在我后院": 大型语言模型揭示对无家可归者的线上线下社会偏见

无家可归是一个持续的社会挑战，影响着全世界数百万人。2025年，美国有超过876,000人经历无家可归（PEH）。社会偏见是缓解这一问题的重要障碍，影响公众认知并影响政策制定。鉴于在线文本媒体和线下城市议会讨论反映了部分公众意见并对其产生影响，它们提供了识别和追踪对PEH的社会偏见的重要见解。我们提出了一项新的、人工标注的多领域数据集，该数据集从Reddit、X（原Twitter）、新闻文章以及美国十个城市的市政会议记录中收集。我们的16类多标签分类体系构成了一个具有挑战性的长尾分类问题：一些类别在样本中出现的比例不到1%，而另一些类别则超过70%。我们发现，小规模的人工标注数据集（1,702个样本）不足以训练有效的分类器，无论是用于微调编码器模型还是作为LLM的少量示例。为了解决这个问题，我们使用GPT-4.1在更大规模的未标注语料库上生成伪标签。在扩展数据集上进行训练使即使是小型编码器模型（ModernBERT，1.5亿参数）也能达到35.23的宏F1值，接近GPT-4.1的41.57。这表明数据量比模型规模更重要，能够实现低成本、隐私保护的部署，无需依赖商业API。我们的研究结果揭示了无家可归者在线上线下（尤其是Reddit）都普遍存在负面偏见，"不在我后院"的叙事具有最高的参与度。这些发现揭示了一种直接影响减贫政策制定的排斥行为，并为解决无家可归问题的从业者提供了可操作的见解。

Summary / 总结

The paper aims to identify and track social biases against people experiencing homelessness (PEH) through online and offline data. It presents a manually-annotated dataset from various sources including Reddit, X, news articles, and city council meeting minutes. The study finds that small annotated datasets are insufficient for effective classification, and instead uses GPT-4.1 to generate pseudo-labels on a larger corpus, enabling a small encoder model (ModernBERT) to reach 35.23 macro-F1, close to GPT-4.1's 41.57. Key findings include the prevalence of negative bias against PEH, particularly on Reddit, and the high engagement with "not in my backyard" narratives.

论文旨在通过在线和离线数据识别和追踪对经历无家可归的人（PEH）的社会偏见。研究呈现了一个从Reddit、X、新闻文章和市政会议记录等多种来源手动标注的数据集。研究发现，小型标注数据集不足以进行有效的分类，而是使用GPT-4.1生成伪标签，即使使用小型模型也能达到高性能。关键发现包括PEH面临的负面偏见普遍存在，特别是在Reddit上，并且‘不在我的后院’的叙事具有最高的参与度。

VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction

Authors: Dominic Maggio, Luca Carlone

First: 2026-01-27T18:54:29+00:00 · Latest: 2026-01-27T18:54:29+00:00

Abstract

We present VGGT-SLAM 2.0, a real-time RGB feed-forward SLAM system which substantially improves upon VGGT-SLAM for incrementally aligning submaps created from VGGT. Firstly, we remove high-dimensional 15-degree-of-freedom drift and planar degeneracy from VGGT-SLAM by creating a new factor graph design while still addressing the reconstruction ambiguity of VGGT given unknown camera intrinsics. Secondly, by studying the attention layers of VGGT, we show that one of the layers is well suited to assist in image retrieval verification for free without additional training, which enables both rejecting false-positive matches and completing more loop closures. Finally, we conduct a suite of experiments which includes showing that VGGT-SLAM 2.0 can easily be adapted for open-set object detection and demonstrating real-time performance while running online onboard a ground robot using a Jetson Thor. We also test in environments ranging from cluttered indoor apartments and office scenes to a 4,200 square foot barn, and demonstrate that VGGT-SLAM 2.0 achieves the highest accuracy on the TUM dataset with about 23 percent less pose error than VGGT-SLAM. Code will be released upon publication.

中文标题/摘要

标题：VGGT-SLAM 2.0：实时密集前馈场景重建

我们提出了VGGT-SLAM 2.0，这是一种实时RGB前馈SLAM系统，相比VGGT-SLAM在逐步对齐由VGGT创建的子地图方面有了显著改进。首先，通过创建新的因子图设计，我们消除了VGGT-SLAM中的15自由度漂移和平面退化问题，同时解决了给定未知相机内参时VGGT的重建歧义性。其次，通过研究VGGT的注意力层，我们展示了其中一个层非常适合在无需额外训练的情况下辅助图像检索验证，从而能够拒绝假阳性匹配并允许完成更多的闭环。最后，我们进行了一系列实验，包括展示VGGT-SLAM 2.0可以轻松适应开放集物体检测，并在使用Jetson Thor在线运行于地面机器人上展示实时性能。我们还在从杂乱的室内公寓和办公室场景到4200平方英尺的谷仓的环境中进行了测试，并展示了VGGT-SLAM 2.0在TUM数据集上的精度最高，比VGGT-SLAM的位姿误差低约23%。代码将在发表后发布。

Summary / 总结

VGGT-SLAM 2.0 is a real-time RGB feed-forward SLAM system that improves upon VGGT-SLAM by addressing drift and planar degeneracy through a new factor graph design. It also leverages VGGT's attention layers for image retrieval verification, enhancing loop closures and reducing false positives. Experimental results show VGGT-SLAM 2.0 outperforms VGGT-SLAM with about 23 percent less pose error and demonstrates real-time performance on a ground robot. It also shows adaptability for open-set object detection and works effectively in various environments, including large barns and indoor spaces.

VGGT-SLAM 2.0 是一种实时 RGB 前馈 SLAM 系统，通过新的因子图设计解决了漂移和平面退化问题。它还利用 VGGT 的注意力层进行图像检索验证，提高了闭环检测并拒绝假阳性匹配。实验表明，VGGT-SLAM 2.0 的精度高于 VGGT-SLAM，姿态误差减少了约 23%，并在地面机器人上实现了实时性能。它在各种环境中表现良好，包括拥挤的室内空间和大型谷仓。

Demystifying the Roles of LLM Layers in Retrieval, Knowledge, and Reasoning

Authors: Xinyuan Song, Keyu Wang, PengXiang Li, Lu Yin, Shiwei Liu

Venue: ICASSP 2026

First: 2025-10-02T14:57:13+00:00 · Latest: 2026-01-27T18:53:30+00:00

Comments: Accepted by ICASSP 2026

Abstract

Recent studies suggest that the deeper layers of Large Language Models (LLMs) contribute little to representation learning and can often be removed without significant performance loss. However, such claims are typically drawn from narrow evaluations and may overlook important aspects of model behavior. In this work, we present a systematic study of depth utilization across diverse dimensions, including evaluation protocols, task categories, and model architectures. Our analysis confirms that very deep layers are generally less effective than earlier ones, but their contributions vary substantially with the evaluation setting. Under likelihood-based metrics without generation, pruning most layers preserves performance, with only the initial few being critical. By contrast, generation-based evaluation uncovers indispensable roles for middle and deeper layers in enabling reasoning and maintaining long-range coherence. We further find that knowledge and retrieval are concentrated in shallow components, whereas reasoning accuracy relies heavily on deeper layers -- yet can be reshaped through distillation. These results highlight that depth usage in LLMs is highly heterogeneous and context-dependent, underscoring the need for task-, metric-, and model-aware perspectives in both interpreting and compressing large models.

中文标题/摘要

标题：揭开大型语言模型层在检索、知识和推理中角色的面纱

近期研究表明，大型语言模型（LLMs）的深层结构对表示学习的贡献甚微，通常可以在不显著影响性能的情况下移除。然而，此类结论通常基于狭隘的评估，可能忽略了模型行为的重要方面。在本研究中，我们对深度利用进行了系统性研究，涵盖了评估协议、任务类别和模型架构等多个维度。我们的分析证实，非常深层的结构通常不如早期的结构有效，但其贡献会根据评估环境有很大差异。在基于似然性的度量中不涉及生成时，剪枝大部分层可以保持性能，只有最初的几层是关键的。相比之下，基于生成的评估揭示了中间和深层结构在实现推理和保持长程一致性方面不可或缺的作用。我们还发现，知识和检索集中在浅层组件中，而推理准确性则高度依赖于深层结构——但可以通过蒸馏重塑。这些结果表明，LLMs中的深度使用是高度异质性和环境依赖性的，强调了在解释和压缩大型模型时需要任务、度量和模型意识的重要性。

Summary / 总结

This study investigates the roles of different layers in Large Language Models (LLMs) across various evaluation settings, finding that while deeper layers are less effective in representation learning, they play crucial roles in reasoning and maintaining long-range coherence during generation tasks. Knowledge and retrieval are concentrated in shallow layers, whereas reasoning accuracy depends heavily on deeper layers, although this dependence can be reshaped through distillation. The results suggest that the usage of depth in LLMs is highly context-dependent and requires a task-, metric-, and model-aware approach.

研究探讨了不同层在大型语言模型（LLMs）中的作用，发现较深的层在表示学习中效果较差，但在生成任务中的推理和长程连贯性方面起着关键作用。浅层层对于知识和检索更重要，而深层层对于推理准确性至关重要，尽管这一点可以通过蒸馏来改变。结果表明，深度在LLMs中的使用高度依赖于上下文，并需要任务、度量和模型的视角。

Reflective Translation: Improving Low-Resource Machine Translation via Structured Self-Reflection

Authors: Nicholas Cheng

Venue: NeurIPS 2025

First: 2026-01-27T18:37:09+00:00 · Latest: 2026-01-27T18:37:09+00:00

Comments: 12 pages, 3 figures, 6 tables. Accepted to the NeurIPS 2025 Workshop on Multilingual Representation Learning (Mexico City) and the AAAI 2025 Workshop on Language Models for Under-Resourced Communities (LM4UC). Code and data available at: https://github.com/Nickcheng123/reflective-translation-mt

Abstract

Low-resource languages such as isiZulu and isiXhosa face persistent challenges in machine translation due to limited parallel data and linguistic resources. Recent advances in large language models suggest that self-reflection, prompting a model to critique and revise its own outputs, can improve reasoning quality and factual consistency. Building on this idea, this paper introduces Reflective Translation, a prompt-based framework in which a model generates an initial translation, produces a structured self-critique, and then uses this reflection to generate a refined translation. The approach is evaluated on English-isiZulu and English-isiXhosa translation using OPUS-100 and NTREX-African, across multiple prompting strategies and confidence thresholds. Results show consistent improvements in both BLEU and COMET scores between first- and second-pass translations, with average gains of up to +0.22 BLEU and +0.18 COMET. Statistical significance testing using paired nonparametric tests confirms that these improvements are robust. The proposed method is model-agnostic, requires no fine-tuning, and introduces a reflection-augmented dataset that can support future supervised or analysis-driven work. These findings demonstrate that structured self-reflection is a practical and effective mechanism for improving translation quality in low-resource settings.
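The three-step loop described in the abstract (draft, structured self-critique, refined translation) can be sketched as below. This is a minimal illustration, not the authors' implementation: the prompt wording, the `llm` callable, and the `stub_llm` stand-in are all hypothetical, and the confidence thresholds mentioned in the abstract are not modelled.

```python
from typing import Callable, Dict

def reflective_translate(source: str, target_lang: str,
                         llm: Callable[[str], str]) -> Dict[str, str]:
    """Run a draft -> critique -> refine pass using any prompt-to-text callable."""
    # Pass 1: initial translation.
    draft = llm(f"Translate the following English text into {target_lang}:\n{source}")
    # Structured self-critique of the draft (hypothetical prompt wording).
    critique = llm(
        "Critique this translation. List errors in accuracy, fluency, and terminology.\n"
        f"Source: {source}\nTranslation: {draft}"
    )
    # Pass 2: refined translation conditioned on the critique.
    refined = llm(
        f"Revise the translation into {target_lang} using the critique.\n"
        f"Source: {source}\nDraft: {draft}\nCritique: {critique}"
    )
    return {"draft": draft, "critique": critique, "refined": refined}

def stub_llm(prompt: str) -> str:
    """Stand-in model for a dry run; replace with a real client call."""
    if prompt.startswith("Translate"):
        return "draft-translation"
    if prompt.startswith("Critique"):
        return "critique-notes"
    return "refined-translation"

result = reflective_translate("Good morning.", "isiZulu", stub_llm)
```

Because the function only needs a prompt-to-text callable, the same loop works with any model backend, which matches the abstract's claim that the approach is model-agnostic and needs no fine-tuning.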

中文标题/摘要

标题：反思性翻译：通过结构化自我反思提高低资源机器翻译

低资源语言如祖鲁语和科萨语由于平行数据和语言资源有限，在机器翻译中面临持续挑战。大型语言模型的最新进展表明，自我反思，即促使模型自我批判和修订其输出，可以提高推理质量和事实一致性。基于这一理念，本文引入了反思性翻译，这是一种基于提示的框架，模型首先生成初始翻译，然后生成结构化的自我批判，并利用这种反思生成改进后的翻译。该方法使用OPUS-100和NTREX-非洲语对英-祖鲁语和英-科萨语翻译进行了评估，采用多种提示策略和置信阈值。结果显示，第一稿和第二稿之间的BLEU和COMET分数均有持续改进，平均增幅分别为+0.22 BLEU和+0.18 COMET。配对非参数检验统计显著性测试证实了这些改进的稳健性。所提方法具有模型通用性，无需微调，并引入了一个反思增强的数据集，可支持未来的监督或分析驱动工作。这些发现表明，结构化自我反思是提高低资源环境翻译质量的一种实用且有效机制。

Summary / 总结

This paper addresses the challenges of machine translation for low-resource languages like isiZulu and isiXhosa by proposing Reflective Translation, a framework that prompts models to generate a self-critique after an initial translation. The method consistently improves translation quality, as evidenced by up to +0.22 BLEU and +0.18 COMET score gains. Statistical tests confirm the robustness of these improvements. The approach is model-agnostic and does not require fine-tuning, making it a practical solution for low-resource settings.

本文通过引入Reflective Translation框架，该框架促使模型在初次翻译后生成自我批判，以解决低资源语言如祖鲁语和科萨语的机器翻译挑战。这一过程导致了改进的翻译，BLEU和COMET分数得到了一致的提升，平均增幅分别为+0.22 BLEU和+0.18 COMET。该方法无需微调，且提供了用于未来工作的反思增强数据集。

LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

Authors: Obed Junias, Maria Leonor Pacheco

First: 2026-01-23T07:07:19+00:00 · Latest: 2026-01-27T18:33:20+00:00

Abstract

Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
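One way to read the three operators is as Boolean composition over the plausibility of two atomic statements. The sketch below assumes that reading: AND holds when both statements are plausible, OR when at least one is, and NEITHER/NOR when neither is. The benchmark's exact definitions may differ, and the function name is hypothetical.

```python
def compose(a_plausible: bool, b_plausible: bool, operator: str) -> bool:
    """Return whether the composite statement holds under the assumed semantics."""
    if operator == "AND":
        return a_plausible and b_plausible
    if operator == "OR":
        return a_plausible or b_plausible
    if operator == "NEITHER/NOR":
        return not a_plausible and not b_plausible
    raise ValueError(f"unknown operator: {operator}")

# Full truth table over both statements and all three operators.
truth_table = {
    (a, b, op): compose(a, b, op)
    for a in (True, False)
    for b in (True, False)
    for op in ("AND", "OR", "NEITHER/NOR")
}
```

Under this reading, a NEITHER/NOR question is answered correctly only when the model rejects both statements, which is the negation-based case where the abstract reports the sharpest drop in performance.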

中文标题/摘要

标题：LOGICAL-COMMONSENSEQA：逻辑常识推理基准

常识推理通常涉及评估多个合理的解释，而不是选择单一的原子答案，然而大多数基准依赖于单标签评估，掩盖了陈述是否联合合理、相互排斥或联合不合理。我们引入了LOGICAL-COMMONSENSEQA，这是一个将常识推理重新定义为使用合理性级别操作符（AND，OR，NEITHER/NOR）对原子陈述进行逻辑组合的基准。在零样本、少量样本和链式思考提示下评估指令调优、推理专业化和微调模型，我们发现模型在联合推理方面表现合理，在析取推理方面表现适度，但在基于否定的问题上表现急剧下降。LOGICAL-COMMONSENSEQA 暴露了基本的推理限制，并提供了一个可控的框架以推进组合常识推理。

Summary / 总结

The research aims to evaluate commonsense reasoning by considering multiple plausible interpretations rather than single answers, addressing the limitations of existing benchmarks. The study introduces LOGICAL-COMMONSENSEQA, which assesses models on their ability to handle logical compositions of pairs of statements using AND, OR, and NEITHER/NOR operators. Key findings show that models perform well on conjunctive and moderately on disjunctive reasoning but struggle with negation-based questions, highlighting fundamental limitations in compositional commonsense reasoning.

研究旨在通过考虑多种可能的解释来评估常识推理，解决现有基准的局限性。研究引入了LOGICAL-COMMONSENSEQA，该基准评估模型在使用AND、OR和NEITHER/NOR操作符处理成对语句的逻辑组合时的能力。关键发现表明，模型在合取推理方面表现良好，在析取推理方面表现适度，但在否定推理方面表现不佳，突显了组合常识推理中的基本局限性。

Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing

Authors: Cong Cao, Yujie Xu, Xiaodong Xu

First: 2025-11-14T12:40:21+00:00 · Latest: 2026-01-27T18:27:31+00:00

Comments: Technical report

Abstract

In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters. Our code and dataset are available at https://github.com/cao-cong/FSMSE.

中文标题/摘要

标题：参数高效MoE LoRA在少量示例多风格编辑中的应用

近年来，图像编辑引起了越来越多的关注。然而，通用图像编辑模型在面对新风格时往往无法产生令人满意的结果。挑战在于如何仅使用少量配对数据有效地微调通用图像编辑模型以适应新风格。为了解决这一问题，本文提出了一种新颖的少量示例风格编辑框架。为此任务，我们构建了一个包含五种不同风格的基准数据集。相应地，我们提出了一种参数高效的多风格Mixture-of-Experts Low-Rank Adaptation（MoE LoRA），并采用风格特定和风格共享的路由机制共同微调多种风格。风格特定的路由机制确保不同风格之间不会相互干扰，而风格共享的路由机制则能够自适应地分配共享的MoE LoRAs以学习共性模式。我们的MoE LoRA可以通过一种新颖的基于度量的方法自动确定每一层的最佳秩，该方法估计了每个单一秩组件的重要性得分。此外，我们探索了在Transformer中的扩散（DiT）模型中插入LoRA的最佳位置，并结合对抗学习和流匹配来引导扩散训练过程。实验结果表明，与现有最先进的方法相比，我们的方法在显著减少LoRA参数的情况下表现出更优的效果。我们的代码和数据集可在https://github.com/cao-cong/FSMSE上获取。

Summary / 总结

This paper addresses the challenge of fine-tuning general image editing models for new styles with limited paired data. It introduces a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) framework with style-specific and style-shared routing mechanisms. The method automatically determines optimal ranks for each layer and integrates adversarial learning and flow matching to enhance diffusion training. Experiments show that the proposed method outperforms existing approaches with fewer LoRA parameters.

该论文旨在解决使用有限配对数据将通用图像编辑模型调整到新风格的挑战。它提出了一种参数高效的多风格Mixture-of-Experts Low-Rank Adaptation (MoE LoRA)框架，包含风格特定和风格共享路由机制。该方法能够自动确定每层的最佳秩，并结合对抗学习和流匹配来增强扩散训练过程。实验结果表明，所提出的方法在更少的LoRA参数下优于现有方法。

Calibration without Ground Truth

Authors: Yuqing Kong, Mingyu Song, Yizhou Wang, Yifan Wu

First: 2026-01-27T18:18:47+00:00 · Latest: 2026-01-27T18:18:47+00:00

Abstract

Villalobos et al. [2024] predict that publicly available human text will be exhausted within the next decade. Thus, improving models without access to ground-truth labels becomes increasingly important. We propose a label-free post-processing framework that improves a strong but miscalibrated model using a weaker yet better-calibrated reference. Our framework guarantees a strict performance improvement under any proper loss. Our approach is based on a characterization of when strict improvement is possible: when the strong and reference models are not mutually calibrated. We formalize this condition, connect it to arbitrage and no-trade results from economics, and develop an efficient Bregman projection algorithm that guarantees worst-case loss reduction without labels. Experiments on representative LLMs across varying scales demonstrate that our label-free method significantly reduces proper losses and calibration errors, achieving performance competitive with supervised baselines.

中文标题/摘要

标题：无需地面真实值的校准

Villalobos等人[2024]预测，公开的人类文本将在未来十年内耗尽。因此，在无法访问地面真实标签的情况下提高模型变得越来越重要。我们提出了一种无标签后处理框架，该框架使用较弱但校准更好的参考模型来改进一个强大但校准不良的模型。我们的框架在任何适当的损失下都能保证严格性能提升。我们的方法基于严格改进何时可能的表征：当强大模型和参考模型不相互校准时。我们对该条件进行了形式化，将其与经济学中的套利和不交易结果联系起来，并开发了一种高效的Bregman投影算法，该算法在无标签的情况下保证最坏情况下的损失减少。在不同规模的代表性LLM上的实验表明，我们的无标签方法显著减少了适当的损失和校准误差，实现了与监督基线相当的性能。

Summary / 总结

The research aims to improve model performance without using ground-truth labels, which will become crucial as publicly available human text diminishes. The authors propose a label-free post-processing framework that enhances a miscalibrated strong model using a better-calibrated but weaker reference model. This framework ensures a strict performance improvement under any proper loss. Experiments show that the proposed method significantly reduces proper losses and calibration errors, achieving performance comparable to supervised baselines.

研究旨在在不使用 ground-truth 标签的情况下提高模型性能，因为公共可用的人类文本预计将在不久的将来耗尽。作者提出了一种无标签的后处理框架，该框架利用一个校准更好但较弱的参考模型来增强一个校准较差但较强的模型。该框架在任何适当损失下都能确保严格性能改进。实验表明，所提出的方法显著减少了适当损失和校准误差，并且性能与监督基准相当。

MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents

Authors: Lukas Aichberger, Alasdair Paren, Guohao Li, Philip Torr, Yarin Gal, Adel Bibi

Venue: NeurIPS 2025

First: 2025-03-13T18:59:12+00:00 · Latest: 2026-01-27T18:10:17+00:00

Comments: NeurIPS 2025

Abstract

Recent advances in operating system (OS) agents have enabled vision-language models (VLMs) to directly control a user's computer. Unlike conventional VLMs that passively output text, OS agents autonomously perform computer-based tasks in response to a single user prompt. OS agents do so by capturing, parsing, and analysing screenshots and executing low-level actions via application programming interfaces (APIs), such as mouse clicks and keyboard inputs. This direct interaction with the OS significantly raises the stakes, as failures or manipulations can have immediate and tangible consequences. In this work, we uncover a novel attack vector against these OS agents: Malicious Image Patches (MIPs), adversarially perturbed screen regions that, when captured by an OS agent, induce it to perform harmful actions by exploiting specific APIs. For instance, a MIP can be embedded in a desktop wallpaper or shared on social media to cause an OS agent to exfiltrate sensitive user data. We show that MIPs generalise across user prompts and screen configurations, and that they can hijack multiple OS agents even during the execution of benign instructions. These findings expose critical security vulnerabilities in OS agents that have to be carefully addressed before their widespread deployment.

中文标题/摘要

标题：MIP对抗代理：恶意图像补丁劫持多模态OS代理

近年来操作系统(OS)代理的进步使视觉语言模型(VLMs)能够直接控制用户的计算机。与传统的被动输出文本的VLMs不同，OS代理能够自主地根据单一用户指令执行基于计算机的任务。OS代理通过捕获、解析和分析屏幕截图，并通过应用程序编程接口(APIs)执行低级操作（如鼠标点击和键盘输入）来实现这一目标。这种直接与OS的交互显著提高了风险，因为失败或操纵可能会立即产生实际后果。在本研究中，我们发现了一种针对这些OS代理的新攻击向量：恶意图像补丁(MIPs)，这些是通过对抗性扰动屏幕区域生成的，当被OS代理捕获时，会利用特定的APIs诱导其执行有害操作。例如，MIP可以嵌入在桌面上的壁纸中或在社交媒体上分享，以使OS代理泄露敏感用户数据。我们展示了MIPs在用户指令和屏幕配置方面具有泛化能力，并且即使在执行良性指令期间也能劫持多个OS代理。这些发现揭示了OS代理中关键的安全漏洞，这些漏洞在广泛部署之前必须仔细解决。

Summary / 总结

This research investigates the security risks of OS agents, which use vision-language models to control a user's computer directly. The study introduces Malicious Image Patches (MIPs), a novel attack vector: adversarially perturbed screen regions that, when captured in an agent's screenshot, induce it to perform harmful actions. The key findings show that MIPs generalize across user prompts and screen configurations and can hijack multiple OS agents even while they execute benign instructions, exposing critical security vulnerabilities.

该研究探讨了基于视觉语言模型的OS代理可能带来的安全风险。研究引入了恶意图像补丁（MIPs）作为新型攻击向量，这些补丁是经过对抗性扰动的图像，当被OS代理捕获时会导致其执行有害操作。主要发现表明，MIPs可以在不同的用户提示和屏幕配置下泛化，并且即使在执行良性任务时也能劫持多个OS代理，这揭示了OS代理中的关键安全漏洞。

MLVTG: Mamba-Based Feature Alignment and LLM-Driven Purification for Multi-Modal Video Temporal Grounding

Authors: Zhiyi Zhu, Xiaoyu Wu, Zihao Liu, Linlin Yang

First: 2025-06-10T07:20:12+00:00 · Latest: 2026-01-27T18:07:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Video Temporal Grounding (VTG), which aims to localize video clips corresponding to natural language queries, is a fundamental yet challenging task in video understanding. Existing Transformer-based methods often suffer from redundant attention and suboptimal multi-modal alignment. To address these limitations, we propose MLVTG, a novel framework that integrates two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks as a backbone instead of Transformers to model temporal dependencies and extract robust video representations for multi-modal alignment. LLMRefiner leverages the specific frozen layer of a pre-trained Large Language Model (LLM) to implicitly transfer semantic priors, enhancing multi-modal alignment without fine-tuning. This dual alignment strategy, temporal modeling via structured state-space dynamics and semantic purification via textual priors, enables more precise localization. Extensive experiments on QVHighlights, Charades-STA, and TVSum demonstrate that MLVTG achieves state-of-the-art performance and significantly outperforms existing baselines.

中文标题/摘要

标题：MLVTG：基于Mamba的特征对齐和LLM驱动的多模态视频时间定位净化

视频时间定位（VTG），旨在定位与自然语言查询对应的视频片段，是视频理解中的一个基本但具有挑战性的任务。现有的基于Transformer的方法往往受到冗余注意力和次优多模态对齐的困扰。为了解决这些限制，我们提出了一种名为MLVTG的新框架，该框架集成了两个关键模块：MambaAligner和LLMRefiner。MambaAligner使用堆叠的Vision Mamba块作为骨干，而不是Transformer，以建模时间依赖关系并提取用于多模态对齐的稳健视频表示。LLMRefiner利用预训练大型语言模型（LLM）的特定冻结层来隐式转移语义先验，增强多模态对齐而不进行微调。这种双重对齐策略，通过结构化的状态空间动力学进行时间建模和通过文本先验进行语义净化，能够实现更精确的定位。在QVHighlights、Charades-STA和TVSum上的广泛实验表明，MLVTG达到了最先进的性能，并显著优于现有基线。

Summary / 总结

The research aims to improve the localization of video clips that correspond to natural language queries by addressing the redundant attention and suboptimal multi-modal alignment of existing Transformer-based methods. The authors propose MLVTG, a framework with two key modules: MambaAligner and LLMRefiner. MambaAligner uses stacked Vision Mamba blocks to model temporal dependencies and extract robust video representations, while LLMRefiner uses a frozen layer of a pre-trained Large Language Model to enhance multi-modal alignment without fine-tuning. Experiments on QVHighlights, Charades-STA, and TVSum show that MLVTG achieves state-of-the-art performance and outperforms existing baselines.

研究旨在通过解决现有Transformer方法的局限性，提高自然语言查询匹配的视频片段定位准确性。MLVTG引入了MambaAligner和LLMRefiner模块。MambaAligner使用Vision Mamba块建模时间依赖关系并提取稳健的视频特征，而LLMRefiner利用预训练的大语言模型增强多模态对齐。实验表明，MLVTG在QVHighlights、Charades-STA和TVSum数据集上优于现有方法。

Generative Latent Alignment for Interpretable Radar Based Occupancy Detection in Ambient Assisted Living

Authors: Huy Trinh

First: 2026-01-27T18:06:51+00:00 · Latest: 2026-01-27T18:06:51+00:00

Abs · PDF · Code1 · Code2

Abstract

In this work, we study how to make mmWave radar presence detection more interpretable for Ambient Assisted Living (AAL) settings, where camera-based sensing raises privacy concerns. We propose a Generative Latent Alignment (GLA) framework that combines a lightweight convolutional variational autoencoder with a frozen CLIP text encoder to learn a low-dimensional latent representation of radar Range-Angle (RA) heatmaps. The latent space is softly aligned with two semantic anchors corresponding to "empty room" and "person present", and Grad-CAM is applied in this aligned latent space to visualize which spatial regions support each presence decision. On our mmWave radar dataset, we qualitatively observe that the "person present" class produces compact Grad-CAM blobs that coincide with strong RA returns, whereas "empty room" samples yield diffuse or no evidence. We also conduct an ablation study using unrelated text prompts, which degrades both reconstruction and localization, suggesting that radar-specific anchors are important for meaningful explanations in this setting.

中文标题/摘要

标题：基于生成潜在对齐的雷达占用检测在辅助生活中的可解释性

在本工作中，我们研究如何使毫米波雷达存在检测在辅助生活（AAL）环境中更具可解释性，其中基于摄像头的传感会引发隐私问题。我们提出了一种生成潜在对齐（GLA）框架，该框架结合了轻量级卷积变分自编码器和冻结的CLIP文本编码器，以学习雷达距离-角度（RA）热图的低维潜在表示。潜在空间通过两个语义锚点“空房间”和“有人存在”进行柔和对齐，并在对齐的潜在空间中应用Grad-CAM以可视化哪些空间区域支持每个存在决策。在我们的毫米波雷达数据集中，我们观察到“有人存在”类别的Grad-CAM斑块是紧凑的，并且与强烈的RA返回重合，而“空房间”样本则产生模糊或没有证据。我们还使用不相关的文本提示进行了消融研究，这降低了重建和定位性能，表明雷达特定的锚点对于此设置中的有意义解释很重要。
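The soft alignment of a latent vector with the two semantic anchors can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the anchors are hand-written 2-D vectors rather than CLIP text embeddings, and the cosine-based loss is one plausible form of "soft alignment", not necessarily the objective used in the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def alignment_loss(latent, anchor):
    """Hypothetical alignment term: 0 when the latent points along its anchor."""
    return 1.0 - cosine(latent, anchor)

def predict(latent, anchors):
    """Pick the anchor whose direction is closest to the latent vector."""
    return max(anchors, key=lambda name: cosine(latent, anchors[name]))

# Toy stand-ins for the two text-anchor embeddings (illustrative values only).
anchors = {"empty room": [1.0, 0.0], "person present": [0.0, 1.0]}

latent = [0.2, 0.9]  # toy latent code of one radar heatmap
print(predict(latent, anchors))
print(alignment_loss(latent, anchors["person present"]))
```

In a real setup the anchors would come from the frozen text encoder and the loss would be added to the autoencoder's reconstruction objective with some weighting, which this sketch leaves out.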

Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries

Authors: Kevin Robbins, Xiaotong Liu, Yu Wu, Le Sun, Grady McPeak, Abby Stylianou, Robert Pless

First: 2026-01-24T17:30:23+00:00 · Latest: 2026-01-27T18:04:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that adding generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives a user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.

中文标题/摘要

标题：零样本：预测任意查询的零样本分类性能

像CLIP这样的视觉-语言模型创建了文本和图像对齐的嵌入空间，使得任何人都可以通过简单地命名他们想要区分的类别来构建视觉分类器。然而，在一个领域表现良好的模型在另一个领域可能会失败，非专家用户没有直接的方法来评估他们选择的VLM是否适用于他们的问题。我们在此前仅使用文本比较的工作基础上，评估模型在给定自然语言任务中的表现，并探索生成与该任务相关的合成图像的方法来评估和改进零样本准确性的预测。我们展示了生成的图像相对于基线文本仅比较分数显著提高了这些预测的质量。此外，它还为用户提供反馈，说明了用于评估的图像类型。在标准CLIP基准数据集上的实验表明，基于图像的方法帮助用户在没有任何标记示例的情况下预测VLM是否适用于他们的应用。

Summary / 总结

The research aims to predict how well a vision-language model such as CLIP will perform zero-shot classification for arbitrary queries, so that non-expert users can judge whether a chosen model suits their problem. The study builds on text-only comparisons and adds synthetic images generated for the task, showing that the generated imagery improves prediction quality and gives users feedback on the kinds of images used in the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based method helps users assess the effectiveness of a VLM for their application without labeled examples.

研究旨在预测Vision-Language模型（VLM）在任意查询下的零样本分类性能，解决非专家用户评估模型适用性的难题。方法包括使用文本比较，并通过生成与任务相关的合成图像来改进预测。实验表明，结合生成的图像可以显著提高零样本准确性的预测质量，并为用户提供有关用于评估的图像类型的信息，使用户能够在无需标注样本的情况下更好地预测VLM是否适用于其特定应用。

TableMaster: A Recipe to Advance Table Understanding with Language Models

Authors: Lang Cao, Hanbing Liu

First: 2025-01-31T18:31:31+00:00 · Latest: 2026-01-27T18:04:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Tables serve as a fundamental format for representing structured relational data. While current language models (LMs) excel at many text-based tasks, they still face challenges in table understanding due to the complex characteristics of tabular data, such as their structured nature. In this paper, we aim to enhance LMs for improved table understanding. We identify four key challenges: 1) difficulty in locating target data, 2) deficiency in table semantics, 3) numerical inaccuracies in textual reasoning, and 4) semantic inflexibility in symbolic reasoning. To address these issues, we propose TableMaster, a recipe and comprehensive framework that integrates multiple solutions to overcome these obstacles. TableMaster first extracts relevant table content and verbalizes it with enriched semantic context. Additionally, we introduce adaptive reasoning, a flexible approach that dynamically adjusts between textual and symbolic reasoning, tailoring the reasoning process to each query. Extensive analyses and experiments demonstrate our findings and the effectiveness of TableMaster. On the WikiTQ dataset, TableMaster achieves an accuracy of 78.13% using GPT-4o-mini, surpassing existing baselines. We hope this work will serve as a practical step toward more robust and reliable table understanding.

中文标题/摘要

标题：TableMaster：一种利用语言模型提升表格理解的方法

表格是表示结构化关系数据的基本格式。尽管当前的语言模型（LMs）在许多文本任务上表现出色，但在表格理解方面仍面临挑战，因为表格数据具有复杂的特性，如其结构化性质。在本文中，我们旨在通过增强LMs来提高表格理解能力。我们确定了四个关键挑战：1）目标数据定位困难，2）表格语义不足，3）文本推理中的数值不准确，4）符号推理中的语义灵活性不足。为了解决这些问题，我们提出了TableMaster，这是一种综合框架，结合了多种解决方案以克服这些障碍。TableMaster首先提取相关表格内容，并以丰富的语义上下文进行表达化。此外，我们引入了适应性推理，这是一种灵活的方法，能够动态调整文本推理和符号推理之间的平衡，根据每个查询调整推理过程。广泛的分析和实验证明了我们的发现以及TableMaster的有效性。在WikiTQ数据集上，使用GPT-4o-mini，TableMaster的准确率为78.13%，超过了现有基线。我们希望这项工作能够成为更稳健和可靠的表格理解的实用步骤。

Summary / 总结

This paper aims to enhance language models for better table understanding by addressing four key challenges: difficulty in locating target data, deficient table semantics, numerical inaccuracies in textual reasoning, and semantic inflexibility in symbolic reasoning. The proposed TableMaster framework extracts relevant table content, verbalizes it with enriched semantic context, and introduces adaptive reasoning, which dynamically switches between textual and symbolic reasoning for each query. Experiments show that TableMaster achieves 78.13% accuracy on the WikiTQ dataset using GPT-4o-mini, surpassing existing baselines.

论文旨在通过解决四个关键问题来提高语言模型对表格的理解能力：目标数据的定位、表格语义的理解、文本推理中的数值不准确性和符号推理中的语义灵活性。为此，提出了TableMaster，它提取相关表格内容并用丰富的语义上下文进行表达。此外，还引入了适应性推理，可以根据查询动态切换文本和符号推理。实验结果显示，TableMaster 使用 GPT-4o-mini 在 WikiTQ 数据集上达到了 78.13% 的准确率，超过了现有方法。

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

Authors: Piotr Nawrot, Robert Li, Renjie Huang, Sebastian Ruder, Kelly Marchisio, Edoardo M. Ponti

First: 2025-04-24T17:39:25+00:00 · Latest: 2026-01-27T17:59:04+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with the largest-scale empirical analysis to date of training-free sparse attention, evaluating six methods across multiple model families and sizes, sequences up to 128K tokens, and sparsity levels up to 0.95 (i.e., $1/20$ attention budget) on nine diverse tasks. We first organise the rapidly evolving landscape of sparse attention methods into a taxonomy along four design axes. Our analysis then yields actionable insights: 1) sparse attention is effective -- larger sparse models outperform smaller dense ones at equivalent cost, improving the Pareto frontier; 2) due to computational constraints, token-to-page importance estimation is unfeasible during prefilling, where the choice of an alternative solution (global-to-token or block-to-block) depends on the task, but is possible during decoding, enabling better generalisation and tolerance to higher sparsity; 3) longer sequences tolerate higher sparsity, suggesting that fixed-budget methods in production are suboptimal. Together, these findings provide practical guidance for deploying sparse attention and methodological recommendations for future evaluations. Our code is available at https://github.com/PiotrNawrot/sparse-frontier.

中文标题/摘要

标题：稀疏前沿：Transformer大模型中稀疏注意机制的效率-准确度权衡

稀疏注意机制为扩展Transformer大模型的长上下文能力提供了有希望的策略，但由于缺乏全面评估，其效率-准确度权衡尚不明确。我们通过迄今为止最大规模的经验分析填补了这一空白，评估了六种方法在多个模型家族和规模、最多128K个标记的序列以及高达0.95（即1/20的注意预算）的稀疏水平上的表现，共涉及九个不同的任务。我们首先按照四个设计轴将快速发展的稀疏注意方法分类。我们的分析提供了可操作的见解：1) 稀疏注意是有效的——较大的稀疏模型在同等成本下优于较小的密集模型，改善了帕累托前沿；2) 由于计算限制，在填充期间无法估计标记到页面的重要性，任务的不同决定了替代方案（全局到标记或块到块）的选择，但在解码期间是可行的，这有助于更好的泛化和对更高稀疏度的容忍；3) 较长的序列可以容忍更高的稀疏度，表明固定预算的方法在生产中可能是次优的。这些发现共同提供了部署稀疏注意的实际指导，并为未来评估提供了方法论建议。我们的代码可在https://github.com/PiotrNawrot/sparse-frontier/ 获取。

Summary / 总结

The study investigates the efficiency-accuracy trade-offs of training-free sparse attention in Transformer LLMs by evaluating six methods across multiple model families and sizes, sequences up to 128K tokens, and sparsity levels up to 0.95 on nine diverse tasks. It finds that larger sparse models outperform smaller dense models at equivalent cost. Token-to-page importance estimation is infeasible during prefilling, where the best alternative (global-to-token or block-to-block) depends on the task, but it is possible during decoding, where it enables better generalization and tolerance to higher sparsity. Longer sequences tolerate higher sparsity, suggesting that fixed-budget methods are suboptimal in production settings.

研究通过在多种模型规模和稀疏程度下评估六种稀疏注意机制方法，并在九种不同的任务上进行测试，探讨了稀疏注意在Transformer LLM中的效率-准确度权衡。研究发现，较大的稀疏模型在同等成本下优于较小的密集模型，并且在填充期间根据任务的不同，选择令牌到页面重要性估计的方法，但在解码期间是可行的，这增强了泛化能力和对更高稀疏度的容忍度。较长的序列可以容忍更高的稀疏度，表明固定预算的方法在生产环境中可能不是最优的。
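The relation between sparsity and attention budget quoted in the abstract (sparsity 0.95 corresponds to a 1/20 budget) can be checked with a few lines of arithmetic. This is only a worked example: the function names are hypothetical, and reading "128K tokens" as 128 × 1024 is an assumption.

```python
from fractions import Fraction

def attention_budget(sparsity: str) -> Fraction:
    """Fraction of attention entries kept: budget = 1 - sparsity."""
    return 1 - Fraction(sparsity)

def tokens_attended(seq_len: int, sparsity: str) -> int:
    """Number of keys each query may attend to, rounded down."""
    return int(seq_len * attention_budget(sparsity))

budget = attention_budget("0.95")
print(budget)                                # 1/20
print(tokens_attended(128 * 1024, "0.95"))   # 6553
```

Exact fractions are used instead of floats so that 1 - 0.95 comes out as exactly 1/20 rather than a rounded binary value.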

EgoHandICL: Egocentric 3D Hand Reconstruction with In-Context Learning

Authors: Binzhu Xie, Shi Qiu, Sicheng Zhang, Yinqiao Wang, Hao Xu, Muzammal Naseer, Chi-Wing Fu, Pheng-Ann Heng

Venue: ICLR 2026

First: 2026-01-27T17:58:12+00:00 · Latest: 2026-01-27T17:58:12+00:00

Comments: Accepted in ICLR 2026, Codebase: https://github.com/Nicous20/EgoHandICL

Abs · PDF · Code1 · Code2 · Code3

Abstract

Robust 3D hand reconstruction in egocentric vision is challenging due to depth ambiguity, self-occlusion, and complex hand-object interactions. Prior methods mitigate these issues by scaling training data or adding auxiliary cues, but they often struggle in unseen contexts. We present EgoHandICL, the first in-context learning (ICL) framework for 3D hand reconstruction that improves semantic alignment, visual consistency, and robustness under challenging egocentric conditions. EgoHandICL introduces complementary exemplar retrieval guided by vision-language models (VLMs), an ICL-tailored tokenizer for multimodal context, and a masked autoencoder (MAE)-based architecture trained with hand-guided geometric and perceptual objectives. Experiments on ARCTIC and EgoExo4D show consistent gains over state-of-the-art methods. We also demonstrate real-world generalization and improve EgoVLM hand-object interaction reasoning by using reconstructed hands as visual prompts. Code and data: https://github.com/Nicous20/EgoHandICL

中文标题/摘要

标题：EgoHandICL：基于上下文学习的自视点三维手部重建

自视点视角下的稳健三维手部重建具有挑战性，由于深度模糊、自遮挡以及复杂的手部-物体交互。先前的方法通过扩大训练数据或添加辅助提示来缓解这些问题，但它们在未见过的场景中往往表现不佳。我们提出了EgoHandICL，这是首个用于三维手部重建的上下文学习（ICL）框架，能够提高语义对齐、视觉一致性以及在自视点挑战条件下的鲁棒性。EgoHandICL引入了由视觉语言模型（VLM）引导的补充示例检索、针对多模态上下文的ICL定制分词器以及基于掩码自编码器（MAE）的架构，该架构通过手部引导的几何和感知目标进行训练。在ARCTIC和EgoExo4D上的实验显示，EgoHandICL在最先进的方法上具有持续的改进。我们还展示了其实用场景下的泛化能力，并通过使用重建的手部作为视觉提示来改进EgoVLM对手部-物体交互的推理。

Summary / 总结

EgoHandICL is an in-context learning framework for 3D hand reconstruction in egocentric vision, addressing depth ambiguity, self-occlusion, and hand-object interactions. It uses complementary exemplar retrieval guided by vision-language models, an ICL-tailored tokenizer, and a masked autoencoder architecture. Experiments on ARCTIC and EgoExo4D show consistent improvements over existing methods. Real-world generalization and enhanced hand-object interaction reasoning in EgoVLM are also demonstrated.

EgoHandICL通过引入一种上下文学习框架来解决在第一人称视角下3D手部重建的挑战，该框架增强了语义对齐和视觉一致性。它使用由视觉语言模型引导的互补示例检索、一种针对多模态上下文的定制化分词器以及基于掩码自编码器的架构。在ARCTIC和EgoExo4D上的实验显示，该方法在现有方法上的一致改进。还展示了其实用场景下的泛化能力和对手物交互推理的改进。相关代码和数据可在提供的GitHub链接中获取。

Identifying and Transferring Reasoning-Critical Neurons: Improving LLM Inference Reliability via Activation Steering

Authors: Fangan Dong, Zuming Yan, Xuri Ge, Zhiwei Xu, Mengqi Zhang, Xuanang Chen, Ben He, Xin Xin, Zhumin Chen, Ying Zhou

First: 2026-01-27T17:53:01+00:00 · Latest: 2026-01-27T17:53:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Despite the strong reasoning capabilities of recent large language models (LLMs), achieving reliable performance on challenging tasks often requires post-training or computationally expensive sampling strategies, limiting their practical efficiency. In this work, we first show that a small subset of neurons in LLMs exhibits strong predictive correlations with reasoning correctness. Based on this observation, we propose AdaRAS (Adaptive Reasoning Activation Steering), a lightweight test-time framework that improves reasoning reliability by selectively intervening on neuron activations. AdaRAS identifies Reasoning-Critical Neurons (RCNs) via a polarity-aware mean-difference criterion and adaptively steers their activations during inference, enhancing incorrect reasoning traces while avoiding degradation on already-correct cases. Experiments on 10 mathematics and coding benchmarks demonstrate consistent improvements, including over 13% gains on AIME-24 and AIME-25. Moreover, AdaRAS exhibits strong transferability across datasets and scalability to stronger models, outperforming post-training methods without additional training or sampling cost.

中文标题/摘要

标题：识别和转移推理关键神经元：通过激活导向提高大语言模型推理可靠性

尽管最近的大语言模型（LLMs）具有强大的推理能力，但在实现可靠性能时，通常需要后训练或计算成本高昂的采样策略，这限制了它们的实际效率。在本文中，我们首先展示了LLMs中一小部分神经元与推理正确性之间存在强烈的预测相关性。基于这一观察，我们提出了AdaRAS（自适应推理激活导向），这是一种轻量级的测试时框架，通过选择性干预神经元激活来提高推理可靠性。AdaRAS通过极性感知的均值差标准识别推理关键神经元（RCNs），并在推理过程中适应性地引导其激活，增强错误推理痕迹，同时避免对已经正确的案例造成退化。在10个数学和编程基准上的实验表明，该方法具有一致的改进，包括在AIME-24和AIME-25上超过13%的提升。此外，AdaRAS在不同数据集之间表现出强大的可转移性，并且可以扩展到更强的模型，优于无需额外训练或采样成本的后训练方法。

Summary / 总结

This work addresses the challenge of achieving reliable performance in large language models (LLMs) on reasoning tasks, which often require post-training or expensive sampling strategies. The authors identify a subset of neurons that are strongly correlated with reasoning correctness and propose AdaRAS, a lightweight framework that selectively intervenes on these neurons during inference to improve reasoning reliability. Experiments show consistent improvements across various benchmarks, with over 13% gains on AIME-24 and AIME-25, and strong transferability and scalability to stronger models.

该研究旨在解决大型语言模型（LLMs）在推理任务中实现可靠性能的挑战，通常需要后训练或昂贵的采样策略。作者发现了一部分与推理正确性高度相关的神经元，并提出了一种轻量级框架AdaRAS，在推理过程中选择性地干预这些神经元以提高推理可靠性。实验结果显示在各种基准测试中的一致改进，包括在AIME-24和AIME-25上超过13%的提升，并且具有强大的跨数据集转移性和对更强模型的可扩展性。
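The idea of scoring neurons by a mean difference and then steering them can be sketched as follows. This is a simplified illustration under stated assumptions, not the AdaRAS implementation: the scoring rule, the top-k selection, the fixed step size `alpha`, and all function names are hypothetical, and the adaptive part of the method is not modelled.

```python
from statistics import mean

def mean_difference_scores(correct, incorrect):
    """Per-neuron mean activation on correct traces minus mean on incorrect ones."""
    n_neurons = len(correct[0])
    return [
        mean(row[i] for row in correct) - mean(row[i] for row in incorrect)
        for i in range(n_neurons)
    ]

def select_neurons(scores, k):
    """Keep the k neurons with the largest |score|, remembering each polarity."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return {i: (1.0 if scores[i] > 0 else -1.0) for i in ranked[:k]}

def steer(activations, neurons, alpha):
    """Shift the selected neurons by alpha in the direction of their polarity."""
    steered = list(activations)
    for i, polarity in neurons.items():
        steered[i] += alpha * polarity
    return steered

# Toy activations: rows are reasoning traces, columns are neurons.
correct = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.5]]
incorrect = [[0.2, 0.9, 0.5], [0.0, 1.1, 0.5]]

scores = mean_difference_scores(correct, incorrect)
neurons = select_neurons(scores, k=2)
print(neurons)
print(steer([0.3, 0.3, 0.3], neurons, alpha=0.5))
```

Neuron 2 has the same mean in both groups, so it is not selected; neurons 0 and 1 are pushed in opposite directions because their polarities differ.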

HARMONI: Multimodal Personalization of Multi-User Human-Robot Interactions with LLMs

Authors: Jeanne Malécot, Hamed Rahimi, Jeanne Cattoni, Marie Samson, Mouad Abrini, Mahdi Khoramshahi, Maribel Pino, Mohamed Chetouani

First: 2026-01-27T17:45:04+00:00 · Latest: 2026-01-27T17:45:04+00:00

Abs · PDF · Code1 · Code2

Abstract

Existing human-robot interaction systems often lack mechanisms for sustained personalization and dynamic adaptation in multi-user environments, limiting their effectiveness in real-world deployments. We present HARMONI, a multimodal personalization framework that leverages large language models to enable socially assistive robots to manage long-term multi-user interactions. The framework integrates four key modules: (i) a perception module that identifies active speakers and extracts multimodal input; (ii) a world modeling module that maintains representations of the environment and short-term conversational context; (iii) a user modeling module that updates long-term speaker-specific profiles; and (iv) a generation module that produces contextually grounded and ethically informed responses. Through extensive evaluation and ablation studies on four datasets, as well as a real-world scenario-driven user study in a nursing home environment, we demonstrate that HARMONI supports robust speaker identification, online memory updating, and ethically aligned personalization, outperforming baseline LLM-driven approaches in user modeling accuracy, personalization quality, and user satisfaction.

中文标题/摘要

标题：HARMONI：利用大语言模型实现多用户人机交互的多模态个性化

现有的人机交互系统往往缺乏在多用户环境中持续个性化和动态适应的机制，限制了其在实际部署中的有效性。我们提出了HARMONI，一种利用大语言模型的多模态个性化框架，使社会辅助机器人能够管理长期的多用户交互。该框架整合了四个关键模块：(i) 感知模块，识别活跃说话者并提取多模态输入；(ii) 世界建模模块，维护环境和短期对话上下文的表示；(iii) 用户建模模块，更新长期特定说话者的个人资料；以及(iv) 生成模块，生成上下文相关且伦理导向的响应。通过在四个数据集上的广泛评估和消融研究，以及在养老院环境中基于实际场景的用户研究，我们证明HARMONI支持稳健的说话者识别、在线记忆更新和伦理导向的个性化，其用户建模准确性、个性化质量和用户满意度均优于基线的大语言模型驱动方法。

Summary / 总结

HARMONI is a multimodal personalization framework for human-robot interactions that uses large language models to enable socially assistive robots to handle long-term multi-user interactions. It includes modules for perception, world modeling, user modeling, and response generation. Extensive evaluations and user studies show that HARMONI enhances speaker identification, memory updating, and personalization quality, surpassing baseline approaches in user modeling accuracy and satisfaction.

HARMONI 是一个用于人类-机器人交互的多模态个性化框架，利用大型语言模型来管理长期的多用户交互。它包括感知、世界建模、用户建模和生成模块。广泛的评估表明，HARMONI 提高了说话人的识别、在线记忆更新和伦理导向的个性化，其用户建模准确性、个性化质量和用户满意度均优于基线方法。

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Authors: Jinyeop Song, Song Wang, Julian Shun, Yada Zhu

First: 2025-09-30T15:14:24+00:00 · Latest: 2026-01-27T17:44:43+00:00

Comments: Wrong numbers are reported for main results

Abs · PDF · Code1 · Code2 · Code3

Abstract

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g., planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at https://github.com/Jinyeop3110/KG-R1.

中文标题/摘要

标题：通过强化学习实现高效且可迁移的代理知识图谱RAG

知识图谱检索增强生成（KG-RAG）将大型语言模型（LLMs）与结构化、可验证的知识图谱（KGs）结合，以减少幻觉并暴露推理痕迹。然而，许多KG-RAG系统组合了多个LLM模块（如规划、推理和响应），增加了推理成本并将其行为绑定到特定的目标KG。为了解决这个问题，我们引入了KG-R1，这是一种通过强化学习（RL）实现的代理KG检索增强生成（KG-RAG）框架。KG-R1 使用一个代理与KGs 交互作为其环境，在每一步中学习检索并将其检索的信息融入其推理和生成中。该过程通过端到端的RL进行优化。在知识图谱问答（KGQA）基准测试中的受控实验中，我们的方法展示了高效性和可迁移性：使用Qwen-2.5-3B，KG-R1 以比使用更大基础模型或微调模型的多模块工作流程方法更少的生成标记提高了答案准确性。此外，KG-R1 具有即插即用功能：在训练后，它在新的KG上保持了强大的准确性而无需修改。这些特性使KG-R1 成为实际部署中具有前景的KG-RAG框架。我们的代码可在 https://github.com/Jinyeop3110/KG-R1 公开获取。

Summary / 总结

The research aims to address the inefficiency and lack of transferability in Knowledge-Graph retrieval-augmented generation (KG-RAG) systems by introducing KG-R1, which uses reinforcement learning to enable a single agent to interact with knowledge graphs and incorporate retrieved information into reasoning and generation. The method shows improved answer accuracy with fewer generation tokens compared to multi-module approaches using larger models, and it is transferable to new knowledge graphs without modification, making it a promising framework for real-world deployment.

KG-R1 是一个使用强化学习与知识图谱交互的高效且可移植的 KG-RAG 框架，相比多模块方法，它以更少的生成令牌提高了答案准确性，并且在 KGQA 基准测试中表现出色，可以轻松适应新的知识图谱而不需重新训练，适用于实际部署。

PROPHET: An Inferable Future Forecasting Benchmark with Causal Intervened Likelihood Estimation

Authors: Zhengwei Tao, Pu Wu, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Chengfeng Dou, Xiancai Chen, Jia Li, Linyu Li, Chongyang Tao, Wentao Zhang

First: 2025-04-02T08:57:42+00:00 · Latest: 2026-01-27T17:41:45+00:00

Abs · PDF · Code1 · Code2

Abstract

Predicting future events based on news on the Web stands as one of the ultimate aspirations of artificial intelligence. Recent advances in large language model (LLM)-based systems have shown remarkable potential in forecasting future events, thereby garnering significant interest in the research community. Currently, several benchmarks have been established to evaluate forecasting capabilities by formalizing event prediction as a retrieval-augmented generation (RAG)-and-reasoning task. In these benchmarks, each prediction question is answered with relevant retrieved news articles downloaded from the Web. However, because there is no consideration of whether the questions can be supported by valid or sufficient rationales, some of the questions in these benchmarks may be inherently noninferable. To address this issue, we introduce a new benchmark, PROPHET, which comprises inferable forecasting questions paired with relevant news for retrieval. To ensure the inferability of the benchmark, we propose Causal Intervened Likelihood (CIL), a statistical measure that assesses inferability through causal inference. In constructing this benchmark, we first collected recent trend forecasting questions and then filtered the data using CIL, resulting in an inferable benchmark for future forecasting. Through extensive experiments, we first demonstrate the validity of CIL and conduct in-depth investigations into future forecasting with the aid of CIL. Subsequently, we evaluate several representative prediction methods on PROPHET. The overall results yield valuable insights for future directions of this task.

中文标题/摘要

标题：PROPHET：基于因果干预似然估计的可推断未来预测基准

基于网络新闻预测未来事件一直是人工智能的终极目标之一。基于大型语言模型（LLM）的系统在预测未来事件方面表现出显著潜力，因此在研究界引起了广泛关注。目前，已经建立了一些基准来评估预测能力，将事件预测形式化为检索增强生成（RAG）和推理任务。在这些基准中，每个预测问题都使用从网络下载的相关新闻文章来回答。然而，由于没有考虑问题是否可以由有效的或足够的支持理由支持，这些基准中的某些问题可能是不可推断的。为了解决这一问题，我们引入了一个新的基准PROPHET，它包括与相关新闻配对的可推断预测问题。为了确保基准的可推断性，我们提出了因果干预似然（CIL），这是一种通过因果推理评估可推断性的统计度量。在构建此基准时，我们首先收集了最近的趋势预测问题，然后使用CIL过滤数据，从而得到一个用于未来预测的可推断基准。通过广泛的实验，我们首先证明了CIL的有效性，并深入探讨了使用CIL辅助的未来预测。随后，我们在PROPHET上评估了几种代表性预测方法。总体结果为未来方向的任务提供了宝贵的见解。

Summary / 总结

The research aims to improve the ability of artificial intelligence to predict future events based on web news. It introduces a new benchmark called PROPHET, which uses causal intervened likelihood (CIL) to ensure that the questions can be supported by valid rationales. The study demonstrates the effectiveness of CIL and evaluates several prediction methods on this benchmark, providing valuable insights for future research.

研究旨在提高人工智能根据网络新闻预测未来事件的能力。它引入了一个名为PROPHET的新基准，使用因果干预似然（CIL）来确保问题可以由有效的支持论据支持。通过广泛的实验，研究证明了CIL的有效性，并在PROPHET上评估了几种预测方法，为未来的研究方向提供了有价值的见解。

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Authors: Jialong Wu, Xiaoying Zhang, Hongyi Yuan, Xiangcheng Zhang, Tianhao Huang, Changjing He, Chaoyi Deng, Renrui Zhang, Youbin Wu, Mingsheng Long

First: 2026-01-27T17:40:07+00:00 · Latest: 2026-01-27T17:40:07+00:00

Comments: Project page: https://thuml.github.io/Reasoning-Visual-World

Abs · PDF · Code1 · Code2 · Project1

Abstract

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.

中文标题/摘要

标题：视觉生成解锁多模态世界模型中的类人推理

人类构建内部世界模型并通过操作这些模型中的概念进行推理。近年来，特别是在链式思考（CoT）推理方面取得的AI进展，近似了人类的认知能力，其中认为世界模型嵌入在大型语言模型中。当前系统在数学和编程等正式和抽象领域中达到了专家级表现，主要依赖于语言推理。然而，在物理和空间智能等领域，它们仍然远远落后于人类，这些领域需要更丰富的表示和先验知识。因此，能够同时进行语言和视觉生成的统一多模态模型（UMMs）的出现引发了对基于互补多模态路径的更类人推理的兴趣，尽管其优势尚不明确。从世界模型的角度来看，本文首次系统研究了视觉生成何时以及如何促进推理。我们的核心观点是视觉优越性假设：对于某些任务——特别是那些基于物理世界的任务——视觉生成更自然地充当世界模型，而纯粹的语言世界模型则会遇到由于表示限制或缺乏先验知识而产生的瓶颈。理论上，我们将内部世界建模作为CoT推理的核心组成部分进行形式化，并分析不同形式世界模型之间的区别。实验上，我们确定了需要交错视觉-语言CoT推理的任务，构建了一个新的评估套件VisWorld-Eval。在最先进的UMM上的受控实验表明，交错CoT在有利于视觉世界建模的任务中显著优于纯粹的语言CoT，但在其他情况下没有明显优势。综上所述，这项工作阐明了多模态世界建模在更强大、更类人的多模态AI中的潜力。

Summary / 总结

This paper explores how visual generation enhances reasoning in multimodal world models, addressing the limitations of purely verbal reasoning in physical and spatial tasks. The authors introduce the visual superiority hypothesis, which posits that visual generation is more effective for tasks grounded in the physical world. They develop a new evaluation suite, VisWorld-Eval, and demonstrate that interleaved visual-verbal chain-of-thought reasoning significantly outperforms purely verbal reasoning on tasks requiring visual world modeling, while offering no clear advantage otherwise.

该论文探讨了视觉生成如何增强多模态世界模型中的推理能力，解决了纯语言推理在物理和空间任务中的局限性。研究提出了视觉优越性假设，认为视觉生成更适合处理与物理世界相关的任务。通过开发新的评估套件VisWorld-Eval，作者证明了结合视觉和语言的链式思考推理在需要视觉世界建模的任务上优于纯语言推理，但在其他任务上没有明显优势。

When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Authors: Mahdi Astaraki, Mohammad Arshi Saloot, Ali Shiraee Kasmaee, Hamidreza Mahyar, Soheila Samiee

First: 2026-01-27T17:35:05+00:00 · Latest: 2026-01-27T17:35:05+00:00

Comments: 27 pages, 15 figures

Abs · PDF · Code1 · Code2

Abstract

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions requiring genuine retrieval and analyze behavior with diagnostics spanning retrieval coverage gaps, anchor-carry drop, query quality, composition fidelity, and control calibration. Across models, Iterative RAG consistently outperforms Gold Context, with gains up to 25.6 percentage points, especially for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures, mitigates context overload, and enables dynamic correction of early hypothesis drift, but remaining failure modes include incomplete hop coverage, distractor latch trajectories, early stopping miscalibration, and high composition failure rates even with perfect retrieval. Overall, staged retrieval is often more influential than the mere presence of ideal evidence; we provide practical guidance for deploying and diagnosing RAG systems in specialized scientific settings and a foundation for more reliable, controllable iterative retrieval-reasoning frameworks.

中文标题/摘要

标题：当迭代RAG超越理想证据时：科学多跳问答中的诊断研究

检索增强生成（RAG）将大型语言模型（LLMs）扩展到参数化知识之外，但尚不清楚何时迭代检索-推理循环在意义上优于静态RAG，特别是在具有多跳推理、稀疏领域知识和异构证据的科学领域。我们提供了第一个受控的、机制层面的诊断研究，探讨同步迭代检索和推理是否能超越理想化的静态上限（黄金上下文）RAG。我们以三个范式对十一个最先进的LLMs进行了基准测试：（i）无上下文，衡量对参数化记忆的依赖；（ii）黄金上下文，所有先验证据一次性提供；（iii）迭代RAG，一个无需训练的控制器，交替进行检索、假设细化和证据感知停止。使用化学重点的ChemKGMultiHopQA数据集，我们隔离了需要真正检索的问题，并通过检索覆盖率差距、锚点携带丢失、查询质量、组合保真度和控制校准等诊断分析了行为。在所有模型中，迭代RAG始终优于黄金上下文，增幅高达25.6个百分点，尤其是对于非推理微调模型。分阶段检索减少了晚期跳失败，缓解了上下文过载，并允许动态纠正早期假设漂移，但剩余的失败模式包括不完整的跳覆盖、干扰物锁定轨迹、早期停止校准不当以及即使在完美检索的情况下也有较高的组合失败率。总体而言，分阶段检索往往比理想证据的存在更具影响力；我们提供了在专门的科学环境中部署和诊断RAG系统的实用指导，并为更可靠、可控的迭代检索-推理框架奠定了基础。

Summary / 总结

This study investigates when iterative retrieval-reasoning loops in RAG outperform static RAG, especially in scientific domains requiring multi-hop reasoning. Using the ChemKGMultiHopQA dataset, the research compares eleven state-of-the-art LLMs under three regimes: no context, gold context, and iterative RAG. Iterative RAG consistently outperforms the gold context, with gains up to 25.6 percentage points, particularly for non-reasoning fine-tuned models. Staged retrieval reduces late-hop failures and context overload but faces challenges like incomplete hop coverage and high composition failure rates.

该研究探讨了在科学领域需要多跳推理时，迭代检索-推理循环何时能超越静态RAG。使用ChemKGMultiHopQA数据集，研究比较了十一个最先进的LLM在三种模式下的表现：无上下文、理想证据上下文和迭代RAG。迭代RAG在所有模型中都优于理想证据上下文，增幅最高可达25.6个百分点，尤其是对于非推理微调模型。阶段检索减少了晚期跳失败和上下文过载，但仍面临如不完整跳覆盖和高组合失败率等挑战。
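
To make the Iterative RAG regime more concrete, here is a minimal toy sketch of a controller that alternates retrieval and hypothesis refinement, and stops when evidence for a hop is missing. It is not the authors' implementation: the `FACTS` table, the `retrieve` stand-in, the `iterative_rag` function, and the pre-decomposed list of relations are all assumptions made for illustration. In the paper's setting an LLM would presumably generate the queries and decide when to stop.

```python
# Toy knowledge base mapping (entity, relation) -> entity.
# Hypothetical data for illustration only; not taken from ChemKGMultiHopQA.
FACTS = {
    ("aspirin", "synthesized_from"): "salicylic acid",
    ("salicylic acid", "derived_from"): "salicin",
    ("salicin", "found_in"): "willow bark",
}


def retrieve(entity, relation):
    """Stand-in retriever: return the matching fact, or None on a coverage gap."""
    return FACTS.get((entity, relation))


def iterative_rag(start, relations, max_steps=5):
    """Resolve a multi-hop question one hop at a time.

    Each step retrieves evidence for the current anchor entity, then
    refines the hypothesis by carrying the retrieved entity forward.
    """
    anchor = start
    evidence = []
    for relation in relations[:max_steps]:
        found = retrieve(anchor, relation)
        if found is None:
            # Evidence-aware stopping: do not guess past a missing hop.
            break
        evidence.append((anchor, relation, found))
        anchor = found  # hypothesis refinement
    complete = len(evidence) == len(relations)
    return (anchor if complete else None), evidence


answer, evidence = iterative_rag(
    "aspirin", ["synthesized_from", "derived_from", "found_in"]
)
partial_answer, partial_evidence = iterative_rag(
    "aspirin", ["synthesized_from", "unknown_relation"]
)
```

A Gold Context baseline would instead hand all three facts to the model at once. The sketch only shows the staged control flow, in which each hop's query depends on the entity recovered in the previous hop.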

APEX-Agents

Authors: Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski

First: 2026-01-20T18:53:44+00:00 · Latest: 2026-01-27T17:31:16+00:00

Abstract

We introduce the AI Productivity Index for Agents (APEX-Agents), a benchmark for assessing whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. APEX-Agents requires agents to navigate realistic work environments with files and tools. We test eight agents for the leaderboard using Pass@1. Gemini 3 Flash (Thinking=High) achieves the highest score of 24.0%, followed by GPT-5.2 (Thinking=High), Claude Opus 4.5 (Thinking=High), and Gemini 3 Pro (Thinking=High). We open source the APEX-Agents benchmark (n=480) with all prompts, rubrics, gold outputs, files, and metadata. We also open-source Archipelago, our infrastructure for agent execution and evaluation.

中文标题/摘要

标题：APEX-Agents

我们介绍了代理人工智能生产力指数（APEX-Agents），这是一个基准测试，用于评估AI代理是否能够执行由投资银行分析师、管理咨询顾问和公司律师创建的长期跨应用任务。APEX-Agents 要求代理在包含文件和工具的现实工作环境中导航。我们使用 Pass@1 测试了八种代理以确定排行榜。Gemini 3 Flash（思考=高）获得最高分为 24.0%，其次是 GPT-5.2（思考=高）、Claude Opus 4.5（思考=高）和 Gemini 3 Pro（思考=高）。我们开源了包含 480 个提示、评分标准、黄金输出、文件和元数据的 APEX-Agents 基准测试。我们还开源了我们的代理执行和评估基础设施 Archipelago。

Summary / 总结

The research introduces APEX-Agents, a benchmark for evaluating whether AI agents can execute long-horizon, cross-application tasks created by investment banking analysts, management consultants, and corporate lawyers. The study tests eight agents using the Pass@1 metric, with Gemini 3 Flash achieving the highest score of 24.0%, followed by GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro. The benchmark (n=480) is open-sourced with all prompts, rubrics, gold outputs, files, and metadata, along with Archipelago, the infrastructure for agent execution and evaluation.

研究引入了APEX-Agents基准，用于评估AI代理执行投资银行分析师、管理咨询师和公司律师等跨应用长期任务的能力。研究使用Pass@1指标测试了八个代理，Gemini 3 Flash获得最高分24.0%，其次是GPT-5.2、Claude Opus 4.5和Gemini 3 Pro。基准包含480个任务，所有必要材料均已开源，同时开源了用于代理执行和评估的基础设施Archipelago。
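
As a rough illustration of the Pass@1 metric mentioned above, the sketch below computes the fraction of tasks passed on a single attempt. It assumes one recorded pass/fail outcome per task. The function name `pass_at_1` and the toy outcomes are illustrative and do not come from the APEX-Agents release, which may aggregate attempts differently.

```python
def pass_at_1(outcomes):
    """Return the fraction of tasks whose single attempt passed.

    `outcomes` holds one boolean per task (True = the attempt passed).
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


# Toy example: five hypothetical tasks, two of which pass.
outcomes = [True, False, False, True, False]
score = pass_at_1(outcomes)
print(f"Pass@1 = {score:.1%}")
```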

Routing End User Queries to Enterprise Databases

Authors: Saikrishna Sudarshan, Tanay Kulkarni, Manasi Patwardhan, Lovekesh Vig, Ashwin Srinivasan, Tanmay Tulsidas Verlekar

First: 2026-01-27T17:30:19+00:00 · Latest: 2026-01-27T17:30:19+00:00

Comments: 6 pages, 2 figures

Abstract

We address the task of routing natural language queries in multi-database enterprise environments. We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries, motivating the need for more structured and robust reasoning-based solutions. By explicitly modelling schema coverage, structural connectivity, and fine-grained semantic alignment, the proposed modular, reasoning-driven reranking strategy consistently outperforms embedding-only and direct LLM-prompting baselines across all the metrics.

中文标题/摘要

标题：将用户查询路由至企业数据库

我们解决了在多数据库企业环境中路由自然语言查询的任务。我们通过扩展现有的NL-to-SQL数据集构建了现实基准。研究表明，随着数据库仓库规模的增大和领域重叠以及查询的模糊性，路由变得越来越具有挑战性，这促使需要更结构化和稳健的基于推理的解决方案。通过明确建模模式覆盖、结构连接性和细粒度语义对齐，所提出的模块化、基于推理的重排序策略在所有指标上都优于仅基于嵌入和直接LLM提示的基线。

Summary / 总结

The paper addresses the challenge of routing natural language queries to the appropriate database in multi-database enterprise environments. It constructs realistic benchmarks by extending existing NL-to-SQL datasets and demonstrates that routing becomes more difficult with larger, domain-overlapping database repositories and ambiguous queries. The authors propose a modular, reasoning-driven reranking strategy that models schema coverage, structural connectivity, and fine-grained semantic alignment, and that outperforms embedding-only and direct LLM-prompting baselines across all metrics.

该研究解决了将自然语言查询路由到企业数据库中的挑战。通过扩展现有数据集构建了现实基准，并表明随着数据库规模的增大和领域重叠以及查询的模糊性，路由变得越来越困难。研究提出了一种模块化、基于推理的重排序策略，该策略模型了模式覆盖、结构连接性和语义对齐，并在各种指标上优于仅基于嵌入和直接LLM提示的方法。

An Interpretable Recommendation Model for Psychometric Data, With an Application to Gerontological Primary Care

Authors: Andre Paulino de Lima, Paula Castro, Suzana Carvalho Vaz de Andrade, Rosa Maria Marcucci, Ruth Caldeira de Melo, Marcelo Garcia Manzato

First: 2026-01-27T17:29:21+00:00 · Latest: 2026-01-27T17:29:21+00:00

Comments: 81 pages, 19 figures, 3 annexes

Abstract

There are challenges that must be overcome to make recommender systems useful in healthcare settings. The reasons are varied: the lack of publicly available clinical data, the difficulty that users may have in understanding the reasons why a recommendation was made, the risks that may be involved in following that recommendation, and the uncertainty about its effectiveness. In this work, we address these challenges with a recommendation model that leverages the structure of psychometric data to provide visual explanations that are faithful to the model and interpretable by care professionals. We focus on a narrow healthcare niche, gerontological primary care, to show that the proposed recommendation model can assist the attending professional in the creation of personalised care plans. We report results of a comparative offline performance evaluation of the proposed model on healthcare datasets that were collected by research partners in Brazil, as well as the results of a user study that evaluates the interpretability of the visual explanations the model generates. The results suggest that the proposed model can advance the application of recommender systems in this healthcare niche, which is expected to grow in demand, opportunities, and information technology needs as demographic changes become more pronounced.

中文标题/摘要

标题：一种用于心理测量数据的可解释推荐模型及其在老年医学初级护理中的应用

在医疗保健环境中使推荐系统有用存在诸多挑战。原因多样：缺乏公开的临床数据、用户可能难以理解推荐的原因、遵循推荐可能涉及的风险以及对其有效性的不确定性。在本文中，我们通过利用心理测量数据的结构来提供忠实于模型且可由护理专业人员解释的可视化解释，来应对这些挑战。我们专注于老年医学初级护理这一医疗保健细分领域，以展示所提出的推荐模型如何帮助主治专业人员制定个性化护理计划。我们报告了在巴西研究合作伙伴收集的医疗保健数据集上对所提模型进行的比较离线性能评估结果，以及对模型生成的可视化解释的可解释性进行的用户研究结果。结果表明，所提出的模型可以推动该医疗保健细分领域中推荐系统的应用，随着人口结构变化的加剧，该领域预计会增加需求、机会和信息技术需求。

Summary / 总结

This study addresses challenges in using recommender systems in healthcare, particularly in gerontological primary care, by developing an interpretable recommendation model that provides visual explanations. The model leverages psychometric data structure to ensure explanations are faithful to the model and understandable by care professionals. Experimental results from offline performance evaluations and user studies indicate that the model can assist in creating personalized care plans and improve the application of recommender systems in this healthcare niche, which is expected to grow due to demographic changes.

该研究通过开发一个可解释的推荐模型来解决在医疗保健领域，尤其是老年初级护理中使用推荐系统的挑战，该模型利用心理测量数据结构提供视觉解释，确保解释忠实于模型并可由护理专业人员理解。实验结果来自离线性能评估和用户研究显示，该模型可以协助制定个性化护理计划，并改善推荐系统在这一医疗保健领域的应用，预计随着人口结构变化的加剧，这一领域的需求、机会和信息技术需求将增长。

Assessing the Effectiveness of Deep Embeddings for Tree Species Classification in the Dutch Forest Inventory

Authors: Takayuki Ishikawa, Carmelo Bonannella, Bas J. W. Lerink, Marc Rußwurm

First: 2025-08-26T09:06:14+00:00 · Latest: 2026-01-27T17:25:21+00:00

Abstract

National Forest Inventories (NFIs) serve as the primary source of forest information; however, maintaining these inventories requires labor-intensive on-site campaigns by forestry experts to identify and document tree species. Embeddings from deep pre-trained remote sensing models offer new opportunities to update NFIs more frequently and at larger scales. While training new deep learning models on few data points remains challenging, we show that using pre-computed embeddings can prove effective for distinguishing tree species through seasonal canopy reflectance patterns in combination with Random Forest. This work systematically investigates how deep embeddings improve tree species classification accuracy in the Netherlands with few annotated data. We evaluate this question on three embedding models: Presto, Alpha Earth, and Tessera, using three tree species datasets of varying difficulty. Data-wise, we compare the available embeddings from Alpha Earth and Tessera with dynamically calculated embeddings from a pre-trained Presto model. Our results demonstrate that fine-tuning a publicly available remote sensing time series pre-trained model outperforms the current state-of-the-art in NFI classification in the Netherlands, yielding performance gains of approximately 2-9 percentage points across datasets and evaluation metrics. This indicates that classic hand-defined features are too simple for this task and highlights the potential of using deep embeddings for data-limited applications such as NFI classification. By leveraging openly available satellite data and deep embeddings from pre-trained models, this approach significantly improves classification accuracy compared to traditional methods and can effectively complement existing forest inventory processes.

中文标题/摘要

标题：评估深度嵌入在荷兰森林资源清查中树种分类有效性

国家森林清查是主要的森林信息来源，然而，维护这些清查需要林业专家进行劳动密集型的现场活动来识别和记录树种。来自深度预训练遥感模型的嵌入为更频繁和更大规模地更新NFIs提供了新机会。虽然在少量数据点上训练新的深度学习模型仍然具有挑战性，但我们展示了使用预计算嵌入通过季节性冠层反射模式与随机森林结合区分树种的有效性。本研究系统地调查了在荷兰使用深度嵌入如何提高树种分类准确性，特别是在少量标注数据的情况下。我们使用三种嵌入模型：Presto、Alpha Earth和Tessera，以及三种不同难度的树种数据集来评估这个问题。数据上，我们将Alpha Earth和Tessera提供的可用嵌入与预训练Presto模型动态计算的嵌入进行了比较。我们的结果表明，微调一个公开可用的遥感时间序列预训练模型在荷兰的NFI分类中优于当前最先进的技术，各数据集和评估指标的性能提升约为2-9个百分点。这表明经典的手动定义特征过于简单，突显了在数据受限的应用如NFI分类中使用深度嵌入的潜力。通过利用公开可用的卫星数据和预训练模型的深度嵌入，这种方法与传统方法相比显著提高了分类准确性，并且可以有效补充现有的森林清查过程。

Summary / 总结

This study evaluates the effectiveness of deep embeddings for tree species classification in the Dutch Forest Inventory, aiming to reduce the labor-intensive on-site identification process. Comparing three embedding models (Presto, Alpha Earth, and Tessera) on three tree species datasets, the research demonstrates that fine-tuning a publicly available pre-trained remote sensing time-series model outperforms the current state of the art in Dutch NFI classification, with performance gains of approximately 2-9 percentage points across datasets and evaluation metrics. This indicates the potential of deep embeddings for improving classification accuracy in data-limited scenarios.

该研究评估了深度嵌入在荷兰森林库存中树木物种分类的有效性，使用预训练的遥感模型。通过微调一个公开可用的Presto模型并与Alpha Earth和Tessera嵌入进行比较，研究显示在不同数据集和指标上的性能提升约为2-9个百分点。这表明深度嵌入在数据有限的情况下可以显著提高分类准确性，相比传统方法具有明显优势。

Query-Guided Spatial-Temporal-Frequency Interaction for Music Audio-Visual Question Answering

Authors: Kun Li, Michael Ying Yang, Sami Sebastian Brandt

First: 2026-01-27T17:24:32+00:00 · Latest: 2026-01-27T17:24:32+00:00

Abstract

Audio-Visual Question Answering (AVQA) is a challenging multimodal task that requires jointly reasoning over audio, visual, and textual information in a given video to answer natural language questions. Inspired by recent advances in Video QA, many existing AVQA approaches primarily focus on visual information processing, leveraging pre-trained models to extract object-level and motion-level representations. However, in those methods, the audio input is primarily treated as complementary to video analysis, and the textual question information contributes minimally to audio-visual understanding, as it is typically integrated only in the final stages of reasoning. To address these limitations, we propose a novel Query-guided Spatial-Temporal-Frequency (QSTar) interaction method, which effectively incorporates question-guided clues and exploits the distinctive frequency-domain characteristics of audio signals, alongside spatial and temporal perception, to enhance audio-visual understanding. Furthermore, we introduce a Query Context Reasoning (QCR) block inspired by prompting, which guides the model to focus more precisely on semantically relevant audio and visual features. Extensive experiments conducted on several AVQA benchmarks demonstrate the effectiveness of our proposed method, achieving significant performance improvements over existing Audio QA, Visual QA, Video QA, and AVQA approaches. The code and pretrained models will be released after publication.

中文标题/摘要

标题：基于查询的空间-时间-频率交互技术在音乐音频-视觉问答中的应用

音频-视觉问答（AVQA）是一个具有挑战性的多模态任务，需要在给定的视频中联合推理音频、视觉和文本信息以回答自然语言问题。受视频问答（Video QA）近期进展的启发，许多现有的AVQA方法主要集中在视觉信息处理上，利用预训练模型提取对象级和运动级表示。然而，在这些方法中，音频输入主要被视为视频分析的补充，文本问题信息对音频-视觉理解的贡献很少，通常仅在推理的最后阶段进行整合。为了解决这些局限性，我们提出了一种新颖的查询导向的空间-时间-频率（QSTar）交互方法，该方法有效地结合了问题导向的线索，并利用音频信号的独特频域特征，以及空间和时间感知，以增强音频-视觉理解。此外，我们引入了一个灵感来源于提示的查询上下文推理（QCR）模块，该模块引导模型更精确地关注语义相关的音频和视觉特征。在多个AVQA基准上的广泛实验表明，我们提出的方法具有显著的效果，相对于现有的音频问答（Audio QA）、视觉问答（Visual QA）、视频问答（Video QA）和AVQA方法，实现了显著的性能提升。代码和预训练模型将在发表后发布。

Summary / 总结

The paper addresses the limitations of existing AVQA methods that primarily focus on visual information and underutilize audio and textual inputs. It proposes a QSTar method that incorporates question-guided clues and utilizes frequency-domain characteristics of audio signals to enhance audio-visual understanding. The QCR block guides the model to focus on semantically relevant features. Experiments on AVQA benchmarks show significant performance improvements over existing methods.

论文针对现有AVQA方法主要侧重于视觉信息而忽视了音频和文本信息的问题，提出了一种QSTar方法，该方法结合了问题引导的线索，并利用音频信号的频域特征来增强音频-视觉理解。QCR模块引导模型更精确地关注语义相关的特征。实验结果显示，在多个AVQA基准上的性能显著提升。

Predicting Startup Success Using Large Language Models: A Novel In-Context Learning Approach

Authors: Abdurahman Maarouf, Alket Bakiaj, Stefan Feuerriegel

First: 2026-01-23T09:08:52+00:00 · Latest: 2026-01-27T17:16:47+00:00

Abstract

Venture capital (VC) investments in early-stage startups that end up being successful can yield high returns. However, predicting early-stage startup success remains challenging due to data scarcity (e.g., many VC firms have information about only a few dozen early-stage startups and whether they were successful). This limits the effectiveness of traditional machine learning methods that rely on large labeled datasets for model training. To address this challenge, we propose an in-context learning framework for startup success prediction using large language models (LLMs) that requires no model training and leverages only a small set of labeled startups as demonstration examples. Specifically, we propose a novel k-nearest-neighbor-based in-context learning framework, called kNN-ICL, which selects the most relevant past startups as examples based on similarity. Using real-world profiles from Crunchbase, we find that the kNN-ICL approach achieves higher prediction accuracy than supervised machine learning baselines and vanilla in-context learning. Further, we study how performance varies with the number of in-context examples and find that a high balanced accuracy can be achieved with as few as 50 examples. Together, we demonstrate that in-context learning can serve as a decision-making tool for VC firms operating in data-scarce environments.

中文标题/摘要

标题：使用大型语言模型预测创业公司成功：一种新颖的上下文学习方法

早期创业公司获得成功并获得风险资本（VC）投资可以带来高回报。然而，由于数据稀缺（例如，许多VC公司仅有关于少数几十家早期创业公司及其是否成功的信息），预测早期创业公司成功仍然具有挑战性。这限制了依赖大量标记数据集的传统机器学习方法的有效性。为了解决这一挑战，我们提出了一种使用大型语言模型（LLMs）进行创业公司成功预测的上下文学习框架，该框架无需进行模型训练，并仅利用少量标记的创业公司作为示例。具体而言，我们提出了一种基于k近邻的上下文学习框架，称为kNN-ICL，该框架根据相似性选择最相关的过去创业公司作为示例。使用Crunchbase中的真实世界资料，我们发现kNN-ICL方法的预测准确性高于监督机器学习基线和纯上下文学习。进一步地，我们研究了上下文示例数量对性能的影响，并发现即使只有50个示例，也可以实现高平衡准确率。总之，我们证明了上下文学习可以作为数据稀缺环境中VC公司决策工具。

Summary / 总结

The paper addresses the challenge of predicting early-stage startup success, which is difficult due to data scarcity. It proposes a kNN-ICL framework using large language models for in-context learning, requiring no training and using only a small set of labeled startups as demonstration examples. The study shows that kNN-ICL outperforms supervised machine learning baselines and vanilla in-context learning, achieving high balanced accuracy with as few as 50 examples.

论文针对由于数据稀缺而难以预测早期初创公司成功的问题，提出了一种基于k最近邻的上下文学习框架（kNN-ICL），使用大型语言模型（LLMs），无需训练，并使用少量标记的初创公司作为示例。研究发现，kNN-ICL在预测准确性上优于监督机器学习基线和纯上下文学习，使用50个示例即可实现较高的平衡准确率。
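
The idea of selecting demonstrations by similarity can be sketched as follows. This is a minimal illustration, not the authors' kNN-ICL code: it assumes each startup profile has already been turned into a numeric vector, uses cosine similarity as the similarity measure (the abstract does not say which one is used), and stops at building the prompt without calling any LLM. The helper names `select_demonstrations` and `build_prompt`, the prompt wording, and the toy data are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_demonstrations(query_vec, labeled, k):
    """Return the k labeled examples most similar to the query vector."""
    ranked = sorted(
        labeled, key=lambda ex: cosine(query_vec, ex["vec"]), reverse=True
    )
    return ranked[:k]


def build_prompt(query_text, demos):
    """Format the selected examples, then the unlabeled query, as one prompt."""
    parts = [
        f"Startup: {ex['text']}\nSuccessful: {ex['label']}\n" for ex in demos
    ]
    parts.append(f"Startup: {query_text}\nSuccessful:")
    return "\n".join(parts)


# Toy labeled pool with made-up vectors and labels (assumptions, not real data).
labeled = [
    {"text": "B2B payments API", "vec": [0.9, 0.1, 0.0], "label": "yes"},
    {"text": "Consumer pet-photo app", "vec": [0.0, 0.2, 0.9], "label": "no"},
    {"text": "Invoice automation SaaS", "vec": [0.8, 0.3, 0.1], "label": "yes"},
]

demos = select_demonstrations([1.0, 0.2, 0.0], labeled, k=2)
prompt = build_prompt("Payroll platform for small firms", demos)
print(prompt)
```

Vanilla in-context learning would pass a fixed or random set of examples instead. The only change here is that the demonstrations are re-ranked for each query before the prompt is assembled.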
