Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Authors: Haoyi Jiang, Liu Liu, Xinjie Wang, Yonghao He, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang
First: 2026-02-24T18:37:34+00:00 · Latest: 2026-02-24T18:37:34+00:00
Abstract
While Vision-Language Models (VLMs) exhibit exceptional 2D visual understanding, their ability to comprehend and reason about 3D space--a cornerstone of spatial intelligence--remains superficial. Current methodologies attempt to bridge this domain gap either by relying on explicit 3D modalities or by augmenting VLMs with partial, view-conditioned geometric priors. However, such approaches hinder scalability and ultimately burden the language model with the ill-posed task of implicitly reconstructing holistic 3D geometry from sparse cues. In this paper, we argue that spatial intelligence can emerge inherently from 2D vision alone, rather than being imposed via explicit spatial instruction tuning. To this end, we introduce Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation directly from unposed multi-view images. Spa3R is built upon the proposed Predictive Spatial Field Modeling (PSFM) paradigm, where Spa3R learns to synthesize feature fields for arbitrary unseen views conditioned on a compact latent representation, thereby internalizing a holistic and coherent understanding of the underlying 3D scene. We further integrate the pre-trained Spa3R Encoder into existing VLMs via a lightweight adapter to form Spa3-VLM, effectively grounding language reasoning in a global spatial context. Experiments on the challenging VSI-Bench demonstrate that Spa3-VLM achieves state-of-the-art accuracy of 58.6% on 3D VQA, significantly outperforming prior methods. These results highlight PSFM as a scalable path toward advancing spatial intelligence. Code is available at https://github.com/hustvl/Spa3R.
中文标题/摘要
标题:Spa3R:三维视觉推理的预测空间场建模
尽管视觉语言模型(VLMs)在二维视觉理解方面表现出色,但它们在理解和推理三维空间方面的能力仍然有限,而三维空间理解是空间智能的核心。当前的方法试图通过依赖显式的三维模态或通过部分视图条件几何先验增强VLMs来弥合这一领域差距。然而,这些方法阻碍了可扩展性,并最终使语言模型承担从稀疏线索隐式重建完整三维几何结构的不明确任务。在本文中,我们主张空间智能可以从二维视觉中自然地涌现,而不是通过显式的空间指令调优强加。为此,我们提出了Spa3R,这是一种自监督框架,可以直接从未指定的多视角图像中学习统一的、视图不变的空间表示。Spa3R基于提出的预测空间场建模(PSFM)范式,其中Spa3R学习根据紧凑的潜在表示合成任意未见视角的特征场,从而内化对底层三维场景的整体和连贯理解。我们进一步通过轻量级适配器将预训练的Spa3R编码器集成到现有的VLMs中,形成Spa3-VLM,有效地将语言推理置于全局空间上下文中。在具有挑战性的VSI-Bench实验中,Spa3-VLM在3D VQA上的准确率达到58.6%,显著优于先前的方法。这些结果突显了PSFM作为推进空间智能的可扩展途径。代码可在https://github.com/hustvl/Spa3R获取。
Summary / 总结
This paper addresses the limited ability of Vision-Language Models (VLMs) to understand 3D space and introduces Spa3R, a self-supervised framework that learns a unified, view-invariant spatial representation from unposed multi-view images. Spa3R uses Predictive Spatial Field Modeling (PSFM) to synthesize feature fields for unseen views, which leads it to internalize a holistic understanding of the 3D scene. Integrating the pre-trained Spa3R Encoder into existing VLMs through a lightweight adapter yields Spa3-VLM, which achieves state-of-the-art accuracy of 58.6% on 3D VQA on VSI-Bench, outperforming prior methods.
论文提出了Spa3R,这是一种自监督框架,可以从多视角图像中学习统一的空间表示,无需显式的3D数据。Spa3R使用预测性空间场建模(PSFM)来合成未见过视角的特征场,从而实现对底层3D场景的整体和连贯理解。当将其集成到视觉语言模型(VLMs)中时,形成的Spa3-VLM在3D VQA上的准确率达到58.6%,显著优于先前的方法。
Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
Authors: Jianglin Lu, Simon Jenni, Kushal Kafle, Jing Shi, Handong Zhao, Yun Fu
First: 2026-02-24T18:20:57+00:00 · Latest: 2026-02-24T18:20:57+00:00
Abstract
Text-to-image retrieval is a fundamental task in vision-language learning, yet in real-world scenarios it is often challenged by short and underspecified user queries. Such queries are typically only one or two words long, rendering them semantically ambiguous, prone to collisions across diverse visual interpretations, and lacking explicit control over the quality of retrieved images. To address these issues, we propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms that capture fine-grained visual attributes such as pose, scene, and aesthetics. We introduce a general framework that conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, so that query enrichment is not only semantically meaningful but also quality-aware. The resulting system provides three key advantages: 1) flexibility: it is compatible with any pretrained vision-language models (VLMs) without modification; 2) transparency: enriched queries are explicitly interpretable by users; and 3) controllability: retrieval results can be steered toward user-preferred quality levels. Extensive experiments demonstrate that our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries. Our code is available at https://github.com/Jianglin954/QCQC.
中文标题/摘要
标题:透过文字看世界:利用语言模型控制视觉检索质量
文本到图像检索是视觉语言学习中的一个基本任务,但在现实场景中,由于用户查询简短且不具体,这一任务常常受到挑战。这类查询通常只有1到2个词,导致其语义模糊,容易在多种视觉解释中发生碰撞,并且缺乏对检索图像质量的明确控制。为了解决这些问题,我们提出了一种新的质量可控检索范式,该范式通过上下文细节丰富简短的查询,并结合图像质量的明确概念。我们的核心思想是利用生成语言模型作为查询扩展函数,将不明确的查询扩展为描述性形式,捕捉细微的视觉属性,如姿态、场景和美学。我们提出了一种通用框架,该框架根据相关性和美学评分模型离散的质量级别条件查询扩展,使得查询丰富不仅在语义上具有意义,而且具有质量意识。由此产生的系统提供了三个关键优势:1)灵活性,它与任何预训练的视觉语言模型(VLMs)兼容,无需修改;2)透明度,丰富后的查询可以由用户明确解释;3)可控性,使检索结果能够朝向用户偏好的质量水平进行调整。大量实验表明,我们提出的方法显著提高了检索结果,并提供了有效的质量控制,弥合了现代VLMs的表达能力和简短用户查询的不具体性之间的差距。我们的代码可在https://github.com/Jianglin954/QCQC/ 获取。
Summary / 总结
This paper addresses the challenge of text-to-image retrieval with short and ambiguous user queries by proposing a quality-controllable retrieval paradigm. It leverages a generative language model to extend underspecified queries into more descriptive forms, incorporating explicit notions of image quality. The framework conditions query completion on discretized quality levels, derived from relevance and aesthetic scoring models, ensuring that the enriched queries are both semantically meaningful and quality-aware. Experiments show that this approach significantly improves retrieval results and provides effective quality control. The method is compatible with any pretrained vision-language model without modification and lets users steer retrieval results toward preferred quality levels.
该论文通过提出一种质量可控的检索范式,解决了使用短且含糊的用户查询进行文本到图像检索的挑战。它利用生成语言模型将不明确的查询扩展为更具描述性的形式,同时融入图像质量的显式概念。该框架根据相关性和美学评分模型得出的离散质量级别来条件化查询完成,确保增强后的查询既具有语义意义,又具有质量意识。实验表明,这种方法显著提高了检索结果的质量,并提供了有效的质量控制,使其能够与任何预训练的视觉语言模型兼容,并使用户能够将检索结果导向所需的质量水平。
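The conditioning step described above (discretized quality levels fed to a query completion model) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the rank-based binning, the function names `discretize` and `build_completion_prompt`, and the prompt tag format are hypothetical and are not taken from the paper or its code.

```python
def discretize(scores, n_levels):
    """Map each score to a level in 0..n_levels-1 by rank-based binning.

    Assumption: equal-sized rank bins; the paper only says the levels are
    derived from relevance and aesthetic scoring models.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    levels = [0] * len(scores)
    for rank, idx in enumerate(order):
        levels[idx] = min(n_levels - 1, rank * n_levels // len(scores))
    return levels


def build_completion_prompt(query, relevance_level, aesthetic_level):
    """Build a quality-conditioned prompt (hypothetical tag format)."""
    return (
        f"<relevance:{relevance_level}> <aesthetic:{aesthetic_level}> "
        f"Complete the query: {query}"
    )


# Toy aesthetic scores for four images, split into two quality levels.
aesthetic_levels = discretize([0.1, 0.9, 0.5, 0.3], n_levels=2)
prompt = build_completion_prompt("dog", relevance_level=1, aesthetic_level=1)
```

The enriched text returned by the language model would then be passed unchanged to an off-the-shelf retrieval model, which is what keeps the approach independent of the retriever.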
LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
Authors: Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P. Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R. Roth, Marius George Linguraru
Venue: ISBI
First: 2026-02-24T17:42:46+00:00 · Latest: 2026-02-24T17:42:46+00:00
Comments: Accepted to IEEE International Symposium on Biomedical Imaging (ISBI) 2026
Abstract
Large vision-language models (VLMs) have evolved from general-purpose applications to specialized use cases such as in the clinical domain, demonstrating potential for decision support in radiology. One promising application is assisting radiologists in decision-making through the analysis of radiology imaging data such as chest X-rays (CXR) via a visual and natural language question-answering (VQA) interface. When longitudinal imaging is available, radiologists analyze temporal changes, which are essential for accurate diagnosis and prognosis. Manual longitudinal analysis is a time-consuming process, motivating the development of a training framework that can provide prognostic capabilities. We introduce LUMEN, a novel training framework optimized for longitudinal CXR interpretation, leveraging multi-image and multi-task instruction fine-tuning to enhance prognostic and diagnostic performance. We conduct experiments on the publicly available MIMIC-CXR and its associated Medical-Diff-VQA datasets. We further formulate and construct a novel instruction-following dataset incorporating longitudinal studies, enabling the development of a prognostic VQA task. Our method demonstrates significant improvements over baseline models in diagnostic VQA tasks, and more importantly, shows promising potential for prognostic capabilities. These results underscore the value of well-designed, instruction-tuned VLMs in enabling more accurate and clinically meaningful interpretation of longitudinal radiological imaging data.
中文标题/摘要
标题:LUMEN:纵向多模态放射学模型用于预后和诊断
大型视觉-语言模型(VLMs)已从通用应用发展到临床领域的专业用途,展示了在放射学领域决策支持的潜力。一种有前景的应用是通过视觉和自然语言问答(VQA)界面分析放射学影像数据(如胸部X光片),辅助放射科医生进行决策。当有纵向影像时,放射科医生会分析时间变化,这对于准确诊断和预后至关重要。手动的纵向分析是一个耗时的过程,推动了纵向影像解释训练框架的发展。我们介绍了一种新的训练框架LUMEN,该框架针对纵向胸部X光片解释进行了优化,利用多图像和多任务指令微调来增强预后和诊断性能。我们在公开的MIMIC-CXR及其相关Medical-Diff-VQA数据集上进行了实验。我们进一步制定了一个包含纵向研究的新指令遵循数据集,以促进预后VQA任务的发展。我们的方法在诊断VQA任务中显著优于基线模型,并且更重要的是,展示了预后能力的前景。这些结果强调了精心设计、指令调优的VLMs在纵向放射学影像数据放射学解释中的价值。
Summary / 总结
LUMEN is a training framework designed for longitudinal chest X-ray (CXR) interpretation, leveraging multi-image and multi-task instruction fine-tuning to improve prognostic and diagnostic performance. Experiments on MIMIC-CXR and Medical-Diff-VQA datasets show significant improvements over baseline models in diagnostic VQA tasks, and the method demonstrates promising potential for prognostic capabilities.
LUMEN 是一种用于纵向分析胸部X光片的训练框架,旨在增强放射学中的预后和诊断性能。它利用多图像和多任务指令微调,并在 MIMIC-CXR 和 Medical-Diff-VQA 数据集上进行了评估。该方法在诊断 VQA 任务中显示出显著的改进,并且在预后能力方面具有很大的潜力。
VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
Authors: Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li
First: 2026-02-24T16:11:14+00:00 · Latest: 2026-02-24T16:11:14+00:00
Abstract
Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multiple datasets.
中文标题/摘要
标题:VAUQ:基于视觉的不确定性量化方法用于LVLM自我评估
大型视觉-语言模型(LVLMs)经常产生幻觉,限制了它们在实际应用中的安全部署。现有的LLM自我评估方法依赖于模型对其自身输出正确性的估计能力,这可以提高部署的可靠性;然而,它们严重依赖语言先验,因此不适合评估基于视觉的预测。我们提出VAUQ,一种基于视觉的不确定性量化框架,用于LVLM自我评估,明确衡量模型输出对视觉证据的依赖程度。VAUQ引入了图像信息得分(IS),该得分捕捉了视觉输入导致的预测不确定性减少量,并采用无监督的核心区域遮罩策略以增强显著区域的影响。结合预测熵与这种核心遮罩的IS,得到一个无需训练的评分函数,可靠地反映答案的正确性。全面的实验表明,VAUQ在多个数据集上始终优于现有的自我评估方法。
Summary / 总结
The research aims to address the issue of hallucination in Large Vision-Language Models (LVLMs) by developing a method for self-evaluation that is more reliable for vision-conditioned predictions. VAUQ, a vision-aware uncertainty quantification framework, introduces the Image-Information Score (IS) to measure the model's dependence on visual evidence and an unsupervised core-region masking strategy to enhance the influence of salient regions. Experimental results demonstrate that VAUQ outperforms existing self-evaluation methods across various datasets in terms of reliably reflecting answer correctness.
研究旨在通过提出VAUQ,一种视觉感知不确定性量化框架,解决大型视觉语言模型(LVLM)的幻觉问题。VAUQ通过图像信息得分(IS)和无监督的核心区域遮罩策略,衡量模型输出对视觉证据的依赖程度。该方法结合预测熵与核心遮罩的IS,生成一个无需训练即可反映答案正确性的评分函数。实验表明,VAUQ在多个数据集上优于现有自我评估方法。
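To make the Image-Information Score concrete, the following is a minimal sketch under stated assumptions. The abstract describes IS as the reduction in predictive uncertainty attributable to visual input; here that is assumed to be an entropy difference between answer distributions computed without and with the image. The combination rule and the weight `alpha` are hypothetical, and the core-region masking step is omitted.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def image_information_score(p_with_image, p_without_image):
    """Entropy reduction attributable to the image (assumed form of IS)."""
    return entropy(p_without_image) - entropy(p_with_image)


def self_eval_score(p_with_image, p_without_image, alpha=1.0):
    """Higher means the answer is more likely correct.

    Assumption: negative predictive entropy plus alpha times IS; the
    paper's exact scoring function is not given in the abstract.
    """
    info = image_information_score(p_with_image, p_without_image)
    return -entropy(p_with_image) + alpha * info


# Same confident answer distribution in both cases, but only the first
# case became confident because of the image.
grounded = self_eval_score([0.9, 0.05, 0.05], [0.4, 0.3, 0.3])
prior_driven = self_eval_score([0.9, 0.05, 0.05], [0.9, 0.05, 0.05])
```

In this toy case the prediction that relies on the image scores higher than the one driven purely by language priors, which is the behavior the abstract motivates.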
OCR-Agent: Agentic OCR with Capability and Memory Reflection
Authors: Shimin Wen, Zeyu Zhang, Xingdou Bian, Hongjie Zhu, Lulu He, Layi Shama, Daji Ergu, Ying Cai
First: 2026-02-24T16:10:27+00:00 · Latest: 2026-02-24T16:10:27+00:00
Abstract
Large Vision-Language Models (VLMs) have demonstrated significant potential on complex visual understanding tasks through iterative optimization methods. However, these models generally lack effective self-correction mechanisms, making it difficult for them to independently rectify cognitive biases. Consequently, during multi-turn revisions, they often fall into repetitive and ineffective attempts, failing to achieve stable improvements in answer quality. To address this issue, we propose a novel iterative self-correction framework that endows models with two key capabilities: Capability Reflection and Memory Reflection. This framework guides the model to first diagnose errors and generate a correction plan via Capability Reflection, then leverage Memory Reflection to review past attempts to avoid repetition and explore new solutions, and finally, optimize the answer through rigorous re-reasoning. Experiments on the challenging OCRBench v2 benchmark show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by +2.0 on English and +1.2 on Chinese subsets, while achieving state-of-the-art results in Visual Understanding (79.9) and Reasoning (66.5) - surpassing even larger fine-tuned models. Our method demonstrates that structured, self-aware reflection can significantly enhance VLMs' reasoning robustness without additional training. Code: https://github.com/AIGeeksGroup/OCR-Agent.
中文标题/摘要
标题:OCR-Agent: 具有能力和记忆反思的代理OCR
大型视觉-语言模型(VLMs)通过迭代优化方法在复杂的视觉理解任务中展现了显著的潜力。然而,这些模型通常缺乏有效的自我纠正机制,使得它们难以独立纠正认知偏差。因此,在多轮修订过程中,它们往往陷入重复且无效的尝试,无法稳定提高答案质量。为解决这一问题,我们提出了一种新的迭代自我纠正框架,赋予模型两种关键能力:能力反思和记忆反思。该框架引导模型首先通过能力反思诊断错误并生成纠正计划,然后通过记忆反思回顾过去尝试以避免重复并探索新解决方案,最后通过严格的重新推理优化答案。在具有挑战性的OCRBench v2基准测试上,OCR-Agent在英文子集上比当前开源SOTA模型InternVL3-8B高出+2.0,在中文子集上高出+1.2,同时在视觉理解(79.9)和推理(66.5)方面达到最先进的结果,甚至超越了更大规模的微调模型。我们的方法表明,结构化的自我意识反思可以显著增强VLMs的推理稳健性,而无需额外的训练。代码:https://github.com/AIGeeksGroup/OCR-Agent.
Summary / 总结
The research aims to improve the self-correction capabilities of large Vision-Language Models (VLMs) by introducing a novel iterative self-correction framework called OCR-Agent. This framework uses Capability Reflection to diagnose errors and plan corrections, and Memory Reflection to review past attempts, avoid repetition, and explore new solutions. Experiments on OCRBench v2 show that OCR-Agent outperforms the current open-source SOTA model InternVL3-8B by 2.0 points on the English subset and 1.2 points on the Chinese subset, and achieves state-of-the-art results in Visual Understanding and Reasoning, surpassing larger fine-tuned models, without additional training.
论文提出了一种名为OCR-Agent的新迭代自我纠正框架,该框架包括能力反思和记忆反思。该框架帮助模型诊断错误、避免重复尝试并探索新解决方案,从而提高答案质量。实验结果显示,OCR-Agent在OCRBench v2上的表现优于当前的开源SOTA模型InternVL3-8B,分别在英文和中文子集上高出2.0分和1.2分,并在视觉理解和推理方面达到了最先进的结果,超越了更大的微调模型。
Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
Authors: Junhao Xiao, Zhiyu Wu, Hao Lin, Yi Chen, Yahui Liu, Xiaoran Zhao, Zixu Wang, Zejiang He
First: 2026-02-24T15:55:39+00:00 · Latest: 2026-02-24T15:55:39+00:00
Abstract
Vision-Language Models (VLMs) like CLIP struggle to understand negation, often embedding affirmatives and negatives similarly (e.g., matching "no dog" with dog images). Existing methods refine negation understanding via fine-tuning CLIP's text encoder, risking overfitting. In this work, we propose CLIPGlasses, a plug-and-play framework that enhances CLIP's ability to comprehend negated visual descriptions. CLIPGlasses adopts a dual-stage design: a Lens module disentangles negated semantics from text embeddings, and a Frame module predicts context-aware repulsion strength, which is integrated into a modified similarity computation to penalize alignment with negated semantics, thereby reducing false positive matches. Experiments show that CLIP equipped with CLIPGlasses achieves competitive in-domain performance and outperforms state-of-the-art methods in cross-domain generalization. Its superiority is especially evident under low-resource conditions, indicating stronger robustness across domains.
中文标题/摘要
标题:不仅仅是所见:使CLIP理解否定视觉描述而不进行微调
视觉-语言模型(VLMs)如CLIP在理解否定方面存在困难,通常将肯定和否定的嵌入表示得相似(例如,“没有狗”与狗的图像匹配)。现有方法通过微调CLIP的文本编码器来改进否定理解,这存在过拟合的风险。在本文中,我们提出了一种即插即用框架CLIPGlasses,该框架增强了CLIP理解否定视觉描述的能力。CLIPGlasses采用两阶段设计:一个镜片模块将否定语义从文本嵌入中分离出来,一个框架模块预测上下文相关的排斥强度,将其整合到修改后的相似性计算中,以惩罚与否定语义的对齐,从而减少错误匹配。实验表明,配备CLIPGlasses的CLIP在领域内性能具有竞争力,并在跨领域泛化方面优于最先进的方法。特别是在资源有限的情况下,其优越性尤为明显,表明其在不同领域中具有更强的鲁棒性。
Summary / 总结
This work addresses the challenge of CLIP's difficulty in understanding negation in visual descriptions. It introduces CLIPGlasses, a dual-stage framework that includes a Lens module to disentangle negated semantics and a Frame module to predict context-aware repulsion strength, which is integrated into similarity computation to penalize false positive matches. Experimental results demonstrate that CLIPGlasses enhances CLIP's performance in both in-domain and cross-domain tasks, particularly under low-resource conditions, showing improved robustness across domains.
本文解决了CLIP在理解视觉描述中的否定表达方面存在的困难。提出了一种名为CLIPGlasses的双阶段框架,无需微调即可增强CLIP对否定描述的理解能力。Lens模块分离出否定语义,而Frame模块预测上下文相关的排斥强度,将其整合到相似性计算中以减少误匹配。实验结果表明,CLIPGlasses在领域内性能上表现出色,并在跨领域泛化方面优于现有方法,特别是在低资源条件下表现出更强的鲁棒性。
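The modified similarity computation can be sketched as follows. This is an illustrative assumption, not the CLIPGlasses implementation: the negated and affirmed embeddings are given directly (standing in for the Lens module's output), the repulsion strength is a fixed number (standing in for the Frame module's prediction), and the penalty is a simple subtraction.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def negation_aware_similarity(image_emb, affirmed_emb, negated_emb, repulsion):
    """Similarity to the affirmed content minus a penalty for matching
    the negated content (assumed subtraction form)."""
    attract = cosine(image_emb, affirmed_emb)
    repel = max(0.0, cosine(image_emb, negated_emb))
    return attract - repulsion * repel


# Toy 2-D embedding space: axis 0 = "park", axis 1 = "dog".
# Query: "a park with no dog".
affirmed, negated = [1.0, 0.0], [0.0, 1.0]
park_with_dog = negation_aware_similarity([1.0, 1.0], affirmed, negated, 0.5)
empty_park = negation_aware_similarity([1.0, 0.0], affirmed, negated, 0.5)
```

With the penalty in place, the image containing the negated object ranks below the image without it, which is the false-positive reduction the abstract describes.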
From Perception to Action: An Interactive Benchmark for Vision Reasoning
Authors: Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee
First: 2026-02-24T15:33:02+00:00 · Latest: 2026-02-24T15:33:02+00:00
Comments: Work in progress. Website: https://social-ai-studio.github.io/CHAIN/
Abstract
Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet, prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive 3D, physics-driven testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.
中文标题/摘要
标题:从感知到行动:一个用于视觉推理的交互式基准
理解物理结构对于实际应用如具身智能体、交互设计和长时操作至关重要。然而,现有的视觉-语言模型(VLM)评估仍然集中在结构无关、单轮次的设置(如VQA)上,无法评估智能体如何推理几何、接触和支撑关系如何共同限制动态环境中的可能行动。为解决这一问题,我们引入了因果行动和交互层次(CHAIN)基准,这是一个交互式的3D、物理驱动的测试平台,旨在评估模型是否能够理解、规划和执行基于物理约束的结构化行动序列。CHAIN将评估从被动感知转向主动问题解决,涵盖如机械嵌合谜题和3D堆叠与打包等任务。我们在统一的交互式设置下对最先进的VLM和基于扩散的模型进行了全面研究。结果显示,表现最好的模型仍然难以内化物理结构和因果约束,经常无法生成可靠的长期计划,也无法稳健地将感知到的结构转化为有效的行动。项目详情请参见https://social-ai-studio.github.io/CHAIN/。
Summary / 总结
The research addresses the gap in evaluating vision-language models' ability to reason about physical constraints in dynamic environments. It introduces the CHAIN benchmark, which evaluates models' understanding, planning, and execution of structured action sequences through an interactive 3D physics-driven testbed. The study finds that top-performing models struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions.
研究旨在评估模型在现实世界应用中对物理约束的推理能力。CHAIN基准引入了一个互动的3D物理驱动环境,以测试模型对结构化动作序列的理解、规划和执行能力。关键发现表明,当前表现最佳的模型在内化物理结构和因果约束方面存在困难,往往无法生成可靠的长期计划,并且难以将感知到的结构有效转化为实际动作。
VII: Visual Instruction Injection for Jailbreaking Image-to-Video Generation Models
Authors: Bowen Zheng, Yongli Xiang, Ziming Hong, Zerong Lin, Chaojian Yu, Tongliang Liu, Xinge You
First: 2026-02-24T15:20:01+00:00 · Latest: 2026-02-24T15:20:01+00:00
Comments: Project page: https://Zbwwwwwwww.github.io/VII
Abstract
Image-to-Video (I2V) generation models, which condition video generation on reference images, have shown emerging visual instruction-following capability, allowing certain visual cues in reference images to act as implicit control signals for video generation. However, this capability also introduces a previously overlooked risk: adversaries may exploit visual instructions to inject malicious intent through the image modality. In this work, we uncover this risk by proposing Visual Instruction Injection (VII), a training-free and transferable jailbreaking framework that intentionally disguises the malicious intent of unsafe text prompts as benign visual instructions in the safe reference image. Specifically, VII coordinates a Malicious Intent Reprogramming module to distill malicious intent from unsafe text prompts while minimizing their static harmfulness, and a Visual Instruction Grounding module to ground the distilled intent onto a safe input image by rendering visual instructions that preserve semantic consistency with the original unsafe text prompt, thereby inducing harmful content during I2V generation. Empirically, our extensive experiments on four state-of-the-art commercial I2V models (Kling-v2.5-turbo, Gemini Veo-3.1, Seedance-1.5-pro, and PixVerse-V5) demonstrate that VII achieves Attack Success Rates of up to 83.5% while reducing Refusal Rates to near zero, significantly outperforming existing baselines.
中文标题/摘要
标题:VII:视觉指令注入以破解图像到视频生成模型
图像到视频(I2V)生成模型能够根据参考图像生成视频,显示出新兴的视觉指令跟随能力,允许参考图像中的某些视觉线索作为视频生成的隐式控制信号。然而,这种能力也引入了一种之前未被注意到的风险:攻击者可能利用视觉指令通过图像模态注入恶意意图。在本研究中,我们通过提出一种无需训练且可转移的破解框架——视觉指令注入(VII),揭示了这一风险。VII故意将不安全文本提示中的恶意意图伪装成在安全参考图像中的良性视觉指令。具体而言,VII协调了一个恶意意图重编程模块,从不安全文本提示中提取恶意意图并最小化其静态危害性,以及一个视觉指令定位模块,将提取的意图定位到安全输入图像上,通过渲染与原始不安全文本提示保持语义一致性的视觉指令,从而在I2V生成过程中诱导有害内容。实证上,我们在四个最先进的商用I2V模型(Kling-v2.5-turbo、Gemini Veo-3.1、Seedance-1.5-pro和PixVerse-V5)上进行的广泛实验表明,VII在攻击成功率高达83.5%的同时,将拒绝率降低到接近零,显著优于现有基线。
Summary / 总结
This work addresses the risk of adversaries exploiting visual instructions in Image-to-Video (I2V) models to inject malicious intent. It introduces Visual Instruction Injection (VII), a training-free framework that disguises unsafe text prompts as benign visual instructions in safe reference images. Experiments on four commercial I2V models show that VII can achieve an Attack Success Rate of up to 83.5% while nearly eliminating Refusal Rates, outperforming existing methods.
该研究关注对手利用参考图像中的视觉指令注入恶意意图的风险。提出了Visual Instruction Injection (VII)框架,将不安全的文本提示伪装成良性的视觉指令。实验表明,VII 在四个商用 I2V 模型上可以实现高达 83.5% 的攻击成功率,同时几乎消除了拒绝率,优于现有方法。
Training-Free Intelligibility-Guided Observation Addition for Noisy ASR
Authors: Haoyang Li, Changsong Liu, Wei Rao, Hao Shi, Sakriani Sakti, Eng Siong Chng
First: 2026-02-24T14:46:54+00:00 · Latest: 2026-02-24T14:46:54+00:00
Abstract
Automatic speech recognition (ASR) degrades severely in noisy environments. Although speech enhancement (SE) front-ends effectively suppress background noise, they often introduce artifacts that harm recognition. Observation addition (OA) addresses this issue by fusing noisy and SE-enhanced speech, improving recognition without modifying the parameters of the SE or ASR models. This paper proposes an intelligibility-guided OA method, where fusion weights are derived from intelligibility estimates obtained directly from the backend ASR. Unlike prior OA methods based on trained neural predictors, the proposed method is training-free, reducing complexity and enhancing generalization. Extensive experiments across diverse SE-ASR combinations and datasets demonstrate strong robustness and improvements over existing OA baselines. Additional analyses of intelligibility-guided switching-based alternatives and frame versus utterance-level OA further validate the proposed design.
中文标题/摘要
标题:无训练智能可懂度引导观察添加以应对噪声ASR
自动语音识别(ASR)在噪声环境中严重退化。尽管语音增强(SE)前端有效抑制了背景噪声,但它们通常会引入损害识别的伪影。观察添加(OA)通过融合噪声和SE增强的语音解决了这一问题,提高了识别效果而不修改SE或ASR模型的参数。本文提出了一种智能可懂度引导的OA方法,其中融合权重直接从后端ASR获得的可懂度估计中得出。与基于训练神经预测器的先前OA方法不同,所提出的方法无需训练,降低了复杂性并增强了泛化能力。跨多种SE-ASR组合和数据集的广泛实验表明,该方法具有很强的鲁棒性和优于现有OA基线的改进。对基于可懂度引导切换的替代方案以及帧级与句级OA的额外分析进一步验证了所提出的设计。
Summary / 总结
This paper addresses the degradation of automatic speech recognition (ASR) in noisy environments by proposing an intelligibility-guided observation addition (OA) method. Unlike previous OA methods that rely on trained neural predictors, this training-free approach derives fusion weights from ASR intelligibility estimates, enhancing recognition without modifying SE or ASR models. Experiments show strong robustness and improvements over existing OA baselines across various SE-ASR combinations and datasets.
本文提出了一种基于可懂度指导的观察添加(OA)方法,以解决噪声环境下自动语音识别(ASR)性能下降的问题。与依赖训练神经预测器的先前OA方法不同,所提出的方法直接从ASR后端获取可懂度估计值来确定融合权重,从而使其无需训练并减少复杂性。实验结果显示,该方法在各种SE-ASR组合和数据集上具有较强的鲁棒性和优于现有OA基线的方法性能。
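Observation addition itself is a weighted sum of the noisy and enhanced signals, which a short sketch can show. The weighting rule below is an assumption: the abstract says only that fusion weights come from intelligibility estimates of the backend ASR, so the normalization in `fusion_weight` and both function names are hypothetical.

```python
def fusion_weight(intelligibility_noisy, intelligibility_enhanced):
    """Weight placed on the noisy signal, in [0, 1].

    Assumption: normalize the two non-negative intelligibility estimates;
    fall back to an even split when both are zero.
    """
    total = intelligibility_noisy + intelligibility_enhanced
    if total == 0.0:
        return 0.5
    return intelligibility_noisy / total


def observation_addition(noisy, enhanced, weight):
    """Utterance-level OA: sample-wise convex combination of two signals."""
    return [weight * n + (1.0 - weight) * e for n, e in zip(noisy, enhanced)]


# The enhanced signal is judged more intelligible, so it dominates the mix.
w = fusion_weight(0.2, 0.8)
fused = observation_addition([1.0, 1.0], [0.0, 0.5], w)
```

Because only a scalar weight is computed and the SE and ASR models are used as-is, nothing in this pipeline needs training. A frame-level variant would compute one weight per frame instead of one per utterance.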
SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Authors: Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song
Venue: CVPR 2026
First: 2026-02-24T13:38:37+00:00 · Latest: 2026-02-24T13:38:37+00:00
Comments: Accepted by CVPR 2026
Abstract
Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.
中文标题/摘要
标题:SpatiaLQA:评估视觉语言模型空间逻辑推理能力的标准
视觉语言模型(VLMs)由于其出色的理解和推理能力,在现实场景中的应用越来越广泛。尽管VLMs已经在常见的视觉问答和逻辑推理任务中展示了令人印象深刻的性能,但在复杂现实环境中的合理决策能力仍然不足。我们定义这种能力为空间逻辑推理,它不仅需要理解复杂场景中物体之间的空间关系,还需要理解多步任务中步骤之间的逻辑依赖关系。为了弥合这一差距,我们引入了空间逻辑问答(SpatiaLQA),这是一个旨在评估VLMs空间逻辑推理能力的标准。SpatiaLQA 包含来自241个真实室内场景的9,605个问答对。我们在41种主流VLMs上进行了广泛的实验,结果表明,即使是最先进的模型在空间逻辑推理方面仍然存在困难。为了解决这一问题,我们提出了一种名为递归场景图辅助推理的方法,该方法利用视觉基础模型逐步分解复杂场景为与任务相关的场景图,从而增强VLMs的空间逻辑推理能力,优于所有先前的方法。代码和数据集可在https://github.com/xieyc99/SpatiaLQA/ 获取。
Summary / 总结
The research aims to evaluate the spatial logical reasoning capabilities of Vision-Language Models (VLMs) by introducing SpatiaLQA, a benchmark with 9,605 question-answer pairs from 241 real-world indoor scenes. Extensive experiments on 41 VLMs show that these models struggle with spatial logical reasoning. The study proposes a method called recursive scene graph assisted reasoning, which enhances VLMs' spatial logical reasoning ability and outperforms previous methods.
SpatiaLQA 是一个基准,用于评估视觉语言模型(VLMs)的空间逻辑推理能力。它包含来自 241 个真实室内场景的 9,605 个问答对。对 41 种主流 VLMs 的实验表明,即使是最先进的模型在空间逻辑推理方面也存在问题。作者提出了一种递归场景图辅助推理方法,以增强 VLMs 的空间逻辑推理能力,超过了之前的所有方法。
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu
Venue: CVPR 2026
First: 2026-02-24T13:20:31+00:00 · Latest: 2026-02-24T13:20:31+00:00
Comments: CVPR 2026; Code is released at https://github.com/tmllab/2026_CVPR_CASG
Abstract
Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing the overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
中文标题/摘要
标题:当安全相撞:通过自适应安全指导解决文本到图像扩散中的多类别有害冲突
文本到图像(T2I)扩散模型在生成高质量图像方面取得了显著进展,但同时也引发了关于有害内容生成的安全问题。基于安全指导的方法已被提出,通过引导生成远离预定义关键词定义的有害区域来减轻有害输出。然而,这些方法未能捕捉不同有害类别之间的复杂相互作用,导致“有害冲突”,即减轻一种有害类型可能无意中放大另一种,从而增加整体有害率。为解决这一问题,我们提出了冲突感知自适应安全指导(CASG),这是一种无需训练的框架,能够在生成过程中动态识别并应用与模型生成状态最一致的类别导向。CASG 包含两个组件:(i)冲突感知类别识别(CaCI),识别与模型生成状态最一致的有害类别,以及(ii)冲突解决指导应用(CrGA),仅沿识别的类别应用安全引导,以避免多类别干扰。CASG 可应用于潜在空间和文本空间保护。在 T2I 安全基准上的实验表明,CASG 达到了最先进的性能,与现有方法相比,有害率最多降低了 15.4%。
Summary / 总结
This paper addresses the challenge of harmful content generation in text-to-image diffusion models by proposing Conflict-aware Adaptive Safety Guidance (CASG). CASG dynamically identifies and applies category-aligned safety guidance to avoid multi-category interference, thereby reducing harmful content. Experiments show that CASG outperforms existing methods, decreasing the harmful rate by up to 15.4%.
论文提出了一种名为冲突感知自适应安全引导(CASG)的方法,该方法在生成过程中动态识别并应用与当前生成状态最匹配的有害类别,以减少有害内容的生成。该方法通过减少高达15.4%的有害率,解决了单一类型危害缓解可能加剧其他类型危害的问题。该框架包括冲突感知类别识别(CaCI)和冲突解决引导应用(CrGA)两个组件,并可应用于潜空间和文本空间的安全保护措施。
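The two CASG components can be sketched on toy vectors. This is a simplified illustration under assumptions: alignment is measured with cosine similarity, steering is a plain subtraction with a fixed `scale`, and the category names and 2-D directions are made up; the paper's actual latent-space and text-space formulations are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def identify_category(state, category_directions):
    """CaCI-style step: pick the harm category most aligned with the state."""
    return max(category_directions, key=lambda name: cosine(state, category_directions[name]))


def apply_guidance(state, direction, scale):
    """CrGA-style step: steer away along a single direction only."""
    return [s - scale * d for s, d in zip(state, direction)]


# Hypothetical harm directions in a toy 2-D state space.
directions = {"violence": [1.0, 0.0], "nudity": [0.0, 1.0]}
state = [0.9, 0.1]

chosen = identify_category(state, directions)
steered = apply_guidance(state, directions[chosen], scale=0.5)

# Baseline for contrast: steer along the average of all category directions.
averaged = [0.5, 0.5]
steered_avg = apply_guidance(state, averaged, scale=0.5)
```

Steering along the single identified direction leaves the unrelated component of the state untouched, whereas the averaged direction also perturbs it; that side effect is the kind of cross-category interference the abstract calls a harmful conflict.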
Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
Authors: Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding
First: 2026-02-24T13:20:07+00:00 · Latest: 2026-02-24T13:20:07+00:00
Abstract
Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments on state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.
中文标题/摘要
标题:通过结构相关图诊断视觉语言模型的因果推理
大型视觉语言模型(LVLMs)在视觉问答基准测试中表现出色,但往往依赖于虚假的相关性而非真正的因果推理。现有评估主要评估答案的正确性,这使得不清楚失败是由于推理能力有限还是由于错误地识别了因果相关的信息。我们引入了视觉语言因果图(VLCGs),这是一种结构化、查询条件化的表示,明确编码了因果相关对象、属性、关系和场景相关的假设。基于这种表示,我们提出了ViLCaR,这是一个诊断基准,包括因果归因、因果推理和问答任务,以及图对齐的评估指标,这些指标评估了相关性识别,而不仅仅是最终答案的准确性。在最先进的LVLMs中的实验表明,注入结构相关信息与零样本和标准上下文学习相比,显著提高了归因和推理一致性。这些发现表明,当前LVLM因果推理的局限性主要来自于结构指导不足,而不是推理能力的缺乏。
Summary / 总结
The research aims to diagnose the causal reasoning capabilities of large Vision-Language Models (LVLMs) by introducing Vision-Language Causal Graphs (VLCGs) that explicitly encode causally relevant information. ViLCaR, a diagnostic benchmark, evaluates LVLMs on tasks of causal attribution, causal inference, and question answering using graph-aligned metrics. Experiments show that incorporating structured relevance information enhances attribution and inference consistency in state-of-the-art LVLMs, indicating that current limitations in causal reasoning are mainly due to insufficient structural guidance rather than a lack of reasoning capacity.
研究旨在通过引入Vision-Language Causal Graphs (VLCGs),明确编码因果相关信息,诊断大型Vision-Language模型(LVLM)的因果推理能力。ViLCaR是一个诊断基准,使用图对齐的评估指标评估LVLM在因果归因、因果推理和问答任务上的表现。实验表明,整合结构化的相关性信息可以显著提高属性识别和推理一致性,这表明当前LVLM因果推理的局限性主要是由于缺乏足够的结构指导,而不是推理能力不足。
Flow-Based Conformal Predictive Distributions
Authors: Trevor Harris
First: 2026-02-07T17:26:50+00:00 · Latest: 2026-02-24T13:19:00+00:00
Comments: 9 pages, 7 figures, 10 appendix pages
Abstract
Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Boundary samples can be reconformalized to form pointwise prediction sets with controlled risk and, optionally, repulsed along the boundary to improve geometric coverage. Mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide exactly with conformal prediction sets. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.
中文标题/摘要
标题:基于流的容错预测分布
容错预测提供了一种通过预测集进行不确定性量化的方法,这些预测集具有确切的有限样本覆盖率。在低维空间中,这些集合并易于解释,但在高维或结构化输出空间中,它们难以表示和使用,这可能限制了它们与下游任务(如采样和概率预测)的集成能力。我们展示了任何可微的非一致性分数会诱导输出空间上的确定性流,其轨迹收敛到相应的容错预测集的边界。这导致了一种在任意维度中高效且无需训练的方法,用于采样容错边界。边界样本可以重新容错以形成具有可控风险的点预测集,并可选地沿边界推开以改善几何覆盖。在不同置信水平之间混合会产生与容错预测集完全一致的容错预测分布的分位区域。我们在偏微分方程反问题、降水缩放、气候模型校正和飓风轨迹预测方面评估了该方法。
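The flow idea in the Flow-Based Conformal Predictive Distributions abstract can be illustrated with a toy sketch. The code below is not the paper's construction: it assumes a normalized gradient flow dy/dt = -(s(y) - q) * grad s / |grad s|^2 that drives a point toward the level set s(y) = q, with a hypothetical score s(y) equal to the Euclidean distance from a point prediction `center` and a hypothetical conformal threshold `q`.

```python
import math

def score(y, center):
    """Toy nonconformity score (assumption): Euclidean distance to a point prediction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, center)))

def score_grad(y, center):
    """Analytic gradient of the toy score; undefined when y == center."""
    s = score(y, center)
    return [(a - b) / s for a, b in zip(y, center)]

def flow_to_boundary(y0, center, q, step=0.1, iters=200):
    """Euler-integrate dy/dt = -(s(y) - q) * grad s / |grad s|^2 from y0."""
    y = list(y0)
    for _ in range(iters):
        g = score_grad(y, center)
        norm2 = sum(v * v for v in g)
        gap = score(y, center) - q  # signed distance from the level set in score units
        y = [a - step * gap * v / norm2 for a, v in zip(y, g)]
    return y

# Points that start outside or inside the set both end up on its boundary.
outside = flow_to_boundary([3.0, 4.0], [0.0, 0.0], q=2.0)
inside = flow_to_boundary([0.1, 0.0], [0.0, 0.0], q=2.0)
```

With this choice of flow the gap s(y) - q shrinks geometrically at each step, so no training is involved, only score evaluations and gradients. The starting point must differ from `center`, where the toy score is not differentiable.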
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
Authors: Jiahao Xu, Sheng Huang, Xin Zhang, Zhixiong Nan, Jiajun Dong, Nankun Mu
Venue: CVPR 2026
First: 2026-02-24T13:17:35+00:00 · Latest: 2026-02-24T13:17:35+00:00
Comments: Accepted by CVPR 2026
Abstract
In computational pathology, few-shot whole slide image classification is primarily driven by the extreme scarcity of expert-labeled slides. Recent vision-language methods incorporate textual semantics generated by large language models, but treat these descriptions as static class-level priors that are shared across all samples and lack sample-wise refinement. This limits both the diversity and precision of visual-semantic alignment, hindering generalization under limited supervision. To overcome this, we propose the stochastic MUlti-view Semantic Enhancement (MUSE), a framework that first refines semantic precision via sample-wise adaptation and then enhances semantic richness through retrieval-augmented multi-view generation. Specifically, MUSE introduces Sample-wise Fine-grained Semantic Enhancement (SFSE), which yields a fine-grained semantic prior for each sample through MoE-based adaptive visual-semantic interaction. Guided by this prior, Stochastic Multi-view Model Optimization (SMMO) constructs an LLM-generated knowledge base of diverse pathological descriptions per class, then retrieves and stochastically integrates multiple matched textual views during training. These dynamically selected texts serve as enriched semantic supervisions to stochastically optimize the vision-language model, promoting robustness and mitigating overfitting. Experiments on three benchmark WSI datasets show that MUSE consistently outperforms existing vision-language baselines in few-shot settings, demonstrating that effective few-shot pathology learning requires not only richer semantic sources but also their active and sample-aware semantic optimization. Our code is available at: https://github.com/JiahaoXu-god/CVPR2026_MUSE.
中文标题/摘要
标题:MUSE:利用精确多样的语义进行少量样本全切片图像分类
在计算病理学中,少量样本全切片图像分类主要受到专家标注切片极度稀缺的驱动。最近的视觉-语言方法结合了大型语言模型生成的文本语义,但将这些描述视为静态的类先验,这些先验在所有样本中共享且缺乏样本级别的细化。这限制了视觉-语义对齐的多样性和精确性,阻碍了在有限监督下的泛化。为克服这一问题,我们提出了随机多视图语义增强(MUSE)框架,该框架首先通过样本级别的适应细化语义精度,然后通过检索增强的多视图生成增强语义丰富性。具体而言,MUSE 引入了样本级细粒度语义增强(SFSE),通过基于 MoE 的自适应视觉-语义交互为每个样本生成细粒度的语义先验。在该先验的引导下,随机多视图模型优化(SMMO)构建了每个类的多样化病理描述的 LLM 生成知识库,然后在训练期间检索并随机整合多个匹配的文本视图。这些动态选择的文本作为丰富的语义监督,随机优化视觉-语言模型,促进鲁棒性并减轻过拟合。在三个基准 WSI 数据集上的实验表明,MUSE 在少量样本设置中始终优于现有的视觉-语言基线,表明有效的少量样本病理学习不仅需要更丰富的语义来源,还需要它们的主动和样本感知的语义优化。我们的代码可在:https://github.com/JiahaoXu-god/CVPR2026_MUSE 获取。
Summary / 总结
MUSE addresses the limitations of static class-level textual semantics in few-shot whole slide image classification by proposing a framework that refines semantic precision and enhances semantic richness. It introduces Sample-wise Fine-grained Semantic Enhancement (SFSE) for adaptive visual-semantic interaction and Stochastic Multi-view Model Optimization (SMMO) for retrieving and integrating diverse textual views. Experiments on three benchmark WSI datasets show that MUSE outperforms existing vision-language baselines, highlighting the importance of active and sample-aware semantic optimization for robust few-shot pathology learning.
MUSE通过增强语义的精确性和丰富性来克服现有少量样本全视野图像分类方法的局限性。它引入了样本级细粒度语义增强(SFSE)进行自适应视觉-语义交互,并通过随机多视图模型优化(SMMO)动态整合多种文本视图。实验表明,MUSE在三个基准WSI数据集上的表现优于现有视觉-语言基线,强调了样本感知的语义优化对于少量样本病理学习的鲁棒性的重要性。
On the Explainability of Vision-Language Models in Art History
Authors: Stefanie Schneider
First: 2026-02-24T12:53:28+00:00 · Latest: 2026-02-24T12:53:28+00:00
Abstract
Vision-Language Models (VLMs) transfer visual and textual data into a shared embedding space. In so doing, they enable a wide range of multimodal tasks, while also raising critical questions about the nature of machine 'understanding.' In this paper, we examine how Explainable Artificial Intelligence (XAI) methods can render the visual reasoning of a VLM - namely, CLIP - legible in art-historical contexts. To this end, we evaluate seven methods, combining zero-shot localization experiments with human interpretability studies. Our results indicate that, while these methods capture some aspects of human interpretation, their effectiveness hinges on the conceptual stability and representational availability of the examined categories.
中文标题/摘要
标题:视觉语言模型在艺术史中的可解释性研究
视觉语言模型(VLMs)将视觉和文本数据转换到共享嵌入空间。通过这种方式,它们能够执行一系列跨模态任务,同时也引发了关于机器‘理解’本质的关键问题。在本文中,我们探讨了可解释人工智能(XAI)方法如何使视觉语言模型(如CLIP)在艺术史背景下的视觉推理变得可读。为此,我们评估了七种方法,结合零样本定位实验与人类可解释性研究。我们的结果表明,虽然这些方法捕捉到了一些人类解释的方面,但它们的有效性取决于所研究类别概念稳定性和表征可用性的程度。
Training-Free Multi-Concept Image Editing
Authors: Niki Foteinopoulou, Ignas Budvytis, Stephan Liwicki
First: 2026-02-24T12:27:51+00:00 · Latest: 2026-02-24T12:27:51+00:00
Comments: 17 pages, 13 figures
Abstract
Editing images with diffusion models without training remains challenging. While recent optimisation-based methods achieve strong zero-shot edits from text, they struggle to preserve identity or capture details that language alone cannot express. Many visual concepts such as facial structure, material texture, or object geometry cannot be expressed through text prompts alone. To address this gap, we introduce a training-free framework for concept-based image editing, which unifies Optimised DDS with LoRA-driven concept composition, where the training data of the LoRA represent the concept. Our approach enables combining and controlling multiple visual concepts directly within the diffusion process, integrating semantic guidance from text with low-level cues from pretrained concept adapters. We further refine DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Quantitative and qualitative results demonstrate consistent improvements over existing training-free diffusion editing methods on InstructPix2Pix and ComposLoRA benchmarks. Code will be made publicly available.
中文标题/摘要
标题:无需训练的多概念图像编辑
使用扩散模型进行图像编辑而无需训练仍然具有挑战性。尽管最近的基于优化的方法能够实现从文本进行强大的零样本编辑,但它们在保留身份或捕捉语言无法表达的细节方面存在困难。许多视觉概念,如面部结构、材料纹理或对象几何形状,仅通过文本提示无法完全表达。为了解决这一差距,我们提出了一种无需训练的概念驱动图像编辑框架,该框架将优化DDS与LoRA驱动的概念组合统一起来,其中LoRA的训练数据代表概念。我们的方法能够在扩散过程中直接结合和控制多个视觉概念,将文本的语义指导与预训练概念适配器提供的低级线索结合起来。我们进一步通过有序的时间步、正则化和负提示指导来细化DDS,以提高稳定性和可控性。定量和定性结果表明,与现有的无需训练的扩散编辑方法相比,在InstructPix2Pix和ComposLoRA基准上具有一致的改进。代码将公开发布。
Summary / 总结
The paper addresses the challenge of training-free multi-concept image editing using diffusion models. It introduces a framework that combines Optimised DDS with LoRA-driven concept composition, allowing multiple visual concepts to be combined and controlled directly within the diffusion process. The approach integrates semantic guidance from text with low-level cues from pretrained concept adapters, and it further refines DDS for stability and controllability through ordered timesteps, regularisation, and negative-prompt guidance. Experimental results show consistent improvements over existing training-free diffusion editing methods on the InstructPix2Pix and ComposLoRA benchmarks.
研究解决了在不进行训练的情况下使用扩散模型进行图像编辑的难题,尤其是这些方法在保持身份和捕捉超出文本描述的细节方面表现不佳。该研究提出了一种无需训练的框架,结合了优化的DDS与基于LoRA的概念组成,能够在扩散过程中直接整合多个视觉概念。该方法通过有序的时间步、正则化和负向提示指导来增强稳定性和可控性,展示了在InstructPix2Pix和ComposLoRA基准测试中的持续改进。
GatedCLIP: Gated Multimodal Fusion for Hateful Memes Detection
Authors: Yingying Guo, Ke Zhang, Zirong Zeng
First: 2026-02-24T11:54:54+00:00 · Latest: 2026-02-24T11:54:54+00:00
Comments: Preprint
Abstract
Detecting hateful content in multimodal memes presents unique challenges, as harmful messages often emerge from the complex interplay between benign images and text. We propose GatedCLIP, a Vision-Language model that enhances CLIP's multimodal capabilities with specialized architectural improvements for hateful memes detection. Our approach introduces learned projection heads that map CLIP embeddings to a task-optimized semantic space, a dynamic gated fusion mechanism that adaptively weights visual and textual features, and a contrastive learning objective that maintains cross-modal semantic alignment. Experiments on the Hateful Memes dataset demonstrate that GatedCLIP achieves an AUROC of 0.66, substantially outperforming the CLIP baseline (AUROC 0.49) while maintaining computational efficiency with only 350K trainable parameters.
中文标题/摘要
标题:GatedCLIP:针对有害表情包检测的门控多模态融合
在多模态表情包中检测有害内容面临着独特的挑战,因为有害信息往往源自无害图像和文本之间的复杂交互。我们提出了GatedCLIP,这是一种视觉-语言模型,通过专门的架构改进增强了CLIP的多模态能力,以用于有害表情包检测。我们的方法引入了学习投影头,将CLIP嵌入映射到任务优化的语义空间,动态门控融合机制以自适应加权视觉和文本特征,以及对比学习目标以保持跨模态语义对齐。在Hateful Memes数据集上的实验表明,GatedCLIP的AUROC为0.66,显著优于CLIP基线(AUROC 0.49),同时仅使用350K可训练参数保持了计算效率。
Summary / 总结
The research aims to address the challenge of detecting hateful content in multimodal memes, where harmful messages can arise from the interaction between images and text. GatedCLIP, a Vision-Language model, is proposed to enhance CLIP's multimodal fusion with specialized architectural improvements. Key features include learned projection heads, a dynamic gated fusion mechanism, and a contrastive learning objective. Experiments show that GatedCLIP achieves an AUROC of 0.66, significantly outperforming the CLIP baseline (AUROC 0.49) with only 350K trainable parameters.
GatedCLIP 通过增强 CLIP 的多模态能力来检测多模态 meme 中的仇恨内容,包括学习投影头、动态门控融合机制和对比学习目标。实验结果显示,GatedCLIP 的 AUROC 达到 0.66,显著优于 CLIP 基线模型,且仅有 35 万个可训练参数。
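The dynamic gated fusion mentioned in the GatedCLIP abstract can be sketched as follows. This is a minimal illustration, not the authors' architecture: it assumes a per-dimension sigmoid gate computed from the concatenated visual and textual features, and the names `gate_weights` and `gate_bias` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(visual, textual, gate_weights, gate_bias):
    """Fuse two equal-length feature vectors with a per-dimension gate.

    gate_weights[i] is a row of length 2*d applied to concat(visual, textual);
    the gate g_i in (0, 1) weights the visual feature, 1 - g_i the textual one.
    """
    joint = list(visual) + list(textual)
    fused = []
    for i, (v, t) in enumerate(zip(visual, textual)):
        logit = sum(w * x for w, x in zip(gate_weights[i], joint)) + gate_bias[i]
        g = sigmoid(logit)
        fused.append(g * v + (1.0 - g) * t)
    return fused

# With zero weights and bias the gate is 0.5, so the result is a plain average.
balanced = gated_fusion([1.0, 0.0], [0.0, 1.0], [[0.0] * 4, [0.0] * 4], [0.0, 0.0])
# A large positive bias saturates the gate and passes the visual features through.
visual_only = gated_fusion([1.0, 0.0], [0.0, 1.0], [[0.0] * 4, [0.0] * 4], [50.0, 50.0])
```

Because the gate depends on both inputs, the weighting can change from one meme to the next, which is the adaptive behaviour the abstract describes. In a trained model the gate parameters would be learned alongside the projection heads.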
VGGDrive: Empowering Vision-Language Models with Cross-View Geometric Grounding for Autonomous Driving
Authors: Jie Wang, Guang Li, Zhijian Huang, Chenxu Dang, Hangjun Ye, Yahong Han, Long Chen
Venue: CVPR 2026
First: 2026-02-24T11:33:44+00:00 · Latest: 2026-02-24T11:33:44+00:00
Comments: CVPR 2026
Abstract
The significance of cross-view 3D geometric modeling capabilities for autonomous driving is self-evident, yet existing Vision-Language Models (VLMs) inherently lack this capability, resulting in their mediocre performance. While some promising approaches attempt to mitigate this by constructing Q&A data for auxiliary training, they still fail to fundamentally equip VLMs with the ability to comprehensively handle diverse evaluation protocols. We thus chart a new course, advocating for the infusion of VLMs with the cross-view geometric grounding of mature 3D foundation models, closing this critical capability gap in autonomous driving. In this spirit, we propose a novel architecture, VGGDrive, which empowers Vision-language models with cross-view Geometric Grounding for autonomous Driving. Concretely, to bridge the cross-view 3D geometric features from the frozen visual 3D model with the VLM's 2D visual features, we introduce a plug-and-play Cross-View 3D Geometric Enabler (CVGE). The CVGE decouples the base VLM architecture and effectively empowers the VLM with 3D features through a hierarchical adaptive injection mechanism. Extensive experiments show that VGGDrive enhances base VLM performance across five autonomous driving benchmarks, including tasks like cross-view risk perception, motion prediction, and trajectory planning. It's our belief that mature 3D foundation models can empower autonomous driving tasks through effective integration, and we hope our initial exploration demonstrates the potential of this paradigm to the autonomous driving community.
中文标题/摘要
标题:VGGDrive:通过跨视图几何定位增强视觉语言模型以赋能自动驾驶
跨视图3D几何建模能力对于自动驾驶的重要性不言而喻,但现有的视觉语言模型(VLMs)本身缺乏这种能力,导致其表现平平。尽管一些有前景的方法试图通过构建问答数据进行辅助训练来缓解这一问题,但它们仍然无法从根本上使VLMs具备全面处理各种评估协议的能力。因此,我们提出了一个新的方向,主张将成熟的3D基础模型的跨视图几何定位能力注入到VLMs中,以填补自动驾驶中的这一关键能力缺口。在此精神下,我们提出了一种新的架构VGGDrive,该架构赋予视觉语言模型跨视图几何定位能力以赋能自动驾驶。具体而言,为了将冻结的视觉3D模型中的跨视图3D几何特征与VLM的2D视觉特征连接起来,我们引入了一种即插即用的跨视图3D几何增强器(CVGE)。CVGE解耦了基础VLM架构,并通过分层自适应注入机制有效地赋予VLM 3D特征。广泛的实验表明,VGGDrive在包括交叉视图风险感知、运动预测和轨迹规划在内的五个自动驾驶基准测试中提升了基础VLM的表现。我们相信,成熟的3D基础模型可以通过有效的集成赋能自动驾驶任务,我们希望我们的初步探索能够向自动驾驶社区展示这一范式的潜力。
Summary / 总结
VGGDrive is proposed to enhance Vision-Language Models (VLMs) with cross-view geometric grounding for autonomous driving. It introduces a Cross-View 3D Geometric Enabler (CVGE) to bridge 3D geometric features from a frozen 3D model with the 2D visual features of VLMs. Experiments show that VGGDrive improves base VLM performance across five autonomous driving benchmarks, including cross-view risk perception, motion prediction, and trajectory planning.
论文针对现有视觉-语言模型(VLM)在处理自主驾驶中的跨视图3D几何建模能力不足的问题,提出了一种新的架构VGGDrive,通过引入跨视图3D几何增强器(CVGE)将冻结的3D模型的3D几何特征与VLM的2D视觉特征进行整合。实验结果显示,VGGDrive在包括交叉视图风险感知和运动预测在内的五个自主驾驶基准任务中提高了基模型的表现。
Keep it SymPL: Symbolic Projective Layout for Allocentric Spatial Reasoning in Vision-Language Models
Authors: Jaeyun Jang, Seunghui Shin, Taeho Park, Hyoseok Hwang
First: 2026-02-22T10:18:54+00:00 · Latest: 2026-02-24T10:19:29+00:00
Abstract
Perspective-aware spatial reasoning involves understanding spatial relationships from specific viewpoints, either egocentric (observer-centered) or allocentric (object-centered). While vision-language models (VLMs) perform well in egocentric settings, their performance deteriorates when reasoning from allocentric viewpoints, where spatial relations must be inferred from the perspective of objects within the scene. In this study, we address this underexplored challenge by introducing Symbolic Projective Layout (SymPL), a framework that reformulates allocentric reasoning into symbolic-layout forms that VLMs inherently handle well. By leveraging four key factors (projection, abstraction, bipartition, and localization), SymPL converts allocentric questions into structured symbolic-layout representations. Extensive experiments demonstrate that this reformulation substantially improves performance in both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component contributes critically to these gains. These results show that SymPL provides an effective and principled approach for addressing complex perspective-aware spatial reasoning.
中文标题/摘要
标题:保持简洁:符号投影布局在视觉语言模型中处理以物为中心的空间推理
视角感知的空间推理涉及从特定视角理解空间关系,无论是以我为中心(观察者为中心)还是以物为中心。虽然视觉语言模型(VLMs)在以我为中心的环境中表现良好,但在从以物为中心的视角进行推理时,其性能会下降,因为在这些视角中,空间关系必须从场景中物体的视角进行推断。在本研究中,我们通过引入符号投影布局(SymPL)框架来解决这一未充分探索的挑战,该框架将以物为中心的推理重新表述为视觉语言模型能够很好地处理的符号布局形式。通过利用四个关键因素——投影、抽象、二分和定位,SymPL将以物为中心的问题转换为结构化的符号布局表示。广泛的实验表明,这种重新表述在以物为中心和以我为中心的任务中显著提高了性能,增强了在视觉幻觉和多视图场景下的鲁棒性,并且每个组件都对这些改进做出了关键贡献。这些结果表明,SymPL提供了一种有效且原理性的方法来解决复杂的视角感知空间推理问题。
Summary / 总结
This study addresses the challenge of allocentric spatial reasoning in vision-language models by introducing SymPL, a framework that reformulates allocentric reasoning into symbolic-layout forms. By leveraging projection, abstraction, bipartition, and localization, SymPL converts allocentric questions into structured representations that VLMs can handle effectively. Experiments show that this approach improves performance in both allocentric and egocentric tasks, enhances robustness under visual illusions and multi-view scenarios, and that each component of SymPL is critical to these gains.
该研究通过引入SymPL框架来解决视觉语言模型在处理从物体中心视角的空间推理方面的挑战,该框架将从物体中心视角的空间推理重新表述为符号布局形式。通过利用投影、抽象、二分和定位四个关键因素,SymPL将从物体中心视角的问题转换为结构化的表示形式,这些表示形式是视觉语言模型能够有效处理的。实验表明,这种方法在从物体中心视角和从观察者中心视角的任务中都提高了性能,增强了在视觉错觉和多视角场景下的鲁棒性,并且每个组件对这些改进都是至关重要的。
MIRROR: Multimodal Iterative Reasoning via Reflection on Visual Regions
Authors: Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia
First: 2026-02-21T07:56:59+00:00 · Latest: 2026-02-24T09:35:41+00:00
Abstract
In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct ReflectV, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.
中文标题/摘要
标题:MIRROR:通过视觉区域反思进行多模态迭代推理
在视觉语言模型(VLMs)的时代,增强多模态推理能力仍然是一个关键挑战,特别是在处理含糊或复杂的视觉输入时,初始推断往往会导致幻觉或逻辑错误。现有的VLMs经常产生看似合理但实际上缺乏依据的答案,即使被提示“反思”,它们的修正也可能与图像证据脱节。为了解决这个问题,我们提出了MIRROR框架,用于通过视觉区域反思进行多模态迭代推理。通过将视觉反思作为核心机制,MIRROR被构造成一个闭环过程,包括草稿、批评、基于区域的验证和修订,这些步骤会重复进行,直到输出具有视觉依据。为了促进该模型的训练,我们构建了**ReflectV**视觉反思数据集,该数据集明确包含反思触发、基于区域的验证动作以及视觉证据支持的答案修订。在通用视觉语言基准测试和代表性视觉语言推理基准测试中的实验表明,MIRROR提高了正确性并减少了视觉幻觉,证明了将反思训练为一种寻求证据、区域意识的验证过程而非纯粹的文本修订步骤的价值。
Summary / 总结
The research aims to enhance multimodal reasoning capabilities in Vision-Language Models (VLMs) by addressing the issue of hallucinations and logic errors in ambiguous or complex visual inputs. The MIRROR framework proposes a closed-loop process of draft, critique, region-based verification, and revision, which is repeated until the output is visually grounded. To train this model, a new dataset called ReflectV is constructed, which includes reflection triggers, region-based verification actions, and answer revisions grounded in visual evidence. Experiments show that MIRROR improves correctness and reduces visual hallucinations compared to existing VLMs.
研究旨在通过解决视觉输入中的模糊性问题,增强视觉语言模型(VLMs)的多模态推理能力,解决初始推理导致的幻觉和逻辑错误。MIRROR框架提出了一种闭环过程,包括草稿、批评、基于区域的验证和修订,以确保模型的输出与视觉证据一致。实验表明,MIRROR在正确性和减少视觉幻觉方面优于现有VLMs,强调了训练模型以寻求和验证视觉证据的重要性。
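The closed loop of draft, critique, region-based verification, and revision described in the MIRROR abstract can be sketched as plain control flow. The four callables (`draft`, `critique`, `verify_region`, `revise`) and the `max_rounds` cap are hypothetical stand-ins for model calls; the abstract does not specify their interfaces or the stopping rule.

```python
def reflect_loop(question, image, draft, critique, verify_region, revise, max_rounds=3):
    """Run draft -> (critique -> verify region -> revise)* until no doubt remains.

    critique returns a region to inspect, or None when the answer looks grounded.
    """
    answer = draft(question, image)
    for _ in range(max_rounds):
        region = critique(question, image, answer)
        if region is None:
            break  # the critic found nothing left to check
        evidence = verify_region(image, region)
        answer = revise(question, answer, evidence)
    return answer

# Toy stand-ins: the "image" is a dict from region name to what is visible there.
toy_image = {"left": "cat"}
toy_answer = reflect_loop(
    "What animal is on the left?",
    toy_image,
    draft=lambda q, img: "dog",
    critique=lambda q, img, a: None if a == img["left"] else "left",
    verify_region=lambda img, region: img[region],
    revise=lambda q, a, evidence: evidence,
)
```

The point of the structure is that each revision is conditioned on evidence fetched from a specific image region, rather than on the previous text alone.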
NGL-Prompter: Training-Free Sewing Pattern Estimation from a Single Image
Authors: Anna Badalyan, Pratheba Selvaraju, Giorgio Becherini, Omid Taheri, Victoria Fernandez Abrevaya, Michael Black
First: 2026-02-24T09:01:11+00:00 · Latest: 2026-02-24T09:01:11+00:00
Comments: 10 pages, 7 figures
Abstract
Estimating sewing patterns from images is a practical approach for creating high-quality 3D garments. Due to the lack of real-world pattern-image paired data, prior approaches fine-tune large vision language models (VLMs) on synthetic garment datasets generated by randomly sampling from a parametric garment model, GarmentCode. However, these methods often struggle to generalize to in-the-wild images, fail to capture real-world correlations between garment parts, and are typically restricted to single-layer outfits. In contrast, we observe that VLMs are effective at describing garments in natural language, yet perform poorly when asked to directly regress GarmentCode parameters from images. To bridge this gap, we propose NGL (Natural Garment Language), a novel intermediate language that restructures GarmentCode into a representation more understandable to language models. Leveraging this language, we introduce NGL-Prompter, a training-free pipeline that queries large VLMs to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode. We evaluate our method on Dress4D, CloSe, and a newly collected dataset of approximately 5,000 in-the-wild fashion images. Our approach achieves state-of-the-art performance on standard geometry metrics and is strongly preferred in both human and GPT-based perceptual evaluations compared to existing baselines. Furthermore, NGL-Prompter can recover multi-layer outfits whereas competing methods focus mostly on single-layer garments, highlighting its strong generalization to real-world images even with occluded parts. These results demonstrate that accurate sewing pattern reconstruction is possible without costly model training. Our code and data will be released for research use.
中文标题/摘要
标题:NGL-Prompter:无需训练的单张图像缝制图案估计
从图像中估计缝制图案是创建高质量3D服装的一种实用方法。由于缺乏现实世界的图案-图像配对数据,先前的方法在合成服装数据集上微调大型视觉语言模型(VLMs),这些数据集是通过从参数化服装模型GarmentCode中随机采样生成的。然而,这些方法往往难以泛化到野外图像,无法捕捉服装部件之间的实际关联,并且通常仅限于单层服装。相比之下,我们观察到VLMs在自然语言中描述服装方面非常有效,但在直接从图像回归GarmentCode参数时表现不佳。为了弥合这一差距,我们提出了NGL(自然服装语言),这是一种新型的中间语言,将GarmentCode重新结构化为更易于语言模型理解的表示。利用这种语言,我们引入了NGL-Prompter,这是一种无需训练的流水线,可以查询大型VLMs以提取结构化的服装参数,然后将这些参数确定性地映射到有效的GarmentCode。我们在Dress4D、CloSe和一个新收集的约5,000张野外时尚图像的数据集上评估了我们的方法。我们的方法在标准几何度量上达到了最先进的性能,并且在人类和GPT基的感知评估中都比现有基线方法更受欢迎。此外,NGL-Prompter可以恢复多层服装,而竞争方法主要关注单层服装,这突显了其在即使有遮挡部分的情况下也能很好地泛化到现实世界的图像。这些结果表明,无需昂贵的模型训练即可实现准确的缝制图案重建。我们的代码和数据将用于研究。
Summary / 总结
The paper addresses the challenge of estimating sewing patterns from single images without paired data. It proposes NGL-Prompter, a training-free method that uses a novel intermediate language, NGL, to restructure GarmentCode parameters into a format that large vision language models understand more readily. The method queries these models to extract structured garment parameters, which are then deterministically mapped to valid GarmentCode. Experiments on Dress4D, CloSe, and a new dataset of approximately 5,000 in-the-wild fashion images show that NGL-Prompter outperforms existing methods on geometry metrics and is preferred in perceptual evaluations, and that it can recover multi-layer outfits even with occluded parts.
研究旨在无需配对数据的情况下,从单张图片中估计缝制图案,解决以往依赖合成数据的方法的局限性。提出的NGL-Prompter使用一种新型中间语言NGL,将GarmentCode参数重新结构化为大型视觉语言模型更容易理解的格式。该方法在几何度量上达到最先进的性能,并在人类和GPT基的感知评估中优于现有方法,特别是在处理多层服装和具有遮挡部分的真实世界图像方面表现出色。
PromptCD: Test-Time Behavior Enhancement via Polarity-Prompt Contrastive Decoding
Authors: Baolong Bi, Yuyao Ge, Shenghua Liu, Yuchen He, Siqian Tong, Lizhe Chen, Lingrui Mei, Zehao Li, Yiwei Wang, Yujun Cai, Ming-Hsuan Yang, Xueqi Cheng
First: 2026-02-24T08:56:52+00:00 · Latest: 2026-02-24T08:56:52+00:00
Abstract
Reliable AI systems require large language models (LLMs) to exhibit behaviors aligned with human preferences and values. However, most existing alignment approaches operate at training time and rely on additional high-quality data, incurring significant computational and annotation costs. While recent work has shown that contrastive decoding can leverage a model's internal distributions to improve specific capabilities, its applicability remains limited to narrow behavioral scopes and scenarios. In this work, we introduce Polarity-Prompt Contrastive Decoding (PromptCD), a test-time behavior control method that generalizes contrastive decoding to broader enhancement settings. PromptCD constructs paired positive and negative guiding prompts for a target behavior and contrasts model responses (specifically, token-level probability distributions in LLMs and visual attention patterns in VLMs) to reinforce desirable outcomes. This formulation extends contrastive decoding to a wide range of enhancement objectives and is applicable to both LLMs and Vision-Language Models (VLMs) without additional training. For LLMs, experiments on the "3H" alignment objectives (helpfulness, honesty, and harmlessness) demonstrate consistent and substantial improvements, indicating that post-trained models can achieve meaningful self-enhancement purely at test time. For VLMs, we further analyze contrastive effects on visual attention, showing that PromptCD significantly improves VQA performance by reinforcing behavior-consistent visual grounding. Collectively, these results highlight PromptCD as a simple, general, and cost-efficient strategy for reliable behavior control across modalities.
中文标题/摘要
标题:PromptCD:通过极性提示对比解码提高测试时行为
可靠的AI系统需要大型语言模型(LLMs)的行为与人类偏好和价值观保持一致。然而,大多数现有的对齐方法在训练时运行,并依赖于额外的高质量数据,这会带来显著的计算和标注成本。虽然最近的工作表明,对比解码可以通过利用模型内部的分布来提高特定能力,但其适用范围仍然局限于狭窄的行为范围和场景。在本研究中,我们提出了极性提示对比解码(PromptCD),这是一种测试时行为控制方法,将对比解码推广到更广泛的增强设置。PromptCD 构建了针对目标行为的正向和负向引导提示,并对比模型响应,特别是LLMs的特定标记概率分布和VLMs的视觉注意力模式,以强化期望的结果。这种形式将对比解码扩展到广泛的增强目标,并且无需额外训练即可适用于LLMs和视觉语言模型(VLMs)。对于LLMs,针对“3H”对齐目标(有用性、诚实性和无害性)的实验表明,可以实现一致且显著的改进,表明后训练模型可以在测试时实现有意义的自我增强。对于VLMs,我们进一步分析了对比效果对视觉注意力的影响,表明PromptCD 显著提高了VQA性能,通过强化行为一致的视觉定位。这些结果共同突显了PromptCD 作为一种简单、通用且成本效益高的策略,可以在不同模态中实现可靠的行为空控。
Summary / 总结
PromptCD is a test-time behavior control method that enhances the alignment of large language models (LLMs) and vision-language models (VLMs) with human preferences. It uses paired positive and negative guiding prompts to contrast model responses, reinforcing desirable outcomes. Experiments show consistent improvements in LLMs for the '3H' alignment objectives (helpfulness, honesty, and harmlessness) and significant enhancements in VQA performance for VLMs by improving visual grounding consistency. This method is simple, general, and cost-efficient for behavior control across different modalities.
PromptCD 是一种测试时行为控制方法,用于增强大型语言模型(LLMs)和视觉语言模型(VLMs)与人类偏好的一致性。它通过正负向提示对比模型响应,改善了 token 级概率分布和视觉注意力模式。实验结果显示,LLMs 在 '3H' 对齐目标上表现出一致的改进,而 VLMs 的视觉问答(VQA)性能也得到了显著提升。
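The token-level contrast described in the PromptCD abstract can be illustrated with a small sketch. It assumes the commonly used contrastive-decoding form (1 + alpha) * log p_pos - alpha * log p_neg, which may differ from the paper's exact formulation; `pos_logits` and `neg_logits` stand for next-token logits obtained under the positive and negative guiding prompts, and `alpha` is a hypothetical contrast strength.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrast_next_token(pos_logits, neg_logits, alpha=1.0):
    """Return (argmax token id, scores) after contrasting the two distributions."""
    lp_pos = log_softmax(pos_logits)
    lp_neg = log_softmax(neg_logits)
    scores = [(1.0 + alpha) * p - alpha * n for p, n in zip(lp_pos, lp_neg)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Token 0 is likely under both prompts; token 1 is likely only under the positive one.
pos = [2.0, 1.9, 0.0]
neg = [2.0, 0.0, 0.0]
plain_choice, _ = contrast_next_token(pos, neg, alpha=0.0)     # ordinary greedy pick
contrast_choice, _ = contrast_next_token(pos, neg, alpha=1.0)  # contrasted pick
```

Setting `alpha` to 0 recovers ordinary greedy decoding under the positive prompt. Larger values favour tokens that the positive prompt supports and the negative prompt does not, which is why the choice moves from token 0 to token 1 in the toy example.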
How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
Authors: Bo Peng, Pi Bu, Keyu Pan, Xinrun Xu, Yinxiu Zhao, Miao Chen, Yang Du, Lin Li, Jun Song, Tong Xu
First: 2026-02-24T08:42:41+00:00 · Latest: 2026-02-24T08:42:41+00:00
Abstract
Recent advances in vision-language models (VLMs) have shown promise for human-level embodied intelligence. However, existing benchmarks for VLM-driven embodied agents often rely on high-level commands or discretized action spaces, which are non-native settings that differ markedly from real-world control. In addition, current benchmarks focus primarily on high-level tasks and lack joint evaluation and analysis at both low and high levels. To address these limitations, we present NativeEmbodied, a challenging benchmark for VLM-driven embodied agents that uses a unified, native low-level action space. Built on diverse simulated scenes, NativeEmbodied includes three representative high-level tasks in complex scenarios to evaluate overall performance. For more detailed analysis, we further decouple the skills required by complex tasks and construct four types of low-level tasks, each targeting a fundamental embodied skill. This joint evaluation across task and skill granularities enables fine-grained assessment of embodied agents. Experiments with state-of-the-art VLMs reveal clear deficiencies in several fundamental embodied skills, and further analysis shows that these bottlenecks significantly limit performance on high-level tasks. NativeEmbodied highlights key challenges for current VLM-driven embodied agents and provides insights to guide future research.
中文标题/摘要
标题:基础技能如何影响基于VLM的具身代理:本土视角
近期视觉语言模型(VLMs)的发展为实现人类水平的具身智能带来了希望。然而,现有的VLM驱动的具身代理基准通常依赖于高级命令或离散的动作空间,这些设置是非本土的,与实际世界的控制有显著差异。此外,当前的基准主要集中在高级任务上,缺乏对低级和高级任务的联合评估和分析。为了解决这些局限性,我们提出了NativeEmbodied,这是一个针对VLM驱动的具身代理的具有挑战性的基准,使用统一的本土低级动作空间构建。基于多样化的模拟场景,NativeEmbodied 包含三个代表性的复杂场景中的高级任务,以评估整体性能。为了进行更详细的分析,我们进一步将复杂任务所需的能力分解,并构建了四种类型的低级任务,每种任务针对一种基本的具身能力。这种跨任务和能力粒度的联合评估使具身代理的细粒度评估成为可能。使用最先进的VLM进行的实验揭示了几个基本的具身能力中的明显缺陷,进一步分析表明,这些瓶颈显著限制了高级任务的性能。NativeEmbodied突出了当前VLM驱动的具身代理的关键挑战,并提供了指导未来研究的见解。
Improving Motion in Image-to-Video Models via Adaptive Low-Pass Guidance
Authors: June Suk Choi, Kyungmin Lee, Sihyun Yu, Yisol Choi, Jinwoo Shin, Kimin Lee
First: 2025-06-10T05:23:46+00:00 · Latest: 2026-02-24T08:19:40+00:00
Comments: Project page: http://choi403.github.io/ALG
Abstract
Recent text-to-video (T2V) models have demonstrated strong capabilities in producing high-quality, dynamic videos. To improve the visual controllability, recent works have considered fine-tuning pre-trained T2V models to support image-to-video (I2V) generation. However, such adaptation frequently suppresses motion dynamics of generated outputs, resulting in more static videos compared to their T2V counterparts. In this work, we analyze this phenomenon and identify that it stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image. To address this, we propose adaptive low-pass guidance (ALG), a simple training-free fix to the I2V model sampling procedure to generate more dynamic videos without compromising per-frame image quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying a low-pass filter at the early stage of denoising. Extensive experiments show ALG significantly improves the temporal dynamics of generated videos, while preserving or even improving image fidelity and text alignment. For instance, on the VBench test suite, ALG achieves a 33% average improvement across models in dynamic degree while maintaining the original video quality. For additional visualizations and source code, see the project page.
中文标题/摘要
标题:通过自适应低通引导提高图像到视频模型中的运动效果
近期的文本到视频(T2V)模型展示了生成高质量动态视频的强大能力。为了提高视觉可控性,近期的研究考虑了对预训练的T2V模型进行微调以支持图像到视频(I2V)生成。然而,这种适应经常抑制生成输出的运动动态,导致与T2V模型相比,生成的视频更加静态。在本文中,我们分析了这一现象,并发现其根源在于输入图像过早暴露于高频细节,这使采样过程偏向于一条捷径轨迹,过度拟合参考图像的静态外观。为了解决这一问题,我们提出了自适应低通引导(ALG),这是一种简单的训练外修正方法,用于I2V模型采样过程,以生成更具动态性的视频,同时不牺牲每帧图像质量。具体而言,ALG在去噪的早期阶段通过应用低通滤波器来适应性地调节条件图像的频率内容。广泛的实验表明,ALG显著提高了生成视频的时间动态性,同时保持或甚至提高了图像保真度和文本对齐。例如,在VBench测试套件上,ALG在动态程度上实现了33%的平均改进,同时保持了原始视频质量。有关额外的可视化和源代码,请参见项目页面。
Summary / 总结
This work addresses the issue of reduced motion dynamics in image-to-video (I2V) models compared to text-to-video (T2V) models by proposing adaptive low-pass guidance (ALG). ALG mitigates the suppression of motion dynamics by modulating the frequency content of the conditioning image with a low-pass filter during the early stages of denoising. Experiments show that ALG enhances temporal dynamics while maintaining or improving image fidelity and text alignment, achieving a 33% average improvement in dynamic degree on the VBench test suite.
本文针对图像到视频(I2V)模型相比文本到视频(T2V)模型运动动态减弱的问题,提出了一种无需训练的方法:自适应低通引导(ALG)。ALG 在去噪的早期阶段对条件图像施加低通滤波以调节其频率内容,从而生成更具动态性的视频。实验表明,ALG 在 VBench 测试套件上将动态程度平均提高了 33%,同时保持或提升了图像保真度和文本对齐。
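The idea of low-pass filtering the conditioning image only during early denoising can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the box (mean) filter, the linear decay of the blur radius, and the names `lowpass_strength`, `box_blur`, `conditioning_for_step`, `cutoff_fraction`, and `max_radius` are hypothetical and are not taken from the ALG paper or its code.

```python
def lowpass_strength(step, num_steps, cutoff_fraction=0.3, max_radius=3):
    """Blur radius for a sampling step: largest at step 0, zero once past the cutoff.

    The linear decay and the 30% cutoff are illustrative assumptions.
    """
    cutoff = cutoff_fraction * num_steps
    if step >= cutoff:
        return 0
    return max(1, round(max_radius * (1 - step / cutoff)))


def box_blur(image, radius):
    """Low-pass filter a 2D list of floats with a mean filter.

    Each output pixel is the average of the in-bounds pixels inside a
    (2*radius+1) x (2*radius+1) window centred on it.
    """
    if radius == 0:
        return [row[:] for row in image]
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += image[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def conditioning_for_step(image, step, num_steps):
    """Return the (possibly blurred) conditioning image to use at this sampling step."""
    return box_blur(image, lowpass_strength(step, num_steps))
```

In this sketch the sampler would call `conditioning_for_step` once per denoising step, so early steps see only coarse structure and later steps see the unmodified reference image.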
Recursive Belief Vision Language Model
Authors: Vaidehi Bagaria, Bijo Sebastian, Nirav Patel
First: 2026-02-24T08:02:16+00:00 · Latest: 2026-02-24T08:02:16+00:00
Abstract
Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks, respectively, compared to π0. It also reduces inference latency by up to 5x relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%. These results demonstrate the effectiveness of belief-based state representations for long-horizon VLA policies.
Summary / 总结
The paper addresses the limitations of current vision-language-action (VLA) models in handling long-horizon manipulation tasks under partial observability. It introduces RB-VLA, a belief-centric architecture that maintains a compact latent state to track task progress and enable phase-aware control. RB-VLA outperforms previous models on multi-stage pick-and-place and stacking tasks, with success rates 52.5% and 37.5% higher than π0, respectively, and reduces inference latency by up to 5x compared to baselines. Ablation studies indicate that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%.
本文针对当前视觉-语言-动作模型在处理部分可观测条件下的长期操作任务时的局限性,提出了一种以信念为中心的架构RB-VLA,该架构维持一个紧凑的潜状态来跟踪任务进度并实现阶段感知控制。RB-VLA在多阶段拾放和堆叠任务上的表现优于先前的模型,成功率分别提高了52.5%和37.5%。与基线相比,它还将推理延迟降低了最多5倍。消融实验表明,信念模块显著提高了性能,将成功率从32.5%提高到77.5%。这项工作突显了信念驱动的状态表示在长期视觉-语言-动作策略中的有效性。
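The property that the belief does not grow with time can be shown with a toy recursive update. This is only a sketch under stated assumptions: the blending rule, the `gain` parameter, and the function names `update_belief` and `rollout` are hypothetical and do not reproduce the RB-VLA architecture or its training objectives.

```python
import math


def update_belief(belief, observation, action, gain=0.5):
    """One recursive step: blend the previous belief with a feature of the
    current observation and action. All three vectors share one dimension."""
    assert len(belief) == len(observation) == len(action)
    return [
        (1.0 - gain) * b + gain * math.tanh(o + a)
        for b, o, a in zip(belief, observation, action)
    ]


def rollout(observations, actions, dim):
    """Fold a whole episode into a fixed-size belief vector."""
    belief = [0.0] * dim
    for obs, act in zip(observations, actions):
        # The raw observation is not stored after this update.
        belief = update_belief(belief, obs, act)
    return belief
```

Whether the episode has 5 steps or 500, `rollout` returns a vector of length `dim`, which is the sense in which memory stays constant across timesteps.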
Vision-Language Models for Ergonomic Assessment of Manual Lifting Tasks: Estimating Horizontal and Vertical Hand Distances from RGB Video
Authors: Mohammad Sadra Rajabi, Aanuoluwapo Ojelade, Sunwook Kim, Maury A. Nussbaum
First: 2026-02-24T08:01:49+00:00 · Latest: 2026-02-24T08:01:49+00:00
Abstract
Manual lifting tasks are a major contributor to work-related musculoskeletal disorders, and effective ergonomic risk assessment is essential for quantifying physical exposure and informing ergonomic interventions. The Revised NIOSH Lifting Equation (RNLE) is a widely used ergonomic risk assessment tool for lifting tasks that relies on six task variables, including horizontal (H) and vertical (V) hand distances; such distances are typically obtained through manual measurement or specialized sensing systems and are difficult to use in real-world environments. We evaluated the feasibility of using innovative vision-language models (VLMs) to non-invasively estimate H and V from RGB video streams. Two multi-stage VLM-based pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. Both pipelines used text-guided localization of task-relevant regions of interest, visual feature extraction from those regions, and transformer-based temporal regression to estimate H and V at the start and end of a lift. For a range of lifting tasks, estimation performance was evaluated using leave-one-subject-out validation across the two pipelines and seven camera view conditions. Results varied significantly across pipelines and camera view conditions, with the segmentation-based, multi-view pipeline consistently yielding the smallest errors, achieving mean absolute errors of approximately 6-8 cm when estimating H and 5-8 cm when estimating V. Across pipelines and camera view configurations, pixel-level segmentation reduced estimation error by approximately 20-30% for H and 35-40% for V relative to the detection-only pipeline. These findings support the feasibility of VLM-based pipelines for video-based estimation of RNLE distance parameters.
中文标题/摘要
标题:用于手动搬举任务工效学评估的视觉-语言模型:从RGB视频估计水平和垂直手距
手动搬举任务是与工作相关的肌肉骨骼疾病的主要诱因之一,有效的工效学风险评估对于量化身体负荷暴露并指导工效学干预措施至关重要。修订版NIOSH搬举方程(RNLE)是一种广泛使用的搬举任务工效学风险评估工具,依赖于包括水平(H)和垂直(V)手距在内的六个任务变量;这些距离通常通过手动测量或专用传感系统获得,在实际环境中难以使用。我们评估了使用视觉-语言模型(VLMs)从RGB视频流非侵入性地估计H和V的可行性。我们开发了两种基于VLM的多阶段管道:一个文本引导的仅检测管道和一个检测加分割管道。两种管道均使用文本引导的方式定位任务相关的感兴趣区域,从这些区域提取视觉特征,并使用基于Transformer的时间回归来估计搬举开始和结束时的H和V。针对多种搬举任务,我们在两种管道和七种摄像机视图条件下采用留一被试验证评估了估计性能。结果在不同管道和摄像机视图条件下差异显著,基于分割的多视图管道始终产生最小的误差,估计H时平均绝对误差约为6-8厘米,估计V时约为5-8厘米。在不同管道和摄像机视图配置下,像素级分割相对于仅检测管道将H的估计误差减少了约20-30%,V的估计误差减少了约35-40%。这些发现支持基于VLM的管道从视频中估计RNLE距离参数的可行性。
Summary / 总结
The study aims to assess the feasibility of using vision-language models (VLMs) to estimate horizontal and vertical hand distances during manual lifting tasks from RGB video streams, which are crucial for ergonomic risk assessment. Two pipelines were developed: a text-guided detection-only pipeline and a detection-plus-segmentation pipeline. The segmentation-based, multi-view pipeline showed the best performance, achieving mean absolute errors of approximately 6-8 cm for horizontal distances and 5-8 cm for vertical distances. Pixel-level segmentation reduced estimation errors by about 20-30% for horizontal distances and 35-40% for vertical distances compared to the detection-only pipeline.
研究旨在评估使用视觉语言模型(VLMs)从RGB视频流中估计手动搬举任务中手的水平(H)和垂直(V)距离的可行性。开发了两种管道:文本引导的仅检测管道和检测加分割管道。基于分割的多视图管道表现最佳,H的平均绝对误差约为6-8厘米,V的平均绝对误差约为5-8厘米。与仅检测管道相比,像素级分割将H的估计误差降低了约20-30%,V降低了约35-40%。
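The evaluation protocol named above, leave-one-subject-out validation scored by mean absolute error, can be sketched as follows. The data layout, the function names, and the mean-predicting placeholder model are assumptions for illustration; the study's actual regressor is transformer-based and is not reproduced here.

```python
def loso_mae(samples, fit, predict):
    """Leave-one-subject-out evaluation.

    samples: list of (subject_id, features, target_cm) tuples.
    Returns a dict mapping each held-out subject to its mean absolute error (cm).
    """
    subjects = sorted({subject for subject, _, _ in samples})
    results = {}
    for held_out in subjects:
        train = [(x, y) for s, x, y in samples if s != held_out]
        test = [(x, y) for s, x, y in samples if s == held_out]
        model = fit(train)  # train only on the other subjects
        errors = [abs(predict(model, x) - y) for x, y in test]
        results[held_out] = sum(errors) / len(errors)
    return results


def fit_mean(train):
    """Placeholder model: remember the mean training target."""
    return sum(y for _, y in train) / len(train)


def predict_mean(model, features):
    """Placeholder prediction: ignore the features and return the stored mean."""
    return model
```

Holding out all samples from one subject at a time keeps that person's body dimensions and lifting style out of the training fold, which is why this split is commonly used when the goal is generalization to new workers.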
HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Authors: Chuhao Zhou, Jianfei Yang
Venue: NeurIPS 2025
First: 2025-05-23T09:06:09+00:00 · Latest: 2026-02-24T07:02:26+00:00
Comments: Camera-ready version. Accepted at NeurIPS 2025
Abstract
Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language. While Vision-Language Models (VLMs) have enabled impressive language-grounded perception, their reliance on visual data limits robustness in real-world scenarios with occlusions, poor lighting, or privacy constraints. In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning across heterogeneous environments. We address two key challenges: (1) the scarcity of aligned modality-text data for rare sensors, and (2) the heterogeneity of their physical signal representations. To overcome these, we design a Universal Modality-Injection Projector (UMIP) that enhances pre-aligned modality embeddings with fine-grained, text-aligned features from tailored encoders via coarse-to-fine cross-attention without introducing significant alignment overhead. We further introduce a human-VLM collaborative data curation pipeline to generate paired textual annotations for sensing datasets. Extensive experiments on two newly constructed benchmarks show that HoloLLM significantly outperforms existing MLLMs, improving language-grounded human sensing accuracy by up to 30%. This work establishes a new foundation for real-world, language-informed multisensory embodied intelligence.
中文标题/摘要
标题:HoloLLM:语言导向的人体感知与推理的多感官基础模型
在智能家居中运作的具身智能体必须通过多种感官输入理解人类行为,并通过自然语言进行交流。尽管视觉语言模型(VLMs)已经实现了令人印象深刻的语言导向感知,但它们对视觉数据的依赖限制了在具有遮挡、照明不良或隐私限制的真实世界场景中的鲁棒性。在本文中,我们介绍了HoloLLM,这是一种多模态大型语言模型(MLLM),它整合了诸如激光雷达、红外线、毫米波雷达和WiFi等不常见但强大的传感模态,以在异构环境中实现无缝的人体感知和推理。我们解决了两个关键挑战:(1)稀有传感器对齐模态-文本数据的稀缺性,(2)它们物理信号表示的异质性。为了解决这些问题,我们设计了一个通用模态注入投影器(UMIP),它通过粗到细的交叉注意力增强预对齐的模态嵌入,引入了细粒度、文本对齐的特征,来自定制编码器,而不会引入显著的对齐开销。我们还引入了一种人类-VLM协作数据整理流水线,以生成传感数据集的配对文本注释。在两个新构建的基准上的广泛实验表明,HoloLLM 显著优于现有 MLLMs,将语言导向的人体感知准确性提高了高达 30%。这项工作为现实世界的、语言驱动的多感官具身智能奠定了新的基础。
Summary / 总结
HoloLLM is a Multimodal Large Language Model that integrates various sensing modalities like LiDAR, infrared, mmWave radar, and WiFi to enable robust human perception and reasoning in smart homes. It addresses the challenges of scarce aligned data and heterogeneous signal representations by using a Universal Modality-Injection Projector and a human-VLM collaborative data curation pipeline. Experiments show that HoloLLM improves language-grounded human sensing accuracy by up to 30% compared to existing models.
HoloLLM 是一个结合了 LiDAR、红外、毫米波雷达和 WiFi 等多种传感模态的多模态大型语言模型,以增强基于语言的人类感知和推理能力。通过通用模态注入投影器和人类-VLM 合作数据整理流水线来解决稀缺对齐数据和异构信号表示的挑战。实验表明,HoloLLM 相较于现有模型在语言指导的人类感知准确性上提高了最多 30%。
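The cross-attention step in which coarse, pre-aligned embeddings are enriched with fine-grained features can be sketched with plain scaled dot-product attention. This is a simplified illustration, not the UMIP design: it has no learned projections, uses a single head and a single stage, and the names `softmax` and `cross_attend` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, fine_features):
    """Each coarse query attends over the fine-grained features, which serve as
    both keys and values; the attended vector is added back as a residual."""
    dim = len(queries[0])
    enriched = []
    for q in queries:
        scores = [
            sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim)
            for f in fine_features
        ]
        weights = softmax(scores)
        attended = [
            sum(w * f[d] for w, f in zip(weights, fine_features))
            for d in range(dim)
        ]
        enriched.append([qi + ai for qi, ai in zip(q, attended)])
    return enriched
```

The output has one vector per query, so the number of tokens passed to the language model is set by the coarse embeddings rather than by the size of the fine-grained feature map.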
CRAFT-LoRA: Content-Style Personalization via Rank-Constrained Adaptation and Training-Free Fusion
Authors: Yu Li, Yujun Cai, Chi Zhang
First: 2026-02-21T19:05:11+00:00 · Latest: 2026-02-24T06:30:28+00:00
Abstract
Personalized image generation requires effectively balancing content fidelity with stylistic consistency when synthesizing images based on text and reference examples. Low-Rank Adaptation (LoRA) offers an efficient personalization approach, with potential for precise control through combining LoRA weights on different concepts. However, existing combination techniques face persistent challenges: entanglement between content and style representations, insufficient guidance for controlling elements' influence, and unstable weight fusion that often requires additional training. We address these limitations through CRAFT-LoRA, with complementary components: (1) rank-constrained backbone fine-tuning that injects low-rank projection residuals to encourage learning decoupled content and style subspaces; (2) a prompt-guided approach featuring an expert encoder with specialized branches that enables semantic extension and precise control through selective adapter aggregation; and (3) a training-free, timestep-dependent classifier-free guidance scheme that enhances generation stability by strategically adjusting noise predictions across diffusion steps. Our method significantly improves content-style disentanglement, enables flexible semantic control over LoRA module combinations, and achieves high-fidelity generation without additional retraining overhead.
中文标题/摘要
标题:CRAFT-LoRA:基于秩约束适配和无训练融合的内容-风格个性化
基于文本和参考示例生成个性化图像时,需要有效平衡内容保真度与风格一致性。低秩适配(LoRA)提供了一种高效的个性化方法,通过结合不同概念的LoRA权重,有望实现精确控制。然而,现有的组合技术面临持续的挑战:内容和风格表示的纠缠、对各元素影响的控制引导不足,以及往往需要额外训练的不稳定权重融合。我们通过CRAFT-LoRA解决了这些限制,它包括互补的组件:(1) 秩约束的主干微调,通过注入低秩投影残差来促进学习解耦的内容和风格子空间;(2) 一种基于提示的方法,包含一个带有专门分支的专家编码器,通过选择性适配器聚合实现语义扩展和精确控制;(3) 一种无需训练、依赖时间步的无分类器引导方案,通过在扩散步骤中有策略地调整噪声预测来增强生成稳定性。我们的方法显著提高了内容-风格的解耦,实现了对LoRA模块组合的灵活语义控制,并在无需额外重新训练开销的情况下实现了高保真生成。
Summary / 总结
CRAFT-LoRA addresses the challenges in personalized image generation by combining rank-constrained adaptation and training-free fusion. It introduces a rank-constrained backbone fine-tuning method to promote the separation of content and style, a prompt-guided expert encoder for precise control, and a classifier-free guidance scheme to enhance generation stability. The method improves content-style disentanglement, allows flexible semantic control, and achieves high-fidelity generation without additional training overhead.
CRAFT-LoRA通过结合秩约束适配和无需训练的融合来解决个性化图像生成的挑战。它引入了秩约束的主干微调方法以促进内容和风格的分离,一个提示引导的专家编码器以实现精确控制,以及一个无需训练、依赖时间步的无分类器引导方案以增强生成稳定性。该方法提高了内容和风格的解耦,允许灵活的语义控制,并在无需额外训练的情况下实现了高保真度的生成。
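Classifier-free guidance combines an unconditional and a conditional noise prediction as eps_uncond + w * (eps_cond - eps_uncond); a timestep-dependent variant lets the scale w change across sampling steps. The sketch below assumes a linear schedule and the default values `w_start=7.5` and `w_end=3.0`, all of which are hypothetical choices for illustration; the schedule CRAFT-LoRA actually uses is not stated in the abstract.

```python
def guidance_scale(step, num_steps, w_start=7.5, w_end=3.0):
    """Guidance scale for a sampling step, interpolated linearly from
    w_start at the first step to w_end at the last step."""
    fraction = step / max(1, num_steps - 1)
    return w_start + (w_end - w_start) * fraction


def guided_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the conditional one by the given scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

Because the schedule only rescales predictions the model already produces, a scheme of this kind needs no additional training, which matches the "training-free" description in the abstract.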
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Authors: Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta
First: 2026-02-18T18:55:02+00:00 · Latest: 2026-02-24T06:15:16+00:00
Comments: Project page: https://hero-humanoid.github.io/
Abstract
Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.
中文标题/摘要
标题:面向开放词汇视觉移动操作的人形机器人末端执行器控制学习
使用人形机器人在真实开放环境中对任意物体进行视觉移动操作(loco-manipulation),需要精确的末端执行器(EE)控制,以及通过视觉输入(例如RGB-D图像)对场景形成可泛化的理解。现有方法基于现实世界的模仿学习,由于难以收集大规模训练数据集,泛化能力有限。本文提出了一种用于人形机器人物体移动操作的新范式HERO,它将大型视觉模型的强泛化能力和开放词汇理解与仿真训练带来的强控制性能结合起来。我们通过设计一种精确的残差感知末端执行器跟踪策略来实现这一点。该跟踪策略结合了经典机器人学与机器学习。它使用a) 逆运动学将残差末端执行器目标转换为参考轨迹,b) 用于精确正运动学的学习型神经前向模型,c) 目标调整,以及d) 重新规划。这些创新共同帮助我们将末端执行器跟踪误差降低了3.2倍。我们使用这种精确的末端执行器跟踪器构建了一个模块化的移动操作系统,其中使用开放词汇大型视觉模型实现强大的视觉泛化。我们的系统能够在从办公室到咖啡馆等多样化的现实环境中运行,机器人能够可靠地操作各种日常物体(例如马克杯、苹果、玩具),这些物体所在表面的高度从43cm到92cm不等。在仿真和现实世界中进行的系统性模块测试和端到端测试证明了我们所提出设计的有效性。我们相信本文的进展可以为训练人形机器人与日常物体交互开辟新的途径。
Summary / 总结
This paper introduces HERO, a new paradigm for humanoid robots to perform object manipulation in diverse environments. It combines strong generalization from large vision models with accurate end-effector control through a residual-aware tracking policy. The policy integrates inverse kinematics, a learned forward model, goal adjustment, and replanning. Experimental results show a 3.2x reduction in end-effector tracking error, enabling reliable manipulation of various objects in real-world settings such as offices and coffee shops.
本文提出了HERO,一种新的框架,使类人机器人能够在多种环境中执行物体操作。该框架结合了大型视觉模型的强大泛化能力和通过残差感知末端执行器跟踪策略实现的精确控制。该策略整合了逆运动学、学习前向模型、目标调整和重规划。实验结果表明,末端执行器跟踪误差减少了3.2倍,使机器人能够在办公室和咖啡馆等真实世界环境中可靠地操作各种物体。
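The interplay of inverse kinematics, an accurate forward model, and goal adjustment can be illustrated on a toy 2-link planar arm. This is a sketch under strong assumptions: the arm, the link lengths, the fixed-point update, and all names are hypothetical, and a forward-kinematics function with slightly different link lengths stands in for the learned neural forward model described in the abstract.

```python
import math

NOMINAL_LINKS = (1.0, 1.0)   # link lengths assumed by the IK solver (hypothetical)
ACTUAL_LINKS = (1.02, 0.97)  # lengths used by the stand-in "accurate" forward model


def inverse_kinematics(target, links):
    """Analytic IK for a 2-link planar arm, returning one elbow branch."""
    l1, l2 = links
    x, y = target
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    q2 = math.acos(max(-1.0, min(1.0, cos_q2)))
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


def forward_model(joints, links=ACTUAL_LINKS):
    """End-effector position predicted for the given joint angles."""
    l1, l2 = links
    q1, q2 = joints
    return (
        l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
        l1 * math.sin(q1) + l2 * math.sin(q1 + q2),
    )


def tracking_error(target, adjustments):
    """Distance between the target and the reached position after shifting the
    commanded goal by the predicted residual `adjustments` times."""
    command = list(target)
    for _ in range(adjustments):
        reached = forward_model(inverse_kinematics(command, NOMINAL_LINKS))
        command[0] += target[0] - reached[0]
        command[1] += target[1] - reached[1]
    reached = forward_model(inverse_kinematics(command, NOMINAL_LINKS))
    return math.hypot(target[0] - reached[0], target[1] - reached[1])
```

Calling `tracking_error(target, 0)` gives the error of the unadjusted IK solution, while a few adjustment rounds move the commanded goal until the forward model predicts that the true target is reached.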
Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion
Authors: Jiaru Zhang, Manav Gagvani, Can Cui, Juntong Peng, Ruqi Zhang, Ziran Wang
First: 2026-02-24T05:59:10+00:00 · Latest: 2026-02-24T05:59:10+00:00
Abstract
Large Language Models (LLMs) and Vision-Language Models (VLMs) have emerged as promising candidates for end-to-end autonomous driving. However, these models typically face challenges in inference latency, action precision, and explainability. Existing autoregressive approaches struggle with slow token-by-token generation, while prior diffusion-based planners often rely on verbose, general-purpose language tokens that lack explicit geometric structure. In this work, we propose Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD), a novel framework designed to bridge the gap between efficient planning and semantic explainability via a masked vision-language-action diffusion model. Unlike methods that force actions into the language space, we introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Moreover, we propose geometry-aware embedding learning to ensure that embeddings in the latent space approximate physical geometric metrics. Finally, an action-priority decoding strategy is introduced to prioritize trajectory generation. Extensive experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision, while providing high-fidelity and explainable reasoning.
中文标题/摘要
标题:通过掩码视觉-语言-动作扩散实现高效可解释的端到端自动驾驶
大型语言模型(LLMs)和视觉-语言模型(VLMs)已成为端到端自动驾驶的有前途的候选者。然而,这些模型通常面临推理延迟、动作精度和可解释性方面的挑战。现有的自回归方法受限于缓慢的逐词元生成,而先前基于扩散的规划器往往依赖冗长的通用语言词元,缺乏明确的几何结构。在本文中,我们提出了用于自动驾驶的掩码视觉-语言-动作扩散(MVLAD-AD),这是一种新型框架,旨在通过掩码视觉-语言-动作扩散模型弥合高效规划和语义可解释性之间的差距。与将动作强制放入语言空间的方法不同,我们引入了一种离散动作词元化策略,从真实驾驶分布中构建一个由运动学可行航点组成的紧凑码本。此外,我们提出了几何感知嵌入学习,以确保潜在空间中的嵌入近似物理几何度量。最后,我们引入了一种动作优先解码策略,以优先生成轨迹。在nuScenes及其衍生基准上的大量实验表明,MVLAD-AD实现了更高的效率,并在规划精度方面优于最先进的自回归和扩散基线,同时提供高保真且可解释的推理。
Summary / 总结
This work addresses the challenges of inference latency, action precision, and explainability in end-to-end autonomous driving by proposing MVLAD-AD, a masked vision-language-action diffusion model. It introduces a discrete action tokenization strategy that builds a compact codebook of kinematically feasible waypoints, and geometry-aware embedding learning so that distances in the latent space approximate physical geometric metrics. The model also includes an action-priority decoding strategy that prioritizes trajectory generation. Experiments on nuScenes and derived benchmarks show that MVLAD-AD achieves superior efficiency, outperforms existing autoregressive and diffusion baselines in planning precision, and provides high-fidelity, explainable reasoning.
研究旨在通过利用掩码视觉-语言-动作扩散模型提高端到端自动驾驶系统的效率和可解释性。MVLAD-AD 引入了离散动作词元化策略和几何感知嵌入学习,以提升动作精度和可解释性。实验结果表明,MVLAD-AD 在规划精度和效率方面优于现有的自回归和扩散基线,并提供了高保真的可解释推理。
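Discrete action tokenization with a waypoint codebook can be sketched as clustering followed by nearest-neighbour lookup. The use of k-means, the deterministic initialisation, and the names `nearest`, `build_codebook`, `tokenize`, and `detokenize` are assumptions for illustration; the abstract does not say how MVLAD-AD constructs its codebook or enforces kinematic feasibility.

```python
def nearest(point, codebook):
    """Index of the codebook waypoint closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: (point[0] - codebook[i][0]) ** 2 + (point[1] - codebook[i][1]) ** 2,
    )


def build_codebook(waypoints, k, iterations=10):
    """Tiny k-means over 2D waypoints, initialised from the first k points."""
    codebook = [list(p) for p in waypoints[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in waypoints:
            groups[nearest(p, codebook)].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centre if a cluster is empty
                codebook[i] = [
                    sum(p[0] for p in group) / len(group),
                    sum(p[1] for p in group) / len(group),
                ]
    return codebook


def tokenize(trajectory, codebook):
    """Map each continuous waypoint to the index of its nearest codebook entry."""
    return [nearest(p, codebook) for p in trajectory]


def detokenize(tokens, codebook):
    """Map action tokens back to their codebook waypoints."""
    return [tuple(codebook[t]) for t in tokens]
```

Because every token decodes to a waypoint drawn from the data distribution, the planner's output vocabulary stays small and each action token has a direct geometric meaning.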