Neu-PiG: Neural Preconditioned Grids for Fast Dynamic Surface Reconstruction on Long Sequences
Authors: Julian Kaltheuner, Hannah Dröge, Markus Plack, Patrick Stotko, Reinhard Klein
Venue: CVPR 2026
First: 2026-02-25T18:59:53+00:00 · Latest: 2026-02-25T18:59:53+00:00
Comments: CVPR 2026, Code: https://github.com/vc-bonn/neu-pig
Abstract
Temporally consistent surface reconstruction of dynamic 3D objects from unstructured point cloud data remains challenging, especially for very long sequences. Existing methods either optimize deformations incrementally, risking drift and requiring long runtimes, or rely on complex learned models that demand category-specific training. We present Neu-PiG, a fast deformation optimization method based on a novel preconditioned latent-grid encoding that distributes spatial features parameterized on the position and normal direction of a keyframe surface. Our method encodes entire deformations across all time steps at various spatial scales into a multi-resolution latent grid, parameterized by the position and normal direction of a reference surface from a single keyframe. This latent representation is then augmented for time modulation and decoded into per-frame 6-DoF deformations via a lightweight multilayer perceptron (MLP). To achieve high-fidelity, drift-free surface reconstructions in seconds, we employ Sobolev preconditioning during gradient-based training of the latent space, completely avoiding the need for any explicit correspondences or further priors. Experiments across diverse human and animal datasets demonstrate that Neu-PiG outperforms state-of-the-art approaches, offering both superior accuracy and scalability to long sequences while running at least 60x faster than existing training-free methods and achieving inference speeds on the same order as heavy pretrained models.
中文标题/摘要
标题:Neu-PiG:神经预条件网格用于长序列动态表面快速重建
从无序点云数据中对动态3D对象进行时空一致的表面重建仍然具有挑战性,尤其是在非常长的序列中。现有方法要么逐步优化变形,存在漂移风险并需要长时间运行,要么依赖于复杂的基于学习的模型,需要特定类别的训练。我们提出了Neu-PiG,这是一种基于新颖预条件隐网格编码的快速变形优化方法,该编码在关键帧表面的位置和法线方向上参数化空间特征。我们的方法将所有时间步长的变形在不同空间尺度上编码到多分辨率隐网格中,该隐网格由单个关键帧的参考表面的位置和法线方向参数化。然后,通过轻量级多层感知器(MLP)对时间进行调制并解码为每帧的6-DoF变形。为了在几秒钟内实现高保真度、无漂移的表面重建,我们在基于梯度的隐空间训练期间使用Sobolev预条件化,完全避免了任何显式对应关系或进一步先验的需求。跨不同的人类和动物数据集的实验表明,Neu-PiG 在准确性和长序列的可扩展性方面均优于现有方法,运行速度至少比现有无训练方法快60倍,并且推理速度与重型预训练模型相当。
Summary / 总结
Neu-PiG addresses the challenge of reconstructing temporally consistent surfaces from long sequences of unstructured point clouds. It uses a novel preconditioned latent-grid encoding to optimize deformations efficiently, avoiding drift and requiring no explicit correspondences. Experiments show Neu-PiG outperforms state-of-the-art methods in accuracy and scalability, running at least 60x faster than existing training-free methods and achieving similar inference speeds to heavy pretrained models.
Neu-PiG 解决了从长时间序列的无序点云数据中重建时空一致表面的挑战。它采用了一种新颖的预条件化隐空间网格编码来高效优化变形,避免了漂移和复杂的训练需求。实验表明,Neu-PiG 在准确性和可扩展性方面优于现有方法,实现了与重训练模型相当的快速推理速度。
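The Neu-PiG abstract names Sobolev preconditioning of the gradients but does not spell out the operator. As a rough illustration only, the sketch below applies the textbook form of the idea on a 1D grid: the raw gradient g is replaced by the solution of (I - lam * Laplacian) g_smooth = g, which spreads sharp, local gradient spikes over neighboring cells. The function name, the 1D setting, the boundary handling, and the value of lam are all assumptions for this example and are not taken from the paper.

```python
def sobolev_precondition(grad, lam=1.0):
    """Solve (I - lam * Laplacian) g_smooth = grad on a 1D grid.

    Hypothetical illustration: the tridiagonal system is solved with the
    Thomas algorithm; boundary rows use a one-sided stencil so that a
    constant gradient is left unchanged.
    """
    n = len(grad)
    if n == 1:
        return list(grad)
    # Tridiagonal coefficients: lower (a), diagonal (b), upper (c).
    a = [0.0] + [-lam] * (n - 1)
    c = [-lam] * (n - 1) + [0.0]
    b = [1.0 + lam] + [1.0 + 2.0 * lam] * (n - 2) + [1.0 + lam]
    # Forward elimination.
    cp = [0.0] * n
    dp = [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = grad[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (grad[i] - a[i] * dp[i - 1]) / denom
    # Back substitution.
    out = [0.0] * n
    out[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        out[i] = dp[i] - cp[i] * out[i + 1]
    return out


# A single-cell spike is spread over its neighbors; the total is preserved.
smoothed = sobolev_precondition([0.0, 0.0, 1.0, 0.0, 0.0], lam=1.0)
```

Larger values of lam smooth more aggressively. In an optimizer, the smoothed gradient would simply replace the raw one before the update step.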
When Safety Collides: Resolving Multi-Category Harmful Conflicts in Text-to-Image Diffusion via Adaptive Safety Guidance
Authors: Yongli Xiang, Ziming Hong, Zhaoqing Wang, Xiangyu Zhao, Bo Han, Tongliang Liu
Venue: CVPR 2026
First: 2026-02-24T13:20:31+00:00 · Latest: 2026-02-25T18:24:58+00:00
Comments: CVPR 2026; Code is released at https://github.com/tmllab/2026_CVPR_CASG
Abstract
Text-to-Image (T2I) diffusion models have demonstrated significant advancements in generating high-quality images, while raising potential safety concerns regarding harmful content generation. Safety-guidance-based methods have been proposed to mitigate harmful outputs by steering generation away from harmful zones, where the zones are averaged across multiple harmful categories based on predefined keywords. However, these approaches fail to capture the complex interplay among different harm categories, leading to "harmful conflicts" where mitigating one type of harm may inadvertently amplify another, thus increasing overall harmful rate. To address this issue, we propose Conflict-aware Adaptive Safety Guidance (CASG), a training-free framework that dynamically identifies and applies the category-aligned safety direction during generation. CASG is composed of two components: (i) Conflict-aware Category Identification (CaCI), which identifies the harmful category most aligned with the model's evolving generative state, and (ii) Conflict-resolving Guidance Application (CrGA), which applies safety steering solely along the identified category to avoid multi-category interference. CASG can be applied to both latent-space and text-space safeguards. Experiments on T2I safety benchmarks demonstrate CASG's state-of-the-art performance, reducing the harmful rate by up to 15.4% compared to existing methods.
中文标题/摘要
标题:当安全相冲突:通过自适应安全指导解决文本到图像扩散中的多类别有害冲突
文本到图像(T2I)扩散模型在生成高质量图像方面取得了显著进展,但同时也引发了关于有害内容生成的安全问题。基于安全指导的方法已被提出,通过引导生成远离预定义关键词定义的有害区域来减轻有害输出。然而,这些方法未能捕捉不同有害类别之间的复杂相互作用,导致“有害冲突”,即减轻一种有害类型的同时可能无意中放大另一种,从而增加整体有害率。为了解决这一问题,我们提出了冲突感知自适应安全指导(CASG),这是一种无需训练的框架,在生成过程中动态识别并应用与模型生成状态最一致的有害类别方向。CASG 包含两个组件:(i) 冲突感知类别识别(CaCI),识别与模型生成状态最一致的有害类别,(ii) 冲突解决指导应用(CrGA),仅沿识别的类别应用安全引导,以避免多类别干扰。CASG 可应用于潜在空间和文本空间的安全保护。在 T2I 安全基准上的实验表明,CASG 达到了最先进的性能,与现有方法相比,有害率最多降低了 15.4%。
Summary / 总结
The paper addresses the issue of harmful content generation in Text-to-Image (T2I) models by proposing Conflict-aware Adaptive Safety Guidance (CASG). CASG dynamically identifies and applies safety guidance aligned with the most relevant harmful category during generation, thereby reducing harmful content. Experiments show that CASG outperforms existing methods, decreasing the harmful rate by up to 15.4% on T2I safety benchmarks.
研究提出了冲突感知自适应安全指导(CASG)框架,以解决文本到图像扩散模型生成有害内容的问题。CASG 动态识别并应用类别对齐的安全方向,以避免多类别有害冲突,从而将有害率降低高达 15.4%,优于现有方法。该框架由冲突感知类别识别(CaCI)和冲突解决指导应用(CrGA)两个组件组成,共同有效缓解有害内容生成问题。
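The two CASG components described above (pick the single most aligned harmful category, then steer only along it) can be pictured with a toy vector example. This is a minimal sketch under stated assumptions: cosine similarity as the alignment measure, projection removal as the steering step, and the category names and vectors are all hypothetical stand-ins. The actual method operates on diffusion latents or text embeddings, which are not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def identify_category(state, directions):
    """Return the category whose (hypothetical) harmful direction best aligns with the state."""
    return max(directions, key=lambda name: cosine(state, directions[name]))


def apply_guidance(state, direction, strength=1.0):
    """Steer away from one direction by removing (a fraction of) its projection."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(state)
    coeff = strength * sum(s * d for s, d in zip(state, direction)) / norm_sq
    return [s - coeff * d for s, d in zip(state, direction)]


# Hypothetical categories and state, chosen only for illustration.
directions = {"violence": [1.0, 0.0, 0.0], "nudity": [0.0, 1.0, 0.0]}
state = [0.9, 0.1, 0.4]
category = identify_category(state, directions)
guided = apply_guidance(state, directions[category])
```

Steering along one selected direction leaves the components along the other categories untouched, which is the intuition behind avoiding multi-category interference; averaging all directions would instead move the state along every category at once.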
LLaDA-MedV: Exploring Large Language Diffusion Models for Biomedical Image Understanding
Authors: Xuanzhao Dong, Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Peijie Qiu, Shao Tang, Xin Li, Yalin Wang
First: 2025-08-03T06:46:46+00:00 · Latest: 2026-02-25T18:15:23+00:00
Abstract
Autoregressive models (ARMs) have long dominated the landscape of biomedical vision-language models (VLMs). Recently, masked diffusion models such as LLaDA have emerged as promising alternatives, yet their application in the biomedical domain remains largely underexplored. To bridge this gap, we introduce LLaDA-MedV, the first large language diffusion model tailored for biomedical image understanding through vision instruction tuning. LLaDA-MedV achieves relative performance gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in the open-ended biomedical visual conversation task, and sets new state-of-the-art accuracy on the closed-form subset of three VQA benchmarks: 84.93% on VQA-RAD, 92.31% on SLAKE, and 95.15% on PathVQA. Furthermore, a detailed comparison with LLaVA-Med suggests that LLaDA-MedV is capable of generating reasonably longer responses by explicitly controlling response length, which can lead to more informative outputs. We also conduct an in-depth analysis of both the training and inference stages, highlighting the critical roles of initialization weight selection, fine-tuning strategies, and the interplay between sampling steps and response repetition. The code and model weights are released at https://github.com/LLM-VLM-GSL/LLaDA-MedV.
中文标题/摘要
标题:LLaDA-MedV:探索大规模语言扩散模型在生物医学图像理解中的应用
自回归模型(ARMs)长期以来主导了生物医学视觉语言模型(VLMs)的领域。最近,掩码扩散模型如LLaDA崭露头角,成为有前途的替代方案,但在生物医学领域的应用仍然相对较少。为弥合这一差距,我们引入了LLaDA-MedV,这是第一个针对生物医学图像理解的大型语言扩散模型,通过视觉指令调优。LLaDA-MedV在开放式的生物医学视觉对话任务中相对于LLaVA-Med实现了7.855%的相对性能提升,相对于LLaDA-V实现了1.867%的提升,并在三个VQA基准测试的封闭形式子集上设定了新的最先进准确率:84.93%的VQA-RAD,92.31%的SLAKE,95.15%的PathVQA。此外,与LLaVA-Med的详细比较表明,LLaDA-MedV能够通过显式控制响应长度生成更长的合理响应,这可能导致更具信息量的输出。我们还对训练和推理阶段进行了深入分析,突出了初始化权重选择、微调策略以及采样步骤与响应重复之间相互作用的关键作用。代码和模型权重在https://github.com/LLM-VLM-GSL/LLaDA-MedV上发布。
Summary / 总结
This paper introduces LLaDA-MedV, a large language diffusion model specifically designed for biomedical image understanding. It leverages vision instruction tuning and achieves relative gains of 7.855% over LLaVA-Med and 1.867% over LLaDA-V in open-ended biomedical visual conversation. LLaDA-MedV sets new state-of-the-art accuracy on the closed-form subsets of three VQA benchmarks (VQA-RAD, SLAKE, and PathVQA) and demonstrates the ability to generate longer and more informative responses through explicit control of response length.
该研究引入了LLaDA-MedV,这是一种专门针对生物医学图像理解的大语言扩散模型。受生物医学领域中掩码扩散模型应用不足的驱动,LLaDA-MedV在开放式的生物医学视觉对话任务中分别比LLaVA-Med和LLaDA-V高出7.855%和1.867%。此外,该模型在三个VQA基准测试中的得分分别为84.93%、92.31%和95.15%,达到了新的最先进水平。通过训练和推理阶段的详细比较和分析,突显了初始化权重选择、微调策略以及采样步骤与响应重复之间的相互作用的重要性。
Spilled Energy in Large Language Models
Authors: Adrian Robert Minut, Hazem Dewidar, Iacopo Masi
First: 2026-02-21T00:38:47+00:00 · Latest: 2026-02-25T18:09:08+00:00
Abstract
We reinterpret the final Large Language Model (LLM) softmax classifier as an Energy-Based Model (EBM), decomposing the sequence-to-sequence probability chain into multiple interacting EBMs at inference. This principled approach allows us to track "energy spills" during decoding, which we empirically show correlate with factual errors, biases, and failures. Similar to Orgad et al. (2025), our method localizes the exact answer token and subsequently tests for hallucinations. Crucially, however, we achieve this without requiring trained probe classifiers or activation ablations. Instead, we introduce two completely training-free metrics derived directly from output logits: spilled energy, which captures the discrepancy between energy values across consecutive generation steps that should theoretically match, and marginalized energy, which is measurable at a single step. Evaluated on nine benchmarks across state-of-the-art LLMs (including LLaMA, Mistral, and Gemma) and on synthetic algebraic operations (Qwen3), our approach demonstrates robust, competitive hallucination detection and cross-task generalization. Notably, these results hold for both pretrained and instruction-tuned variants without introducing any training overhead.
中文标题/摘要
标题:大型语言模型中的溢出能量
我们将大型语言模型(LLM)的最终Softmax分类器重新解释为能量基模型(EBM),在推理过程中将序列到序列的概率链分解为多个相互作用的EBM。这种原则性的方法使我们能够追踪解码过程中的“能量溢出”,我们实验证明这些能量溢出与事实错误、偏见和失败相关。类似于Orgad等人(2025),我们的方法定位到确切的答案标记,然后测试幻觉。然而,我们通过这种方法并不需要训练探针分类器或激活消融。相反,我们引入了两个完全无需训练的度量标准,直接从输出logits中得出:溢出能量,它捕捉了理论上应匹配的能量值在连续生成步骤之间的差异;以及边际能量,它可以在单个步骤中进行测量。在九个基准测试上评估了最先进的LLM(包括LLaMA、Mistral和Gemma),以及合成的代数运算(Qwen3),我们的方法展示了稳健且具有竞争力的幻觉检测和跨任务泛化能力。值得注意的是,这些结果对于预训练和指令微调的变体都适用,且不引入任何训练开销。
Summary / 总结
The study reinterprets the final softmax classifier of Large Language Models (LLMs) as an Energy-Based Model (EBM) to track 'energy spills' during decoding, which are correlated with factual errors, biases, and failures. The method introduces two training-free metrics, spilled energy and marginalized energy, to detect hallucinations without requiring trained probe classifiers or activation ablations. Evaluated on nine benchmarks and synthetic algebraic operations, the approach shows robust and competitive hallucination detection and cross-task generalization for both pretrained and instruction-tuned LLMs without additional training overhead.
研究将大型语言模型的最终softmax分类器重新解释为能量基模型,以追踪解码过程中的‘能量溢出’,这些溢出与事实错误、偏见和失败相关。该方法引入了两个无需训练的度量标准,即溢出能量和边际能量,用于检测幻觉,无需使用探针分类器或激活层分析。在多种基准测试和合成代数运算上的评估表明,该方法在预训练和指令微调模型上均能实现稳健的幻觉检测和跨任务泛化,且无需额外的训练开销。
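To make the two metrics above concrete, here is a minimal sketch of one plausible reading of the abstract. It assumes that the energy of a chosen token is its negative logit, that the marginalized energy of a step is the negative log-sum-exp of that step's logits, and that spilled energy is the absolute gap between the chosen-token energy at step t and the marginalized energy at step t+1. These definitions, and the function names, are assumptions made for illustration; the paper's exact formulation may differ.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def marginalized_energy(logits):
    """Energy of one step after summing out the next token (assumed: -logsumexp)."""
    return -logsumexp(logits)


def spilled_energy(logits_t, token_t, logits_next):
    """Gap between the chosen-token energy at step t and the marginalized
    energy at step t+1 (assumed definition, see lead-in)."""
    energy_chosen = -logits_t[token_t]
    return abs(energy_chosen - marginalized_energy(logits_next))


# Consistent case: the next step's logits marginalize back to the chosen logit.
consistent = spilled_energy([2.0, 0.5], 0, [2.0 + math.log(0.5)] * 2)
# Inconsistent case: the next step's logits carry a different total mass.
spill = spilled_energy([2.0, 0.5], 0, [0.0, 0.0])
```

Both quantities are computed from output logits alone, which matches the abstract's point that no probe classifier or activation access is needed.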
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Authors: Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang
First: 2026-02-25T17:50:41+00:00 · Latest: 2026-02-25T17:50:41+00:00
Comments: Code: https://github.com/lingfengren/NoLan
Abstract
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image. A natural question arises from this phenomenon: Which component of the LVLM pipeline primarily contributes to object hallucinations? The vision encoder to perceive visual information, or the language decoder to generate text responses? In this work, we strive to answer this question through designing a systematic experiment to analyze the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors from the language decoder. Based on this finding, we propose a simple and training-free framework, No-Language-Hallucination Decoding, NoLan, which refines the output distribution by dynamically suppressing language priors, modulated based on the output distribution difference between multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs on different tasks. For instance, NoLan achieves substantial improvements on POPE, enhancing the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at: https://github.com/lingfengren/NoLan.
中文标题/摘要
标题:NoLan:通过动态抑制语言先验减轻大型视觉-语言模型中的对象幻觉
对象幻觉是大型视觉-语言模型(LVLM)中的一个关键问题,模型的输出中包含输入图像中不存在的对象。从这一现象中自然会引发一个疑问:在LVLM流水线中,哪一部分主要导致了对象幻觉的产生?是用于感知视觉信息的视觉编码器,还是用于生成文本响应的语言解码器?在本研究中,我们通过设计系统实验来分析视觉编码器和语言解码器在幻觉生成中的作用。我们的观察表明,对象幻觉主要与语言解码器中的强大先验有关。基于这一发现,我们提出了一种简单且无需训练的框架,No-Language-Hallucination Decoding(NoLan),通过动态抑制语言先验来细化输出分布,该抑制基于多模态输入和纯文本输入输出分布之间的差异进行调节。实验结果表明,NoLan在不同任务的多种LVLM中有效减少了对象幻觉。例如,NoLan在POPE上取得了显著改进,分别提高了LLaVA-1.5 7B和Qwen-VL 7B的准确性6.45和7.21。代码可在:https://github.com/lingfengren/NoLan公开获取。
Summary / 总结
This paper addresses the issue of object hallucinations in Large Vision-Language Models (LVLMs) by investigating the roles of the vision encoder and language decoder. Through systematic experiments, it is found that language decoder priors are primarily responsible for object hallucinations. To mitigate this, the authors propose NoLan, a training-free framework that suppresses language priors dynamically based on the difference between multimodal and text-only inputs, leading to significant reductions in object hallucinations across various LVLMs on different tasks.
该论文通过研究视觉编码器和语言解码器在大型视觉语言模型中的作用,解决了物体幻觉的问题。提出了一种名为NoLan的框架,通过动态抑制语言先验来减轻幻觉现象。实验结果显示,NoLan显著减少了各种LVLM在不同任务上的物体幻觉,例如在POPE任务上分别提高了LLaVA-1.5 7B和Qwen-VL 7B的准确性,最多可达6.45和7.21。
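The NoLan idea of refining the output distribution by contrasting multimodal and text-only predictions can be sketched as a contrastive adjustment of logits. The example below is an illustration only: the specific rule (suppression strength proportional to the KL divergence between the two distributions), the parameter base_strength, and the toy logits are assumptions, not the modulation rule used in the paper.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def suppress_language_prior(logits_mm, logits_text, base_strength=1.0):
    """Push the multimodal logits away from the text-only logits.

    The weight alpha is set from the divergence between the two
    distributions (a hypothetical schedule chosen for this sketch).
    """
    alpha = base_strength * kl_divergence(softmax(logits_mm), softmax(logits_text))
    adjusted = [(1.0 + alpha) * m - alpha * t for m, t in zip(logits_mm, logits_text)]
    return softmax(adjusted)


# Toy two-token vocabulary: the text-only pass strongly prefers token 1.
logits_mm = [1.0, 1.2]
logits_text = [0.0, 2.0]
refined = suppress_language_prior(logits_mm, logits_text)
```

In this toy case the multimodal pass slightly prefers token 1, mostly because of the language prior; after suppression, token 0 becomes the more likely output. When the two passes agree exactly, the adjustment leaves the multimodal logits unchanged.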
Recursive Belief Vision Language Action Models
Authors: Vaidehi Bagaria, Bijo Sebastian, Nirav Kumar Patel
First: 2026-02-24T08:02:16+00:00 · Latest: 2026-02-25T17:38:24+00:00
Abstract
Vision-language-action models must enable agents to execute long-horizon tasks under partial observability. However, most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. While semantic grounding is important, long-horizon manipulation fundamentally requires persistent, action-conditioned state representations. Current VLAs lack such representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once per task, the VLM provides high-level intent, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5 percent and 37.5 percent higher success rates on multi-stage pick-and-place and stacking tasks, respectively, compared to pi_0. It also reduces inference latency by up to five times relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show the belief module is the primary driver of performance, increasing success rates from 32.5 percent without belief to 77.5 percent with belief.
中文标题/摘要
标题:递归信念视语言行动模型
视语言行动模型必须使代理能够在部分可观测性下执行长时任务。然而,大多数现有方法仍依赖于短上下文窗口或反复查询视语言模型(VLM),这导致任务进展丢失、感知同义词下的动作重复以及高推理延迟。虽然语义定位很重要,但长时操作本质上需要持久的、基于动作的状态表示。当前的VLAs缺乏这样的表示,且在时间和物理推理方面表现出有限的能力,使其不适合多阶段控制。本文引入了RB-VLA,这是一种以信念为中心的架构,通过自我监督的世界模型目标进行训练,保持一个紧凑的潜在状态编码任务相关的历史、动力学和物体交互。VLM在每次任务时查询一次,提供高层次的意图,而信念追踪任务进展,在部分可观测性下实现有阶段意识的因果控制,无需存储原始观察或随时间扩展内存。信念和意图共同条件一个扩散策略,以实现稳健的闭环执行。RB-VLA在长时任务基准测试中优于先前的VLAs,分别在多阶段取放和堆叠任务中实现了52.5%和37.5%更高的成功率,相比pi_0。它还将推理延迟降低了最多五倍,并消除了现有VLAs在时间步长上观察到的内存增长。消融实验表明,信念模块是性能的主要驱动因素,信念模块从无到有将成功率从32.5%提高到77.5%。
Summary / 总结
This paper addresses the limitations of existing vision-language-action models by introducing RB-VLA, a belief-centric architecture. RB-VLA uses self-supervised world-model objectives to maintain a compact latent state that tracks task progress and enables phase-aware control. The model outperforms prior approaches on long-horizon tasks, achieving higher success rates and reducing inference latency. Ablation studies confirm the belief module's critical role in performance improvement.
研究旨在改进视觉-语言-动作模型以执行部分可观测条件下的长期任务,解决任务进度丢失和推理延迟高等问题。方法是采用信念为中心的架构RB-VLA,通过自监督世界模型目标维护紧凑的潜在状态。这使得模型能够通过一次查询视觉-语言模型提供高层次意图,而信念则跟踪任务进度并实现阶段感知的因果控制。关键发现包括RB-VLA在多阶段拾放和堆叠任务上的表现优于先前模型,成功率分别高出52.5%和37.5%,并且将推理延迟降低了最多五倍。消融研究显示信念模块显著提高了性能。
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Authors: Jingxuan Zhang, Yunta Hsieh, Zhongwei Wan, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang
First: 2026-02-23T19:55:54+00:00 · Latest: 2026-02-25T17:11:08+00:00
Comments: CVPR 2026
Abstract
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
中文标题/摘要
标题:QuantVLA:面向视觉-语言-行动模型的规模校准后训练量化
视觉-语言-行动(VLA)模型将感知、语言和控制统一起来,为具身智能体服务,但由于计算和内存需求迅速增加,尤其是在模型扩展到更长的时间范围和更大的骨干网络时,它们在实际部署中面临重大挑战。为了解决这些瓶颈,我们提出了QuantVLA,这是一种无需训练的后训练量化(PTQ)框架,据我们所知,这是第一个针对VLA系统的PTQ方法,并且是第一个成功量化扩散变压器(DiT)动作头的方法。QuantVLA 包含三个规模校准组件:(1)一种选择性量化布局,将语言骨干和DiT中的所有线性层转换为整数,而保留注意力投影为浮点数,以保持原始操作计划;(2)注意力温度匹配,一种轻量级的每头缩放机制,稳定注意力概率,并在推理时折叠到去量化尺度中;(3)输出头平衡,一种每层残差接口校准,减轻后投影能量漂移。该框架不需要额外的训练,仅使用少量未标记的校准缓冲区,并支持低位宽权重和激活的整数内核,同时保持架构不变。在LIBERO上的代表性VLA模型上,QuantVLA 超过了全精度基线的任务成功率,实现了约70%的量化组件相对内存节省,并提供了1.22倍的端到端推理延迟加速,为在严格的计算、内存和功率限制下实现可扩展的低位宽具身智能提供了实际途径。
Summary / 总结
QuantVLA is a training-free post-training quantization framework for vision-language-action models, addressing the compute and memory challenges of scaling these models. It combines a selective quantization layout, attention temperature matching, and output head balancing to preserve model performance. On LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and speeds up end-to-end inference by 1.22x, making low-bit embodied intelligence practical under strict resource constraints.
QuantVLA 是一种无需训练的后训练量化框架,用于解决视觉-语言-行动模型在扩展时面临的计算和内存挑战。它引入了三种缩放校准组件:选择性量化布局、注意力温度匹配和输出头平衡。QuantVLA 在 LIBERO 模型上实现了超过全精度基线的任务成功率、70% 的相对内存节省和 1.22 倍的端到端推理延迟加速。
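The QuantVLA abstract states that the per-head attention temperature is folded into the dequantization scales at inference. A minimal sketch of what such folding means is shown below, using plain symmetric integer quantization of a single dot product. The quantization scheme, the function names, and the temperature value are assumptions for illustration; the calibration procedure that would produce the temperature is not shown.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers with one shared scale (symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def int_dot(q_w, q_x):
    """Integer-only accumulation, as an integer kernel would perform it."""
    return sum(a * b for a, b in zip(q_w, q_x))


def dequantize(acc, scale_w, scale_x, temperature=1.0):
    """Rescale the integer accumulator back to floating point.

    The (hypothetical) per-head temperature is multiplied into the scale,
    so it costs no extra operation at inference time.
    """
    return acc * (scale_w * scale_x * temperature)


weights = [0.5, -1.0, 0.25]
inputs = [1.0, 2.0, -4.0]
q_w, s_w = quantize_symmetric(weights)
q_x, s_x = quantize_symmetric(inputs)
acc = int_dot(q_w, q_x)
approx = dequantize(acc, s_w, s_x)            # close to the exact value -2.5
cooled = dequantize(acc, s_w, s_x, 0.5)       # same accumulator, folded scale
```

Because the temperature only changes a scalar that is applied once to the accumulator, the integer kernel and the model architecture stay unchanged, which is consistent with the abstract's description.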
GeoDiv: Framework For Measuring Geographical Diversity In Text-To-Image Models
Authors: Abhipsa Basu, Mohana Singh, Shashank Agnihotri, Margret Keuper, R. Venkatesh Babu
Venue: ICLR 2026
First: 2026-02-25T17:08:43+00:00 · Latest: 2026-02-25T17:08:43+00:00
Comments: ICLR 2026
Abstract
Text-to-image (T2I) models are rapidly gaining popularity, yet their outputs often lack geographical diversity, reinforce stereotypes, and misrepresent regions. Given their broad reach, it is critical to rigorously evaluate how these models portray the world. Existing diversity metrics either rely on curated datasets or focus on surface-level visual similarity, limiting interpretability. We introduce GeoDiv, a framework leveraging large language and vision-language models to assess geographical diversity along two complementary axes: the Socio-Economic Visual Index (SEVI), capturing economic and condition-related cues, and the Visual Diversity Index (VDI), measuring variation in primary entities and backgrounds. Applied to images generated by models such as Stable Diffusion and FLUX.1-dev across 10 entities and 16 countries, GeoDiv reveals a consistent lack of diversity and identifies fine-grained attributes where models default to biased portrayals. Strikingly, depictions of countries like India, Nigeria, and Colombia are disproportionately impoverished and worn, reflecting underlying socio-economic biases. These results highlight the need for greater geographical nuance in generative models. GeoDiv provides the first systematic, interpretable framework for measuring such biases, marking a step toward fairer and more inclusive generative systems. Project page: https://abhipsabasu.github.io/geodiv
中文标题/摘要
标题:GeoDiv:衡量文本到图像模型地理多样性框架
文本到图像(T2I)模型正迅速普及,但其输出往往缺乏地理多样性,强化刻板印象并错误地代表地区。鉴于其广泛的影响力,严格评估这些模型如何呈现世界至关重要。现有多样性指标要么依赖于精心策划的数据集,要么专注于表面视觉相似性,限制了可解释性。我们引入了 GeoDiv,一种利用大规模语言和视觉语言模型来评估地理多样性的框架,沿着两个互补轴线:社会经济视觉指数(SEVI),捕捉经济和状况相关的线索,以及视觉多样性指数(VDI),衡量主要实体和背景的变化。应用于 Stable Diffusion 和 FLUX.1-dev 生成的图像,GeoDiv 揭示了一致的缺乏多样性,并识别出模型在细微属性上默认的偏见表现。令人惊讶的是,像印度、尼日利亚和哥伦比亚这样的国家的描绘过于贫困和破旧,反映了潜在的社会经济偏见。这些结果突显了在生成模型中需要更大的地理细微差别。GeoDiv 提供了第一个系统且可解释的框架来衡量此类偏见,标志着朝着更公平和包容的生成系统迈出的一步。项目页面:https://abhipsabasu.github.io/geodiv
Summary / 总结
GeoDiv is a framework designed to measure the geographical diversity in text-to-image models, addressing the lack of diversity and stereotypes in model outputs. It uses large language and vision-language models to evaluate diversity along two axes: the Socio-Economic Visual Index (SEVI) and the Visual Diversity Index (VDI). Applied to images from models like Stable Diffusion and FLUX.1-dev, GeoDiv found that countries like India, Nigeria, and Colombia are disproportionately depicted as impoverished and worn, indicating socio-economic biases. This highlights the need for more nuanced and inclusive generative models.
GeoDiv 是一个用于评估文本到图像模型中地理多样性的框架,解决了模型输出中缺乏多样性和刻板印象的问题。它利用大型语言和视觉语言模型从社会经济视觉指数(SEVI)和视觉多样性指数(VDI)两个维度进行评估。应用于 Stable Diffusion 和 FLUX.1-dev 生成的图像,GeoDiv 发现模型往往将印度、尼日利亚和哥伦比亚等国家描绘为极度贫困,反映了社会经济偏见。这项工作提供了一种系统且可解释的方法来评估和缓解生成模型中的这些偏见。
TIPS Over Tricks: Simple Prompts for Effective Zero-shot Anomaly Detection
Authors: Alireza Salehi, Ehsan Karami, Sepehr Noey, Sahand Noey, Makoto Yamada, Reshad Hosseini, Mohammad Sabokrou
Venue: ICASSP
First: 2026-02-03T14:48:11+00:00 · Latest: 2026-02-25T17:00:45+00:00
Comments: This is the extended version of the paper accepted in ICASSP'26, which will be publicly available in May. Authors' contributions may vary among the versions
Abstract
Anomaly detection identifies departures from expected behavior in safety-critical settings. When target-domain normal data are unavailable, zero-shot anomaly detection (ZSAD) leverages vision-language models (VLMs). However, CLIP's coarse image-text alignment limits both localization and detection due to (i) spatial misalignment and (ii) weak sensitivity to fine-grained anomalies; prior works compensate with complex auxiliary modules yet largely overlook the choice of backbone. We revisit the backbone and use TIPS, a VLM trained with spatially aware objectives. While TIPS alleviates CLIP's issues, it exposes a distributional gap between global and local features. We address this with decoupled prompts (fixed for image-level detection and learnable for pixel-level localization) and by injecting local evidence into the global score. Without CLIP-specific tricks, our TIPS-based pipeline improves image-level performance by 1.1-3.9% and pixel-level by 1.5-6.9% across seven industrial datasets, delivering strong generalization with a lean architecture. Code is available at github.com/AlirezaSalehy/Tipsomaly.
中文标题/摘要
标题:TIPS胜于技巧:简单提示实现有效的零样本异常检测
异常检测在安全关键环境中识别偏离预期行为。当目标域正常数据不可用时,零样本异常检测(ZSAD)利用视觉语言模型(VLMs)。然而,CLIP的粗略图像-文本对齐限制了定位和检测,原因在于(i)空间对齐不准确和(ii)对细微异常的敏感性较弱;先前的工作通过复杂的辅助模块进行补偿,但很大程度上忽视了主干网络的选择。我们重新审视了主干网络,并使用TIPS-VLM,该模型通过空间感知目标进行训练。虽然TIPS缓解了CLIP的问题,但它暴露了全局特征和局部特征之间的分布差距。我们通过分离的提示(固定用于图像级检测,可学习用于像素级定位)和将局部证据注入全局评分来解决这一问题。在不使用CLIP特定技巧的情况下,基于TIPS的管道在七个工业数据集上分别提高了图像级性能1.1-3.9%和像素级性能1.5-6.9%,实现了强大的泛化能力,同时保持了简洁的架构。代码可在github.com/AlirezaSalehy/Tipsomaly获取。
Summary / 总结
The paper addresses the challenge of zero-shot anomaly detection in scenarios where target-domain normal data is unavailable. It proposes using TIPS, a vision-language model trained with spatially aware objectives, as the backbone in place of CLIP. The method uses decoupled prompts (fixed prompts for image-level detection and learnable prompts for pixel-level localization) and injects local evidence into the global score, which addresses the distributional gap between global and local features. The TIPS-based approach outperforms previous methods, improving image-level performance by 1.1-3.9% and pixel-level performance by 1.5-6.9% across seven industrial datasets, while maintaining strong generalization with a simpler architecture.
论文针对目标领域正常数据不可用的安全关键设置中的零样本异常检测挑战,提出使用具有空间感知目标训练的视觉-语言模型TIPS来改进异常检测。方法包括用于图像级检测的固定提示和用于像素级定位的学习提示,将局部证据注入全局评分中。TIPS基线方法在七个工业数据集上分别提高了1.1-3.9%的图像级检测性能和1.5-6.9%的像素级检测性能,展示了强大的泛化能力与简洁的架构。
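The step of injecting local evidence into the global score can be pictured as blending the image-level anomaly score with a statistic of the pixel-level anomaly map. The sketch below is an assumption-laden illustration: the use of the map's maximum as the local statistic, the linear blend, and the weight parameter are hypothetical choices, since the abstract does not specify the fusion rule.

```python
def fuse_scores(global_score, pixel_map, weight=0.5):
    """Blend an image-level score with the strongest pixel-level response.

    global_score: image-level anomaly score in [0, 1]
    pixel_map:    2D list of pixel-level anomaly scores in [0, 1]
    weight:       hypothetical mixing weight for the local evidence
    """
    local_peak = max(max(row) for row in pixel_map)
    return (1.0 - weight) * global_score + weight * local_peak


# A small defect that the global score alone underestimates.
pixel_map = [[0.1, 0.1],
             [0.1, 0.9]]
fused = fuse_scores(0.2, pixel_map, weight=0.5)
```

With weight set to 0 the image-level score is returned unchanged, and with weight set to 1 the decision rests entirely on the local peak, so the parameter controls how much a small, localized anomaly can raise the image-level score.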
Brain3D: Brain Report Automation via Inflated Vision Transformers in 3D
Authors: Mariano Barone, Francesco Di Serio, Giuseppe Riccio, Antonio Romano, Marco Postiglione, Antonino Ferraro, Vincenzo Moscato
First: 2026-02-25T16:46:45+00:00 · Latest: 2026-02-25T16:46:45+00:00
Abstract
Current medical vision-language models (VLMs) process volumetric brain MRI using 2D slice-based approximations, fragmenting the spatial context required for accurate neuroradiological interpretation. We developed Brain3D, a staged vision-language framework for automated radiology report generation from 3D brain tumor MRI. Our approach inflates a pretrained 2D medical encoder into a native 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Unlike generalist 3D medical VLMs, Brain3D is tailored to neuroradiology, where hemispheric laterality, tumor infiltration patterns, and anatomical localization are critical. Evaluated on 468 subjects (BraTS pathological cases plus healthy controls), our model achieves a Clinical Pathology F1 of 0.951 versus 0.413 for a strong 2D baseline while maintaining perfect specificity on healthy scans. The staged alignment proves essential: contrastive grounding establishes visual-textual correspondence, projector warmup stabilizes conditioning, and LoRA adaptation shifts output from verbose captions to structured clinical reports. Our code is publicly available for transparency and reproducibility.
中文标题/摘要
标题:Brain3D:通过膨胀的3D视觉变换器进行脑部报告自动化
当前的医学视觉-语言模型(VLMs)使用2D切片近似处理脑部MRI体积数据,这会割裂准确神经放射学解释所需的空间上下文。我们开发了Brain3D,这是一种分阶段的视觉-语言框架,用于从3D脑肿瘤MRI生成自动化放射学报告。我们的方法将预训练的2D医学编码器膨胀为原生3D架构,并通过三个阶段逐步与因果语言模型对齐:对比性语义定位、监督投影预热和基于LoRA的语言专业化。与通用的3D医学VLMs不同,Brain3D专门针对神经放射学,其中半球侧向性、肿瘤浸润模式和解剖定位至关重要。在468名受试者(包括BraTS病理病例和健康对照)上进行评估,我们的模型在临床病理F1分数上达到0.951,而强2D基线模型仅为0.413,同时在健康扫描上保持完美的特异性。分阶段对齐至关重要:对比性语义定位建立视觉-文本对应关系,投影预热稳定条件,而LoRA适应使输出从冗长的描述性标题转变为结构化的临床报告。我们的代码已公开以确保透明性和可再现性。
Summary / 总结
Brain3D is a staged vision-language framework designed for automated radiology report generation from 3D brain tumor MRI. It inflates a 2D medical encoder into a 3D architecture and progressively aligns it with a causal language model through three stages: contrastive grounding, supervised projector warmup, and LoRA-based linguistic specialization. Evaluated on 468 subjects, Brain3D achieves a Clinical Pathology F1 score of 0.951, outperforming a strong 2D baseline and maintaining perfect specificity on healthy scans. The staged alignment is crucial, as it establishes visual-textual correspondence, stabilizes conditioning, and shifts output to structured clinical reports.
Brain3D 是一种分阶段的视觉-语言框架,用于从 3D 脑肿瘤 MRI 中自动生成放射学报告。它将一个预训练的 2D 医疗编码器扩展为 3D 架构,并通过三个阶段逐步与因果语言模型对齐:对比性 grounding、监督投影预热和 LoRA 基础的语言专业化。在 468 个受试者(包括 BraTS 病理病例和健康对照)上进行评估,Brain3D 达到了 0.951 的临床病理 F1 分数,显著优于 2D 基线,同时在健康扫描上保持了完美的特异性。分阶段的对齐对于建立视觉-文本对应关系和将输出转换为结构化的临床报告至关重要。
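The abstract says a pretrained 2D encoder is inflated into a 3D architecture but does not state the inflation rule. A common convention for inflating 2D weights (popularized by I3D-style video models) is to repeat each 2D kernel along the new depth axis and divide by the depth, so that a volume made of identical slices produces the same response as the original 2D filter. The sketch below shows that convention on a single kernel; treating it as Brain3D's rule is an assumption, and all names are hypothetical.

```python
def inflate_kernel(kernel_2d, depth):
    """Repeat a 2D kernel `depth` times along a new axis and rescale by 1/depth."""
    return [[[w / depth for w in row] for row in kernel_2d] for _ in range(depth)]


def response_2d(patch, kernel):
    """Dot product between a 2D patch and a 2D kernel of the same shape."""
    return sum(p * w for prow, krow in zip(patch, kernel) for p, w in zip(prow, krow))


def response_3d(volume, kernel_3d):
    """Dot product between a stack of 2D patches and an inflated 3D kernel."""
    return sum(response_2d(s, k) for s, k in zip(volume, kernel_3d))


kernel = [[1.0, 0.0],
          [0.0, -1.0]]
patch = [[3.0, 1.0],
         [2.0, 5.0]]
volume = [patch, patch, patch]          # the same slice repeated along depth
kernel_3d = inflate_kernel(kernel, depth=3)
r2 = response_2d(patch, kernel)
r3 = response_3d(volume, kernel_3d)     # matches r2 up to rounding
```

This initialization lets the 3D model start from the behavior of the 2D encoder on slice-constant inputs, and later training stages can then learn genuinely volumetric filters.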
Training-Free Generative Modeling via Kernelized Stochastic Interpolants
Authors: Florentin Coeurdoux, Etienne Lempereur, Nathanaël Cuvelle-Magar, Thomas Eboli, Stéphane Mallat, Anastasia Borovykh, Eric Vanden-Eijnden
First: 2026-02-23T17:26:09+00:00 · Latest: 2026-02-25T16:39:12+00:00
Abstract
We develop a kernel method for generative modeling within the stochastic interpolant framework, replacing neural network training with linear systems. The drift of the generative SDE is $\hat b_t(x) = \nabla\phi(x)^\top \eta_t$, where $\eta_t \in \mathbb{R}^P$ solves a $P\times P$ system computable from data, with $P$ independent of the data dimension $d$. Since estimates are inexact, the diffusion coefficient $D_t$ affects sample quality; the optimal $D_t^*$ from Girsanov diverges at $t=0$, but this poses no difficulty and we develop an integrator that handles it seamlessly. The framework accommodates diverse feature maps -- scattering transforms, pretrained generative models etc. -- enabling training-free generation and model combination. We demonstrate the approach on financial time series, turbulence, and image generation.
Summary / 总结
The paper presents a kernel-based method for generative modeling using stochastic interpolants, avoiding the need for neural network training. The drift of the generative SDE is the product of the transposed gradient of a feature map and the solution of a P-by-P linear system computed from data, where P does not depend on the data dimension. Because the estimates are inexact, the diffusion coefficient affects sample quality; the optimal coefficient obtained from Girsanov's theorem diverges at t=0, and the authors develop an integrator that handles this divergence. The method can use various feature maps and allows for training-free generation and model combination. Experiments demonstrate the approach on financial time series, turbulence, and image generation.
该论文提出了一种使用随机插值的核方法进行生成建模,无需进行神经网络训练。生成SDE的漂移由特征映射φ的梯度与从数据中导出的线性系统解η_t点积定义。扩散系数D_t影响样本质量,但最优的D_t^*在t=0时发散,通过开发的积分器可以无缝处理这一问题。该方法支持多种特征映射,并允许无训练生成和模型组合。在金融时间序列、湍流和图像生成上的实验展示了有希望的结果。
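To illustrate the structure of the drift in the kernelized stochastic interpolant entry above, the sketch below evaluates the product of the transposed feature-map Jacobian with a coefficient vector for a tiny hand-written feature map with d = 2 and P = 3. The feature map, the point x, and the coefficient vector eta are hypothetical placeholders. In the paper, eta_t comes from solving a P-by-P linear system built from data; that system is not described in the abstract and is not reproduced here.

```python
def features(x):
    """Hypothetical feature map phi: R^2 -> R^3."""
    return [x[0], x[1], x[0] * x[1]]


def jacobian(x):
    """Jacobian of `features` at x, stored as P rows of length d."""
    return [[1.0, 0.0],
            [0.0, 1.0],
            [x[1], x[0]]]


def drift(x, eta):
    """Evaluate grad(phi)(x)^T eta, a vector in R^d."""
    jac = jacobian(x)
    return [sum(jac[p][j] * eta[p] for p in range(len(eta))) for j in range(len(x))]


x = [2.0, 3.0]
eta = [1.0, -1.0, 0.5]        # placeholder coefficients, not solved from data
b = drift(x, eta)
```

The cost of each drift evaluation scales with P times d, while the linear solve that produces eta depends only on P, which is the sense in which the system size is independent of the data dimension.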
Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers
Authors: Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun
First: 2025-07-11T09:07:43+00:00 · Latest: 2026-02-25T16:14:20+00:00
Abstract
Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a great challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We find that naïve latent upsampling introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatches caused by noise-timestep discrepancies. Based on these findings, we propose a training-free spatial acceleration framework, dubbed Region-Adaptive Latent Upsampling (RALU), that mitigates those artifacts while accelerating DiTs spatially through mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration with early upsampling only on artifact-prone edge regions and noise-timestep matching for different latent resolutions, leading to up to 7.0$\times$ speedup on FLUX-1.dev and 3.0$\times$ on Stable Diffusion 3 with negligible quality degradation. Furthermore, RALU is complementary to existing temporal acceleration methods and timestep-distilled models, leading to up to 15.9$\times$ speedup.
中文标题/摘要
标题:无训练混合分辨率潜在上采样以实现空间加速扩散变换器
扩散变换器(DiTs)提供了出色的高保真生成可扩展性,但其计算开销对实际部署构成了巨大挑战。现有加速方法主要利用时间维度,而空间加速则被严重忽视。在本文中,我们通过潜在上采样研究了DiTs的空间加速。我们发现,简单的空间加速潜在上采样引入了伪影,主要是由于高频边缘区域的混叠和噪声时间步长差异导致的不匹配。基于这些发现和分析,我们提出了一种无需训练的空间加速框架,称为区域自适应潜在上采样(RALU),以减轻这些伪影并实现DiTs的空间加速。RALU通过仅在易产生伪影的边缘区域进行早期上采样和不同潜在分辨率的噪声时间步长匹配,实现了无伪影、高效的加速,分别在FLUX-1.dev和Stable Diffusion 3上实现了最高7.0倍和3.0倍的加速,且质量下降可忽略不计。此外,我们的RALU可以补充现有时间加速方法和时间步长提炼模型,实现最高15.9倍的加速。
Summary / 总结
This work addresses the computational challenge of deploying diffusion transformers (DiTs) by proposing a training-free spatial acceleration framework called Region-Adaptive Latent Upsampling (RALU). The method mitigates artifacts from naive latent upsampling by upsampling only edge regions early and by matching noise levels to timesteps across different latent resolutions. RALU achieves up to 7.0× speedup on FLUX-1.dev and 3.0× on Stable Diffusion 3 with minimal quality loss, and can be combined with existing temporal acceleration methods and timestep-distilled models for up to 15.9× speedup.
本文提出了一种无训练的空间加速框架Region-Adaptive Latent Upsampling (RALU),通过仅在边缘区域进行早期上采样并确保不同潜空间分辨率的噪声时间步匹配,来缓解简单潜空间上采样引入的伪影,从而在FLUX-1.dev上实现最高7.0×的加速,在Stable Diffusion 3上实现最高3.0×的加速,且质量几乎没有下降。此外,RALU还可以与现有的时间加速方法兼容,实现最高15.9×的加速。
Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers
Authors: Guandong Li
First: 2026-02-20T06:24:20+00:00 · Latest: 2026-02-25T15:33:35+00:00
Abstract
Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(δ_k, δ_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
中文标题/摘要
标题:扩散变换器中的双通道注意力引导无训练编辑控制
基于扩散变换器(DiT)架构的扩散基础图像编辑模型对无训练编辑强度控制提出了关键要求。现有的注意力操作方法仅专注于键空间来调节注意力路由,而完全忽略了值空间——它控制特征聚合。在本文中,我们首先揭示了DiT多模态注意力层中的键投影和值投影都表现出明显的偏差-增量结构,其中令牌嵌入紧密围绕特定层的偏差向量聚类。基于这一观察,我们提出了双通道注意力引导(DCAG),这是一种无训练框架,可以同时操作键通道(控制注意力的方向)和值通道(控制聚合的内容)。我们提供了理论分析,表明键通道通过非线性softmax函数操作,作为粗略的控制旋钮,而值通道通过线性加权和操作,作为精细的补充。两者结合的二维参数空间$(δ_k, δ_v)$能够比任何单通道方法提供更精确的编辑保真度权衡。在PIE-Bench基准(700张图像,10种编辑类别)上的广泛实验表明,DCAG在所有保真度指标上都优于仅键引导,特别是在局部编辑任务如对象删除(LPIPS减少4.9%)和对象添加(LPIPS减少3.2%)方面取得了最显著的改进。
Summary / 总结
This paper addresses the need for training-free control over editing intensity in diffusion-based image editing models using the Diffusion Transformer (DiT) architecture. It introduces Dual-Channel Attention Guidance (DCAG), which manipulates both the Key and Value channels to control attention and feature aggregation, respectively. Experiments show that DCAG outperforms Key-only guidance in terms of editing fidelity, particularly in localized editing tasks like object deletion and addition, reducing LPIPS by 4.9% and 3.2%, respectively.
本文针对使用Diffusion Transformer (DiT)架构的基于扩散的图像编辑模型中训练免费的编辑强度控制需求,提出了Dual-Channel Attention Guidance (DCAG)框架,同时操纵Key通道和Value通道以实现更精确的编辑。理论分析表明,Key通道控制注意力的方向,而Value通道控制特征聚合。实验在PIE-Bench基准上显示,DCAG在局部编辑任务中优于Key-only指导,分别在对象删除和对象添加任务中降低了LPIPS值4.9%和3.2%。
RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
Authors: I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen
First: 2026-02-25T15:27:57+00:00 · Latest: 2026-02-25T15:27:57+00:00
Comments: Accepted by CVPR2026; Project Page: https://robustvisrag.github.io
Abstract
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
中文标题/摘要
标题:RobustVisRAG:在视觉退化条件下具有因果关系意识的基于视觉检索增强生成
基于视觉的检索增强生成(VisRAG)利用视觉语言模型(VLMs)联合检索相关视觉文档,并基于多模态证据生成基于事实的答案。然而,现有的VisRAG模型在视觉输入遭受模糊、噪声、低光照或阴影等退化时性能下降,因为语义和退化因素在预训练视觉编码器中交织在一起,导致检索和生成阶段出现错误。为解决这一局限性,我们提出了RobustVisRAG,这是一种因果关系引导的双路径框架,该框架在保持效率和零样本泛化能力的同时提高了VisRAG的鲁棒性。RobustVisRAG使用非因果路径通过单向注意力捕捉退化信号,并使用这些信号学习因果路径中的净化语义。通过提出的非因果退化建模和因果语义对齐目标,该框架确保语义和退化之间的清晰分离,从而在具有挑战性的视觉条件下实现稳定的检索和生成。为了在现实条件下评估鲁棒性,我们引入了Distortion-VisRAG数据集,这是一个包含七个领域中合成和真实世界退化文档的大规模基准,其中包括12种合成和5种真实退化类型,全面反映了实际视觉退化。实验结果表明,RobustVisRAG在真实世界退化条件下分别提高了检索、生成和端到端性能7.35%、6.35%和12.40%,同时在干净输入上保持了相当的准确性。
Summary / 总结
RobustVisRAG is a causality-guided dual-path framework that enhances the robustness of Vision-based Retrieval-Augmented Generation (VisRAG) models under visual degradations. It uses a non-causal path to capture degradation signals and a causal path to learn purified semantics, improving retrieval and generation performance by 7.35% and 6.35%, respectively, on real-world degradations, while maintaining accuracy on clean inputs. The framework is evaluated on the Distortion-VisRAG dataset, which includes both synthetic and real-world degraded documents across seven domains.
RobustVisRAG 是一种因果引导的双路径框架,旨在增强视觉检索增强生成(VisRAG)模型在视觉退化条件下的鲁棒性。该框架通过非因果路径捕捉退化信号,并通过因果路径学习净化的语义,从而在真实世界退化条件下分别提高检索和生成性能 7.35% 和 6.35%。同时,该框架在准确度上保持与干净输入相当的水平,并在包含七个领域中合成和真实世界退化文档的 Distortion-VisRAG 数据集上进行了评估。
PanoEnv: Exploring 3D Spatial Intelligence in Panoramic Environments with Reinforcement Learning
Authors: Zekai Lin, Xu Zheng
First: 2026-02-25T15:12:17+00:00 · Latest: 2026-02-25T15:12:17+00:00
Abstract
360 panoramic images are increasingly used in virtual reality, autonomous driving, and robotics for holistic scene understanding. However, current Vision-Language Models (VLMs) struggle with 3D spatial reasoning on Equirectangular Projection (ERP) images due to geometric distortion and limited 3D supervision. We introduce PanoEnv, a large-scale VQA benchmark built from synthetic 3D environments, containing 14.8K questions across five categories (e.g., relative position, volume comparison) grounded in accurate 3D annotations including depth, segmentation, and bounding boxes. Benchmarking 14 state-of-the-art VLMs reveals limited 3D understanding, achieving only 49.34% overall accuracy and 8.36% on open-ended (OE) questions. To enhance 3D reasoning, we propose a reinforcement learning post-training framework based on Group Relative Policy Optimization (GRPO) with a ground-truth-guided reward that incorporates five geometry-aware strategies such as distance tolerance and spatial consistency. A two-stage curriculum further mitigates catastrophic forgetting: Stage 1 trains on structured tasks (true/false and multiple choice), and Stage 2 fine-tunes on mixed open-ended data to improve generalization. Our 7B model achieves new state-of-the-art performance, improving overall accuracy to 52.93% (+3.59%) and open-ended accuracy to 14.83% while maintaining structured-task performance. It also achieves top semantic evaluation scores (Q-Score 6.24, P-Score 5.95), surpassing 32B models. These results demonstrate that PanoEnv-QA and our curriculum-based RL framework effectively instill 3D spatial intelligence in VLMs for omnidirectional perception.
中文标题/摘要
标题:PanoEnv:使用强化学习探索全景环境中的3D空间智能
360度全景图像在虚拟现实、自动驾驶和机器人技术中越来越被用于整体场景理解。然而,当前的视觉-语言模型(VLMs)在球面投影(ERP)图像上的3D空间推理方面存在困难,这主要是由于几何失真和有限的3D监督。我们引入了PanoEnv,这是一个基于合成3D环境构建的大规模VQA基准,包含14800个问题,涵盖五个类别(例如,相对位置、体积比较),这些问题基于精确的3D注释,包括深度、分割和边界框。对14个最先进的VLMs进行基准测试显示,其3D理解能力有限,总体准确率为49.34%,开放性问题准确率为8.36%。为了增强3D推理,我们提出了一种基于组相对策略优化(GRPO)的强化学习后训练框架,该框架采用基于真实值的奖励,结合了五种几何感知策略,如距离容差和空间一致性。两阶段课程进一步减轻了灾难性遗忘:第一阶段在结构化任务(真/假和多项选择)上进行训练,第二阶段在混合开放性数据上进行微调以提高泛化能力。我们的7B模型达到了新的最先进的性能,总体准确率提高到52.93%(+3.59%),开放性问题准确率提高到14.83%,同时保持了结构化任务的性能。它还实现了顶级语义评估分数(Q-Score 6.24,P-Score 5.95),超过了32B模型。这些结果表明,PanoEnv-QA和我们基于课程的RL框架有效地在VLMs中植入了3D空间智能,以实现全方位感知。
Summary / 总结
The paper introduces PanoEnv, a VQA benchmark for 3D spatial reasoning in panoramic environments, addressing the limitations of current VLMs in 3D understanding. It benchmarks 14 state-of-the-art VLMs and finds limited 3D understanding, with only 49.34% overall accuracy. To improve 3D reasoning, the authors propose a reinforcement learning post-training framework using GRPO and a two-stage curriculum to mitigate catastrophic forgetting. Their model achieves new state-of-the-art performance, improving overall accuracy to 52.93% and open-ended accuracy to 14.83%. The model also outperforms 32B models in semantic evaluation scores.
论文介绍了PanoEnv,这是一个用于全景环境3D空间推理的VQA基准,旨在解决当前VLMs在处理3D几何失真方面的局限性。它对14种最先进的VLMs进行了基准测试,发现3D理解能力有限,总体准确率仅为49.34%。为了提高3D推理能力,作者提出了一种基于Group Relative Policy Optimization (GRPO)的强化学习框架,并使用基于真实值的奖励和两阶段课程来缓解灾难性遗忘。他们的模型达到了新的最先进的性能,将总体准确率提高到52.93%,开放性问题准确率提高到14.83%,并在语义评估中也超过了更大的模型。
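The GRPO-based post-training described above can be illustrated with a minimal sketch. It shows the standard group-relative advantage normalization used by GRPO, paired with a distance-tolerant reward. The reward shape, the `tolerance` value, and the function names are assumptions made for illustration only; the abstract names "distance tolerance" as one strategy but does not specify its form.

```python
import statistics


def distance_tolerant_reward(predicted, target, tolerance=0.1):
    """Hypothetical reward: 1.0 inside a relative tolerance band, then linear decay to 0."""
    rel_err = abs(predicted - target) / max(abs(target), 1e-8)
    if rel_err <= tolerance:
        return 1.0
    return max(0.0, 1.0 - (rel_err - tolerance))


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to a distance question whose ground truth is 2.0 (illustrative values).
predictions = [2.1, 2.6, 3.5, 1.0]
rewards = [distance_tolerant_reward(p, 2.0) for p in predictions]
advantages = group_relative_advantages(rewards)
```

Answers closer to the ground truth receive positive advantages and distant ones negative, so the policy update favors the better samples within each group without needing a separate value model.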
QCS-ADME: Quantum Circuit Search for Drug Property Prediction with Imbalanced Data and Regression Adaptation
Authors: Kangyu Zheng, Tianfan Fu, Zhiding Liang
First: 2025-03-02T19:29:04+00:00 · Latest: 2026-02-25T15:06:31+00:00
Abstract
The biomedical field is beginning to explore the use of quantum machine learning (QML) for tasks traditionally handled by classical machine learning, especially in predicting ADME (absorption, distribution, metabolism, and excretion) properties, which are essential in drug evaluation. However, ADME tasks pose unique challenges for existing quantum circuit search (QCS) frameworks, as they involve both classification on imbalanced datasets and regression problems. These dual requirements make it necessary to adapt and refine current QCS frameworks to effectively address the complexities of ADME predictions. We propose a novel training-free scoring mechanism to evaluate QML circuit performance on imbalanced classification and regression tasks. Our mechanism demonstrates significant correlation between scoring metrics and test performance on imbalanced classification tasks. Additionally, we develop methods to quantify continuous similarity relationships between quantum states, enabling performance prediction for regression tasks. This represents a novel training-free approach to searching and evaluating circuits with QCS specifically for regression applications. Validation on representative ADME tasks (eight imbalanced classification and four regression) demonstrates moderate correlation between our scoring metrics and circuit performance, significantly outperforming baseline scoring methods that show negligible correlation.
Summary / 总结
The paper addresses the challenge of predicting ADME properties using quantum machine learning (QML) in the context of imbalanced data and regression tasks. It introduces a training-free scoring mechanism to evaluate QML circuit performance on both classification and regression tasks, showing significant correlation with test performance. The method also quantifies continuous similarity relationships between quantum states for regression tasks, demonstrating moderate correlation with circuit performance and outperforming baseline methods.
研究旨在利用量子机器学习(QML)解决ADME属性预测中的不平衡数据集和回归任务挑战。作者提出了一种无需训练的评分机制来评估QML电路在分类和回归任务中的性能。主要发现包括评分指标与不平衡分类任务测试性能之间存在显著相关性,以及在回归任务中存在适度的相关性,优于基线方法的微弱相关性。
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
Authors: Shivam Chandhok, Qian Yang, Oscar Manas, Kanishk Jain, Leonid Sigal, Aishwarya Agrawal
Venue: CVPR 2026
First: 2025-06-01T17:05:35+00:00 · Latest: 2026-02-25T15:01:04+00:00
Comments: CVPR 2026
Abstract
Instruction tuning has been central to the success of recent vision-language models (VLMs), but it remains expensive, requiring large-scale datasets, high-quality annotations, and large compute budgets. We propose PRioritized cOncept learninG via Relative Error-driven Sample Selection (PROGRESS), a data- and compute-efficient framework that enables VLMs to dynamically select what to learn next based on their evolving needs during training. At each stage, the model tracks its learning progress across skills and selects the most informative samples: those it has not already mastered and that are not too difficult to learn at the current stage of training. This strategy effectively controls skill acquisition and the order in which skills are learned. Specifically, we sample from skills showing the highest learning progress, prioritizing those with the most rapid improvement. Unlike prior methods, PROGRESS requires no upfront answer annotations, queries answers only on an as-needed basis, avoids reliance on additional supervision from auxiliary VLMs, and does not require compute-heavy gradient computations for data selection. Experiments across multiple instruction-tuning datasets of varying scales demonstrate that PROGRESS consistently outperforms state-of-the-art baselines with much less data and supervision. Additionally, we show strong cross-architecture generalization and transferability to larger models, validating PROGRESS as a scalable solution for efficient learning.
中文标题/摘要
标题:学习重要事项:基于相对误差驱动样本选择的优先概念学习
指令调优一直是最近视觉-语言模型(VLMs)成功的关键,但仍然非常昂贵,需要大规模数据集、高质量注释和大量计算预算。我们提出了基于相对误差驱动样本选择的优先概念学习(PROGRESS)框架,这是一种数据和计算高效的框架,使VLMs能够在训练过程中根据其不断变化的需求动态选择学习内容。在每个阶段,模型会跟踪其在各种技能上的学习进度,并选择最有信息量的样本:那些它尚未掌握、且在当前训练阶段不至于太难学习的样本。这种策略有效地控制了技能获取及其学习顺序。具体来说,我们从学习进度最高的技能中采样,优先选择那些进步最快的技能。与先前的方法不同,PROGRESS不需要提前的正确答案注释,仅在需要时查询答案,避免了对辅助VLM的额外监督依赖,并不需要用于数据选择的计算密集型梯度计算。在多个不同规模的指令调优数据集上的实验表明,PROGRESS在使用更少数据和监督的情况下,始终优于最先进的基线。此外,我们展示了其在不同架构之间的强泛化能力和向更大模型的可转移性,验证了PROGRESS作为高效学习的可扩展解决方案的有效性。
Summary / 总结
The paper introduces PROGRESS, a data- and compute-efficient framework for instruction tuning in vision-language models. It dynamically selects samples for learning based on relative error, enabling the model to focus on skills it has not mastered and that are not too challenging. Experiments show that PROGRESS outperforms existing methods with less data and supervision, and it generalizes well across different architectures and model sizes.
研究旨在通过提出PROGRESS框架来解决视觉-语言模型指令调优的高成本问题。PROGRESS使模型能够根据其不断变化的需求动态选择最具信息量的学习样本。实验表明,PROGRESS在各种规模的指令调优数据集上比最先进的基线方法表现出色,且所需的数据和监督较少。
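The skill-prioritization idea above (sample more from skills that are improving fastest) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the definition of progress as relative accuracy change between two checkpoints, the softmax weighting, the `temperature` parameter, and the skill names are all hypothetical.

```python
import math


def relative_progress(prev_acc, curr_acc, eps=1e-8):
    """Assumed progress signal: relative change in per-skill accuracy between checkpoints."""
    return {s: (curr_acc[s] - prev_acc[s]) / (prev_acc[s] + eps) for s in curr_acc}


def sampling_weights(progress, temperature=1.0):
    """Softmax over progress scores; a lower temperature concentrates sampling on top skills."""
    top = max(progress.values())
    exps = {s: math.exp((p - top) / temperature) for s, p in progress.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}


# Illustrative per-skill accuracies at two consecutive checkpoints.
prev = {"counting": 0.40, "ocr": 0.80, "spatial": 0.20}
curr = {"counting": 0.50, "ocr": 0.81, "spatial": 0.20}
weights = sampling_weights(relative_progress(prev, curr), temperature=0.1)
```

In this toy setting "counting" improves fastest and receives the largest sampling weight, while the near-mastered "ocr" skill and the stalled "spatial" skill are sampled less often.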
MindDriver: Introducing Progressive Multimodal Reasoning for Autonomous Driving
Authors: Lingjun Zhang, Yujian Yuan, Changjie Wu, Xinyuan Chang, Xin Cai, Shuang Zeng, Linzhe Shi, Sijin Wang, Hang Zhang, Mu Xu
First: 2026-02-25T14:34:50+00:00 · Latest: 2026-02-25T14:34:50+00:00
Comments: CVPR2026; Yujian Yuan and Lingjun Zhang contributed equally with random order
Abstract
Vision-Language Models (VLMs) exhibit strong reasoning capabilities, showing promise for end-to-end autonomous driving systems. Chain-of-Thought (CoT), the reasoning strategy most widely used with VLMs, faces critical challenges. Existing textual CoT suffers from a large gap between the text semantic space and the trajectory physical space. Although a recent approach replaces text with future images as the CoT process, it lacks clear planning-oriented objective guidance to generate images with accurate scene evolution. To address these issues, we propose MindDriver, a progressive multimodal reasoning framework that enables VLMs to imitate human-like progressive thinking for autonomous driving. MindDriver proceeds through semantic understanding, semantic-to-physical space imagination, and physical-space trajectory planning. To achieve aligned reasoning processes in MindDriver, we develop a feedback-guided automatic data annotation pipeline to generate aligned multimodal reasoning training data. Furthermore, we develop a progressive reinforcement fine-tuning method to optimize the alignment through progressive high-level reward-based learning. MindDriver demonstrates superior performance in both nuScenes open-loop and Bench2Drive closed-loop evaluation. Codes are available at https://github.com/hotdogcheesewhite/MindDriver.
中文标题/摘要
标题:MindDriver:引入渐进多模态推理以实现自动驾驶
视觉-语言模型(VLM)表现出强大的推理能力,显示出在端到端自动驾驶系统中的潜力。链式思考(CoT),作为VLM广泛使用的推理策略,正面临关键挑战。现有的文本CoT在文本语义空间和轨迹物理空间之间存在巨大差距。尽管最近的方法利用未来图像来替代文本作为CoT过程,但缺乏明确的面向规划的目标指导,以生成具有准确场景演化的图像。为了解决这些问题,我们创新地提出了MindDriver,这是一种渐进多模态推理框架,使VLM能够模仿人类的渐进思考方式以实现自动驾驶。MindDriver展示了语义理解、语义到物理空间的想象以及物理空间轨迹规划。为了在MindDriver中实现对齐的推理过程,我们开发了一种基于反馈的自动数据标注流水线,以生成对齐的多模态推理训练数据。此外,我们还开发了一种渐进强化微调方法,通过渐进的高阶奖励学习来优化对齐。MindDriver在nuScenes开环和Bench2Drive闭环评估中均表现出优越的性能。代码可在https://github.com/hotdogcheesewhite/MindDriver/获取。
Summary / 总结
MindDriver introduces a progressive multimodal reasoning framework for autonomous driving, addressing the limitations of existing textual Chain-of-Thought methods. It combines semantic understanding, imagination of the physical space, and physical-space trajectory planning. MindDriver uses a feedback-guided automatic data annotation pipeline and a progressive reinforcement fine-tuning method to optimize alignment. Experimental results show superior performance in nuScenes open-loop and Bench2Drive closed-loop evaluations.
MindDriver 提出了一种渐进多模态推理框架,以解决现有视觉-语言模型在语义空间和轨迹物理空间之间的差距问题。它使用反馈引导的自动数据标注流水线和渐进强化微调方法来对齐推理过程,并在 nuScenes 和 Bench2Drive 评估中表现出色。
JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models
Authors: Jiaxin Song, Yixu Wang, Jie Li, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang
Venue: NeurIPS 2025
First: 2025-05-26T07:23:00+00:00 · Latest: 2026-02-25T13:06:15+00:00
Comments: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Vision-Language Models (VLMs) exhibit impressive performance, yet the integration of powerful vision encoders has significantly broadened their attack surface, rendering them increasingly susceptible to jailbreak attacks. However, lacking well-defined attack objectives, existing jailbreak methods often rely on gradient-based strategies that are prone to local optima and lack precise directional guidance, and they typically decouple visual and textual modalities, which limits their effectiveness by neglecting crucial cross-modal interactions. Inspired by the Eliciting Latent Knowledge (ELK) framework, we posit that VLMs encode safety-relevant information within their internal fusion-layer representations, revealing an implicit safety decision boundary in the latent space. This motivates exploiting this boundary to steer model behavior. Accordingly, we propose JailBound, a novel latent space jailbreak framework comprising two stages: (1) Safety Boundary Probing, which addresses the guidance issue by approximating the decision boundary within the fusion layer's latent space, thereby identifying optimal perturbation directions towards the target region; and (2) Safety Boundary Crossing, which overcomes the limitations of decoupled approaches by jointly optimizing adversarial perturbations across both image and text inputs. This latter stage employs an innovative mechanism to steer the model's internal state towards policy-violating outputs while maintaining cross-modal semantic consistency. Extensive experiments on six diverse VLMs demonstrate JailBound's efficacy: it achieves average attack success rates of 94.32% (white-box) and 67.28% (black-box), which are 6.17% and 21.13% higher than SOTA methods, respectively. Our findings expose an overlooked safety risk in VLMs and highlight the urgent need for more robust defenses. Warning: This paper contains potentially sensitive, harmful and offensive content.
中文标题/摘要
标题:JailBound:破解视觉语言模型内部安全边界
视觉语言模型(VLMs)表现出色,但强大的视觉编码器的集成显著扩大了其攻击面,使其更容易受到破解攻击。然而,由于缺乏明确的攻击目标,现有的破解方法往往难以使用基于梯度的策略,这些策略容易陷入局部最优,缺乏精确的方向指导,并且通常会将视觉和文本模态分离,从而通过忽略关键的跨模态交互来限制其有效性。受ELK框架的启发,我们认为VLMs在其内部融合层表示中编码了与安全相关的信息,揭示了潜在的安全决策边界在潜在空间中的存在。这激发了利用边界引导模型行为的想法。因此,我们提出了JailBound,这是一种新颖的潜在空间破解框架,包括两个阶段:(1)安全边界探查,通过近似融合层潜在空间中的决策边界来解决指导问题,从而识别出通向目标区域的最佳扰动方向;(2)安全边界穿越,通过联合优化图像和文本输入的对抗性扰动来克服分离方法的局限性。这一阶段采用了一种创新机制,引导模型内部状态向政策违规输出转变,同时保持跨模态语义一致性。在六个不同VLMs上的广泛实验表明,JailBound的有效性,平均白盒攻击成功率94.32%,黑盒攻击成功率67.28%,分别比最先进的方法高出6.17%和21.13%。我们的研究揭示了VLMs中被忽视的安全风险,并强调了更稳健防御的迫切需求。警告:本文包含可能具有敏感性、危害性和冒犯性的内容。
Summary / 总结
JailBound is a novel framework for jailbreaking Vision-Language Models (VLMs) by exploiting the internal safety decision boundary in the latent space. It consists of two stages: Safety Boundary Probing and Safety Boundary Crossing. The former approximates the decision boundary to identify optimal perturbation directions, while the latter jointly optimizes adversarial perturbations across both image and text inputs to maintain cross-modal semantic consistency. Experiments show JailBound achieves higher success rates than state-of-the-art methods, with 94.32% white-box and 67.28% black-box attack success rates, indicating a significant safety risk in VLMs and the need for robust defenses.
JailBound 是一种新颖的框架,旨在利用 Vision-Language 模型(VLMs)的内部安全边界执行 jailbreak 攻击。该框架使用 Eliciting Latent Knowledge (ELK) 框架在潜在空间中探测和跨越安全边界。JailBound 包含两个阶段:安全边界探针和安全边界穿越。前者识别最优扰动方向,后者则在图像和文本输入之间联合优化对抗扰动,确保跨模态语义一致性。实验结果表明,JailBound 的成功率高于现有最先进的方法,白盒攻击成功率为 94.32%,黑盒攻击成功率为 67.28%,分别高出 6.17% 和 21.13%。
How to Take a Memorable Picture? Empowering Users with Actionable Feedback
Authors: Francesco Laiti, Davide Talon, Jacopo Staiano, Elisa Ricci
Venue: CVPR 2026
First: 2026-02-25T13:02:35+00:00 · Latest: 2026-02-25T13:02:35+00:00
Comments: Accepted @ CVPR 2026. Project page: https://laitifranz.github.io/MemCoach/
Abstract
Image memorability, i.e., how likely an image is to be remembered, has traditionally been studied in computer vision either as a passive prediction task, with models regressing a scalar score, or with generative methods altering the visual input to boost the image's likelihood of being remembered. Yet, none of these paradigms supports users at capture time, when the crucial question is how to improve a photo's memorability. We introduce the task of Memorability Feedback (MemFeed), where an automated model should provide actionable, human-interpretable guidance to users with the goal of enhancing an image's future recall. We also present MemCoach, the first approach designed to provide concrete suggestions in natural language for memorability improvement (e.g., "emphasize facial expression," "bring the subject forward"). Our method, based on Multimodal Large Language Models (MLLMs), is training-free and employs a teacher-student steering strategy, aligning the model's internal activations toward more memorable patterns learned from a teacher model progressing along least-to-most memorable samples. To enable systematic evaluation on this novel task, we further introduce MemBench, a new benchmark featuring sequence-aligned photoshoots with annotated memorability scores. Our experiments, considering multiple MLLMs, demonstrate the effectiveness of MemCoach, showing consistently improved performance over several zero-shot models. The results indicate that memorability can not only be predicted but also taught and instructed, shifting the focus from mere prediction to actionable feedback for human creators.
中文标题/摘要
标题:如何拍摄令人难忘的照片?赋予用户可操作反馈
图像的难忘性,即图像被记住的可能性,传统上在计算机视觉中要么作为被动预测任务进行研究,模型回归一个标量分数,要么使用生成方法改变视觉输入以提高图像被记住的可能性。然而,这些范式在拍摄时并不支持用户,关键问题是如何提高照片的难忘性。我们引入了难忘性反馈(MemFeed)任务,其中自动化模型应提供可操作的、人类可理解的指导,以提高图像未来回忆的可能性。我们还提出了MemCoach,这是第一个为难忘性改进提供具体自然语言建议的方法(例如,“强调面部表情”,“将主题置于前景”)。我们的方法基于多模态大型语言模型(MLLMs),无需训练,并采用教师-学生引导策略,使模型内部激活朝向更难忘的模式对齐,这些模式是由教师模型沿着从最不难忘到最难忘的样本逐步学习得到的。为了在这一新任务上进行系统评估,我们进一步引入了MemBench,这是一个新的基准,包含序列对齐的照片拍摄,并附有标注的难忘性评分。我们的实验,考虑了多个MLLMs,证明了MemCoach的有效性,显示出在几个零样本模型上的一致改进。结果表明,难忘性不仅可以被预测,也可以被教授和指导,将重点从单纯的预测转向对人类创作者的可操作反馈。
Summary / 总结
The paper introduces the task of Memorability Feedback (MemFeed), where an automated model provides actionable guidance to users to enhance the memorability of their photos. The method, MemCoach, uses Multimodal Large Language Models (MLLMs) and a teacher-student steering strategy to generate natural language suggestions like 'emphasize facial expression.' Experiments show that MemCoach outperforms several zero-shot models, demonstrating the potential to teach and instruct users on how to improve photo memorability, shifting the focus from prediction to actionable feedback.
论文提出了记忆反馈(MemFeed)任务,旨在为用户提供增强照片记忆性的具体建议。作者介绍了MemCoach方法,使用多模态大型语言模型在自然语言中提供具体的改进建议。实验表明,MemCoach在多个零样本模型中表现出更优的效果,展示了通过教学和指导来提升记忆性的潜力,而不仅仅是简单的预测,为创作者提供了可操作的反馈。
Mitigating Semantic Collapse in Generative Personalization with Test-Time Embedding Adjustment
Authors: Anh Bui, Trang Vu, Trung Le, Junae Kim, Tamas Abraham, Rollin Omari, Amar Kaur, Dinh Phung
First: 2025-06-27T23:40:27+00:00 · Latest: 2026-02-25T12:55:13+00:00
Abstract
In this paper, we investigate the semantic collapsing problem in generative personalization, an under-explored topic where the learned visual concept ($V$) gradually shifts from its original textual meaning and comes to dominate other concepts in multi-concept input prompts. This issue not only reduces the semantic richness of complex input prompts like "a photo of $V$ wearing glasses and playing guitar" into simpler, less contextually rich forms such as "a photo of $V$" but also leads to simplified output images that fail to capture the intended concept. We identify the root cause as unconstrained optimisation, which allows the learned embedding $V$ to drift arbitrarily in the embedding space, both in direction and magnitude. To address this, we propose a simple yet effective training-free method that adjusts the magnitude and direction of pre-trained embedding at inference time, effectively mitigating the semantic collapsing problem. Our method is broadly applicable across different personalization methods and demonstrates significant improvements in text-image alignment in diverse use cases. Our code is anonymously published at https://github.com/tuananhbui89/Embedding-Adjustment
中文标题/摘要
标题:生成个性化中的语义坍塌缓解方法:测试时嵌入调整
在本文中,我们探讨了生成个性化中的语义坍塌问题,这是一个尚未充分研究的主题,其中学习到的视觉概念($V$)逐渐偏离其原始文本含义,并在多概念输入提示中主导其他概念。这一问题不仅将复杂的输入提示如“一张戴着眼镜弹吉他的$V$的照片”简化为更简单、上下文更贫乏的形式“一张$V$的照片”,还导致输出图像未能捕捉到预期的概念。我们确定其根本原因是不受约束的优化,这使得学习到的嵌入$V$在嵌入空间中任意地在方向和幅度上漂移。为了解决这一问题,我们提出了一种简单而有效的无需训练的方法,在推理时调整预训练嵌入的幅度和方向,有效地缓解了语义坍塌问题。我们的方法适用于不同的个性化方法,并在多种应用场景中显著提高了文本-图像对齐。我们的代码匿名发布在https://github.com/tuananhbui89/Embedding-Adjustment
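To make the idea of adjusting an embedding's magnitude and direction at inference time concrete, the sketch below rescales a drifted embedding to the norm of a reference embedding and rotates its direction part of the way toward that reference. The use of spherical interpolation, the choice of reference (for example, the embedding of the concept's class word), and the `alpha` parameter are assumptions for illustration; the abstract does not specify the exact adjustment rule.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def slerp(a, b, t):
    """Spherical interpolation between unit vectors a and b, with t in [0, 1]."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)
    sin_omega = math.sin(omega)
    if sin_omega < 1e-6:
        # (Anti)parallel vectors: the rotation is undefined or trivial, keep a unchanged.
        return list(a)
    wa = math.sin((1.0 - t) * omega) / sin_omega
    wb = math.sin(t * omega) / sin_omega
    return [wa * x + wb * y for x, y in zip(a, b)]


def adjust_embedding(learned, reference, alpha=0.5):
    """Hypothetical adjustment: match the reference norm, rotate the direction by alpha."""
    learned_norm, ref_norm = norm(learned), norm(reference)
    unit_learned = [x / learned_norm for x in learned]
    unit_ref = [x / ref_norm for x in reference]
    direction = slerp(unit_learned, unit_ref, alpha)
    return [ref_norm * x for x in direction]


learned = [6.0, 8.0]    # drifted embedding with inflated norm (illustrative values)
reference = [1.0, 0.0]  # reference embedding, e.g. the concept's class word
adjusted = adjust_embedding(learned, reference, alpha=0.5)
```

With `alpha=0` only the magnitude is corrected; with `alpha=1` the embedding collapses onto the reference, so intermediate values trade off concept identity against alignment with the rest of the prompt.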
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Authors: Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu, James Kwok, Yu Zhang
Venue: CVPR 2026
First: 2026-02-25T12:45:45+00:00 · Latest: 2026-02-25T12:45:45+00:00
Comments: CVPR 2026
Abstract
Vision-Language Models (VLMs) have emerged as versatile solutions for zero-shot question answering (QA) across various domains. However, enabling VLMs to effectively comprehend structured graphs and perform accurate, efficient QA remains challenging. Existing approaches typically rely on one single graph topology representation (GTR), such as fixed-style visual images or unified text descriptions. This "one-size-fits-all" strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries. To address this, we propose the DynamicGTR framework, which dynamically selects the optimal GTR for each query during inference, thereby enhancing the zero-shot graph QA capabilities of VLMs with a customizable accuracy and brevity trade-off. Extensive experiments show that DynamicGTR not only improves VLM-based graph algorithm QA performance but also successfully transfers the experience trained from synthetic graph algorithm tasks to real-world applications like link prediction and node classification, without any additional training. Additionally, DynamicGTR demonstrates strong transferability across tasks, domains, and models, suggesting its potential as a flexible solution for broad graph scenarios.
中文标题/摘要
标题:DynamicGTR:利用图拓扑表示偏好增强VLM在图QA能力
视觉-语言模型(VLMs)已成为跨各种领域进行零样本问答(QA)的多功能解决方案。然而,使VLMs能够有效地理解结构化图并进行准确、高效的QA仍然具有挑战性。现有方法通常依赖单一的图拓扑表示(GTR),如固定风格的视觉图像或统一的文字描述。这种“一刀切”的策略往往忽视了模型特定和任务特定的偏好,导致对图相关查询的回答不准确或过长。为了解决这个问题,我们提出了DynamicGTR框架,该框架在推理过程中动态选择每个查询的最佳GTR,从而通过可定制的准确性和简洁性权衡来增强VLM的零样本图QA能力。大量实验表明,DynamicGTR不仅提高了基于VLM的图算法QA性能,还成功地将从合成图算法任务中训练的经验转移到了如链接预测和节点分类等实际应用中,无需额外训练。此外,DynamicGTR在任务、领域和模型之间表现出强大的可迁移性,表明其作为广泛图场景的灵活解决方案的潜力。
Summary / 总结
The paper introduces DynamicGTR, a framework that dynamically selects the most suitable graph topology representation for each query during inference, enhancing the zero-shot graph QA capabilities of Vision-Language Models. Experiments show that DynamicGTR improves performance in graph algorithm QA and successfully transfers knowledge from synthetic tasks to real-world applications like link prediction and node classification, without additional training. It also exhibits strong transferability across tasks, domains, and models.
论文提出了DynamicGTR框架,该框架在推理过程中动态选择最适合的图拓扑表示,以提升Vision-Language Models (VLMs)在零样本图问答中的能力。实验表明,DynamicGTR不仅提升了图算法问答的表现,还能将从合成任务中学到的知识成功转移到链接预测和节点分类等实际应用中,无需额外训练。此外,它在不同任务、领域和模型之间表现出强大的可迁移性。
Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy
Authors: Patrick Wienholt, Sophie Caselitz, Robert Siepmann, Philipp Bruners, Keno Bressem, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn
Venue: Eur Radiol (2026)
First: 2025-10-10T10:53:33+00:00 · Latest: 2026-02-25T12:16:45+00:00
Comments: Code is available: https://github.com/TruhnLab/VisionSemanticEntropy
Abstract
To determine whether using discrete semantic entropy (DSE) to reject questions likely to generate hallucinations can improve the accuracy of black-box vision-language models (VLMs) in radiologic image based visual question answering (VQA). This retrospective study evaluated DSE using two publicly available, de-identified datasets: the VQA-Med 2019 benchmark (500 images with clinical questions and short-text answers) and a diagnostic radiology dataset (206 cases: 60 computed tomography scans, 60 magnetic resonance images, 60 radiographs, 26 angiograms) with corresponding ground-truth diagnoses. GPT-4o and GPT-4.1 (Generative Pretrained Transformer; OpenAI) answered each question 15 times using a temperature of 1.0. Baseline accuracy was determined using low-temperature answers (temperature 0.1). Meaning-equivalent responses were grouped using bidirectional entailment checks, and DSE was computed from the relative frequencies of the resulting semantic clusters. Accuracy was recalculated after excluding questions with DSE > 0.6 or > 0.3. p-values and 95% confidence intervals were obtained using bootstrap resampling and a Bonferroni-corrected threshold of p < .004 for statistical significance. Across 706 image-question pairs, baseline accuracy was 51.7% for GPT-4o and 54.8% for GPT-4.1. After filtering out high-entropy questions (DSE > 0.3), accuracy on the remaining questions was 76.3% (retained questions: 334/706) for GPT-4o and 63.8% (retained questions: 499/706) for GPT-4.1 (both p < .001). Accuracy gains were observed across both datasets and largely remained statistically significant after Bonferroni correction. DSE enables reliable hallucination detection in black-box VLMs by quantifying semantic inconsistency. This method significantly improves diagnostic answer accuracy and offers a filtering strategy for clinical VLM applications.
中文标题/摘要
标题:使用离散语义熵过滤放射学视觉语言模型中的幻觉
研究使用离散语义熵(DSE)来拒绝可能生成幻觉的问题是否能提高黑盒视觉语言模型(VLMs)在基于放射学图像的视觉问答(VQA)中的准确性。回顾性研究使用两个公开的、去标识化的数据集评估DSE:VQA-Med 2019基准(500张图像和临床问题及简短文本答案)和一个诊断放射学数据集(206例病例:60例CT扫描,60例MRI,60例X光片,26例血管造影),并附有相应的金标准诊断。GPT-4o和GPT-4.1(生成预训练变换器;OpenAI)使用温度1.0回答每个问题15次。基线准确性使用温度0.1的答案确定。语义等效的回答通过双向蕴含检查进行分组,并根据所得语义簇的相对频率计算DSE。在排除DSE > 0.6或 > 0.3的问题后重新计算准确性。使用自助重采样获得p值和95%置信区间,并使用Bonferroni校正的阈值p < .004进行统计显著性检验。在706个图像-问题对中,GPT-4o的基线准确性为51.7%,GPT-4.1为54.8%。在排除高熵问题(DSE > 0.3)后,GPT-4o在剩余问题上的准确性为76.3%(保留问题:334/706),GPT-4.1为63.8%(保留问题:499/706)(两者p < .001)。在两个数据集上均观察到准确性提升,并且在Bonferroni校正后大部分仍然具有统计显著性。DSE通过量化语义不一致来在黑盒VLM中可靠地检测幻觉。该方法显著提高了诊断答案的准确性,并为临床VLM应用提供了一种过滤策略。
Summary / 总结
This study aimed to evaluate the effectiveness of using discrete semantic entropy (DSE) to filter out questions likely to generate hallucinations in black-box vision-language models (VLMs) for radiologic image-based visual question answering (VQA). The research used two datasets: VQA-Med 2019 and a diagnostic radiology dataset. After filtering out questions with DSE > 0.3, the accuracy of GPT-4o and GPT-4.1 increased to 76.3% and 63.8%, respectively, both showing statistically significant improvements (p < .001).
该研究旨在评估使用离散语义熵(DSE)是否能提高黑盒视觉语言模型(VLMs)在放射学图像基础视觉问答(VQA)中的准确性。通过应用DSE过滤出可能产生幻觉的问题,研究发现,在排除高熵问题(DSE > 0.3)后,GPT-4o和GPT-4.1的准确率分别提高到76.3%和63.8%,而基线准确率分别为51.7%和54.8%。该方法在两个数据集上得到了验证,并在Bonferroni校正后仍保持统计显著性。
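The DSE filter described in the entry above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not state the logarithm base or any normalization, so natural-log Shannon entropy is an assumption, and `entails` is a hypothetical stand-in for the bidirectional entailment check (here a toy string comparison).

```python
import math
from collections import Counter


def discrete_semantic_entropy(cluster_ids):
    """Shannon entropy (natural log, an assumption) of semantic-cluster frequencies."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def cluster_by_entailment(answers, entails):
    """Greedy clustering: an answer joins the first cluster whose representative
    it entails and is entailed by (bidirectional entailment)."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def keep_question(answers, entails, threshold=0.3):
    """Return True if the question survives the filter (DSE not above threshold)."""
    clusters = cluster_by_entailment(answers, entails)
    ids = [i for i, cluster in enumerate(clusters) for _ in cluster]
    return discrete_semantic_entropy(ids) <= threshold


# Hypothetical entailment check: case-insensitive string equality.
toy_entails = lambda a, b: a.strip().lower() == b.strip().lower()
```

Under these assumptions, 15 sampled answers that split 14/1 across two clusters give an entropy of about 0.24 and pass the 0.3 threshold, while a 13/2 split gives about 0.39 and is rejected.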
DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion
Authors: Marcel Lamott, Saifullah Saifullah, Nauman Riaz, Yves-Noel Weweler, Tobias Alt-Veit, Ahmad Sarmad Ali, Muhammad Armaghan Shakir, Adrian Kalwa, Momina Moetesum, Andreas Dengel, Sheraz Ahmed, Faisal Shafait, Ulrich Schwanecke, Adrian Ulges
First: 2026-02-25T11:52:13+00:00 · Latest: 2026-02-25T11:52:13+00:00
Abstract
Effective document intelligence models rely on large amounts of annotated training data. However, procuring sufficient and high-quality data poses significant challenges due to the labor-intensive and costly nature of data acquisition. Additionally, leveraging language models to annotate real documents raises concerns about data privacy. Synthetic document generation has emerged as a promising, privacy-preserving alternative. We propose DocDjinn, a novel framework for controllable synthetic document generation using Vision-Language Models (VLMs) that produces annotated documents from unlabeled seed samples. Our approach generates visually plausible and semantically consistent synthetic documents that follow the distribution of an existing source dataset through clustering-based seed selection with parametrized sampling. By enriching documents with realistic diffusion-based handwriting and contextual visual elements via semantic-visual decoupling, we generate diverse, high-quality annotated synthetic documents. We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis. To our knowledge, this is the first work demonstrating that VLMs can generate faithful annotated document datasets at scale from unlabeled seeds that can effectively enrich or approximate real, manually annotated data for diverse document understanding tasks. We show that with only 100 real training samples, our framework achieves on average 87% of the performance of the full real-world dataset. We publicly release our code and 140k+ synthetic document samples.
中文标题/摘要
标题:DocDjinn:使用VLM和手写扩散的可控合成文档生成
有效的文档智能模型依赖于大量标注的训练数据。然而,获取足够的高质量数据面临着劳动密集型和成本高昂的挑战。此外,利用语言模型对真实文档进行标注会引发数据隐私方面的担忧。合成文档生成作为一种有前景的隐私保护替代方案已经出现。我们提出了DocDjinn,一种使用视觉-语言模型(VLM)的新型框架,可以从未标注的种子样本生成标注文档。我们的方法通过基于聚类的种子选择和参数化采样生成视觉上可信且语义上一致的合成文档,使其遵循现有源数据集的分布。通过语义-视觉解耦,我们丰富了文档中的现实主义扩散手写和上下文视觉元素,生成多样且高质量的标注合成文档。我们在涵盖关键信息提取、问答、文档分类和文档布局分析的十一个基准上进行了评估。据我们所知,这是首次证明VLM可以从未标注种子生成忠实的标注文档数据集的工作,这些数据集可以有效地丰富或近似真实的手动标注数据,以用于各种文档理解任务。我们展示了,仅使用100个真实训练样本,我们的框架在平均性能上达到了全真实世界数据集的87%。我们公开发布了我们的代码和超过140,000个合成文档样本。
Summary / 总结
DocDjinn is a framework for generating controllable synthetic documents using Vision-Language Models (VLMs) and handwriting diffusion. It addresses the challenge of acquiring large amounts of annotated training data by producing annotated documents from unlabeled seed samples. The method uses clustering-based seed selection and parametrized sampling to generate visually plausible and semantically consistent synthetic documents. Experimental results show that, with just 100 real training samples, DocDjinn achieves on average 87% of the performance obtained with the full real-world dataset across various document understanding tasks.
DocDjinn 是一个使用视觉语言模型(VLMs)和手写扩散生成可控合成文档的框架。该方法通过基于聚类的种子选择和参数化采样生成视觉上可信且语义一致的合成文档,以解决获取大量标注训练数据的挑战。实验结果表明,仅使用100个真实训练样本,DocDjinn 平均可以达到全真实世界数据集87%的性能,证明了其在多种文档理解任务中的有效性。
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Authors: Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj
First: 2026-02-18T21:28:56+00:00 · Latest: 2026-02-25T11:49:07+00:00
Abstract
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. MALLVI presents a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVI coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI
中文标题/摘要
标题:MALLVI:一种综合机器人操作的多智能体框架
使用大型语言模型(LLMs)进行机器人操作的任务规划是一个新兴领域。先前的方法依赖于专门的模型、微调或提示调优,并且通常以开环方式运行,缺乏稳健的环境反馈,使其在动态环境中变得脆弱。MALLVI 提出了一种多智能体大型语言和视觉框架,能够实现闭环反馈驱动的机器人操作。给定自然语言指令和环境图像,MALLVI 生成可执行的原子动作供机器人执行。执行动作后,视觉语言模型(VLM)评估环境反馈并决定是否重复该过程或进行下一步。MALLVI 不使用单一模型,而是协调分解器、定位器、思考者和反思者等专门智能体来管理感知、定位、推理和高级规划。可选的描述者智能体提供初始状态的视觉记忆。反思者通过重新激活相关智能体来支持有针对性的错误检测和恢复,避免全面重新规划。在模拟和真实世界设置中的实验表明,迭代闭环多智能体协调可以提高泛化能力并增加零样本操作任务的成功率。代码可在 https://github.com/iman1234ahmadi/MALLVI 获取。
Summary / 总结
MALLVI is a multi-agent framework that uses large language models and vision to enable closed-loop feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVI generates executable actions and uses a Vision Language Model to evaluate environmental feedback, deciding whether to repeat actions or proceed. Experiments show that this iterative closed-loop approach improves generalization and success rates in zero-shot manipulation tasks.
MALLVI 是一个多代理框架,利用大型语言模型和视觉技术实现闭环反馈驱动的机器人操作。给定自然语言指令和图像后,MALLVI 生成可执行的动作,并使用视觉语言模型评估环境反馈,决定是否重复或继续。该框架包括感知、定位、推理和规划等专门代理,可选的视觉记忆代理提供初始状态的视觉记忆。实验表明,迭代闭环协调可以提高零样本操作任务的成功率和泛化能力。
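The execute-then-verify loop described in the entry above can be sketched as follows. This is an illustrative skeleton only: `execute` and `verify` are hypothetical callables standing in for the robot controller and the VLM feedback check, `max_retries` is an assumed parameter, and the Reflector's selective reactivation of agents is not modeled.

```python
def run_closed_loop(steps, execute, verify, max_retries=3):
    """Run atomic actions in order; after each one, ask a verifier whether it
    succeeded and repeat the action on failure (up to max_retries attempts)."""
    log = []
    for step in steps:
        for attempt in range(1, max_retries + 1):
            execute(step)                 # hypothetical: send action to the robot
            ok = verify(step)             # hypothetical: VLM judges the new scene
            log.append((step, attempt, ok))
            if ok:
                break                     # proceed to the next step
        else:
            return False, log             # retries exhausted for this step
    return True, log
```

An open-loop planner would correspond to calling `execute` once per step and never consulting `verify`, which is the fragility the abstract attributes to prior approaches.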
Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Authors: Mingyu Cao, Alvaro H. C. Correia, Christos Louizos, Shiwei Liu, Lu Yin
First: 2026-02-11T15:41:09+00:00 · Latest: 2026-02-25T11:16:34+00:00
Comments: 11 pages, 8 figures
Abstract
Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions. Yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model's uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding. Our code is available at https://github.com/duterscmy/SOAR
中文标题/摘要
标题:搜索或加速:基于置信度切换的位置光束搜索算法用于扩散语言模型
扩散语言模型(DLMs)通过迭代去噪掩蔽序列来生成文本,在每一步重复决定要提交的位置。标准解码遵循贪婪规则:去掩蔽最自信的位置,但这种局部选择可能会使模型陷入次优的去掩蔽顺序,尤其是在需要大量推理的提示上。我们提出了SOAR,一种无需训练的解码算法,能够根据模型的不确定性调整其行为。当置信度较低时,SOAR会暂时扩大搜索范围,避免过早的承诺;当置信度较高时,它会缩小搜索范围并并行解码多个位置,以减少去噪迭代次数。在数学推理和代码生成基准测试(GSM8K、MBPP、HumanEval)上,SOAR在Dream-7B和LLaDA-8B上提高了生成质量,同时保持了竞争力的推理速度,提供了一种在DLM解码中平衡质量和效率的实用方法。我们的代码可在https://github.com/duterscmy/SOAR获取。
Summary / 总结
The paper introduces SOAR, a decoding algorithm for Diffusion Language Models that adapts based on the model's confidence. When confidence is low, SOAR expands the search space to avoid premature commitments; when confidence is high, it narrows the search and accelerates decoding. SOAR improves generation quality on benchmarks like GSM8K, MBPP, and HumanEval while maintaining competitive inference speed, offering a practical balance between quality and efficiency in DLM decoding.
论文提出了SOAR,一种基于模型不确定性调整的Diffusion Language Models解码算法。在低置信度阶段,SOAR扩展搜索空间以避免过早承诺;而在高置信度阶段,它缩小搜索范围并并行解码以减少去噪迭代次数。SOAR在GSM8K、MBPP和HumanEval等基准测试中提高了生成质量,同时保持了竞争力的推理速度,提供了一种在质量和效率之间取得平衡的实际方法。
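The confidence switch described in the entry above can be illustrated with a toy decision rule. This is a sketch under stated assumptions, not the SOAR algorithm itself: the threshold `tau`, the `beam_width`, and the rule "accelerate if any position clears the threshold" are all hypothetical choices made for illustration.

```python
def soar_like_step(confidences, tau=0.9, beam_width=2):
    """Decide what to unmask in one denoising step.

    confidences: dict mapping masked position -> confidence of the model's
    top token at that position. Returns (mode, branches), where each branch
    is a list of positions to unmask in that branch.
    """
    if not confidences:
        return "done", []
    confident = sorted(p for p, c in confidences.items() if c >= tau)
    if confident:
        # High confidence: a single branch commits many positions in parallel.
        return "accelerate", [confident]
    # Low confidence: branch over the best few alternative unmasking choices.
    ranked = sorted(confidences, key=lambda p: (-confidences[p], p))
    return "search", [[p] for p in ranked[:beam_width]]
```

In a full decoder each returned branch would be scored and pruned on later steps; that bookkeeping is omitted here.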
Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models
Authors: Zheyuan Gu, Qingsong Zhao, Yusong Wang, Zhaohong Huang, Xinqi Li, Cheng Yuan, Jiaowei Shao, Chi Zhang, Xuelong Li
Venue: CVPR 2026
First: 2026-02-25T10:54:55+00:00 · Latest: 2026-02-25T10:54:55+00:00
Comments: 16 pages, 9 figures. Submitted to CVPR 2026
Abstract
Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.
中文标题/摘要
标题:超越静态伪迹:视觉语言模型中视频深伪推理的取证基准
当前的视觉-语言模型(VLMs)在深伪检测方面擅长识别空间伪迹,但忽视了一个关键维度:视频伪造中的时间不一致性。将VLMs适应于推理这些动态线索仍是一个独特的挑战。为了弥合这一差距,我们提出了法医问答(FAQ),这是一个大规模基准,将时间深伪分析表述为一个多项选择任务。FAQ引入了三层结构,逐步评估和装备VLMs的法医能力:(1)面部感知,测试识别静态视觉伪迹的能力;(2)时间深伪定位,要求在帧间定位动态伪造伪迹;(3)法医推理,挑战模型综合证据以得出最终的真伪裁决。我们对FAQ评估了一系列VLMs,并生成了相应的指令调优集FAQ-IT。广泛的实验表明,使用FAQ-IT微调的模型在领域内和跨数据集检测基准上均表现出高级性能。消融研究进一步验证了我们关键设计选择的影响,确认FAQ是这些VLMs时间推理能力的驱动力。
Summary / 总结
The research aims to improve deepfake detection by addressing the limitation of current Vision-Language Models (VLMs) in identifying temporal inconsistencies. It introduces Forensic Answer-Questioning (FAQ), a benchmark that evaluates VLMs on three levels: Facial Perception, Temporal Deepfake Grounding, and Forensic Reasoning. The study finds that models fine-tuned on FAQ-IT, a corresponding instruction-tuning set, perform well on both in-domain and cross-dataset detection benchmarks, validating the effectiveness of the proposed approach. Ablation studies confirm the importance of the benchmark's design choices in enhancing temporal reasoning capabilities of VLMs.
研究旨在通过关注视频中的时间不一致性来提升深度伪造检测能力,这是当前视觉语言模型(VLMs)所忽略的。研究引入了法医问答(FAQ)基准,评估VLMs在面部感知、时间伪造定位和法医推理三个层次上的表现。实验表明,经过FAQ-IT数据集微调的模型在领域内和跨数据集检测基准上表现出色,突显了时间推理在深度伪造检测中的重要性。消融研究进一步验证了FAQ基准在开发具有时间推理能力的VLMs中的有效性。
LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs
Authors: Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach
First: 2026-01-31T02:33:07+00:00 · Latest: 2026-02-25T10:06:33+00:00
Comments: Updates: small change in interpretability percentage for Qwen-based variants we trained (pre-processing fix), clarification in Section 3 on our method (after feedback from readers), additional appendix section
Abstract
Transforming a large language model (LLM) into a Vision-Language Model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens works by encoding a large text corpus and storing contextualized token representations for each token in that corpus. Visual token representations are then compared to their contextualized textual representations, with the top-k nearest neighbor representations providing descriptions of the visual token. We evaluate this method on 10 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations, opening up new directions for analyzing latent representations.
中文标题/摘要
标题:LatentLens:揭示LLM中的高度可解释视觉标记
将大型语言模型(LLM)转换为视觉语言模型(VLM)可以通过将视觉编码器的视觉标记映射到LLM的嵌入空间来实现。有趣的是,这种映射可以简单到一个浅层MLP变换。为了理解为什么LLM能够如此轻易地处理视觉标记,我们需要可解释的方法来揭示LLM处理过程中每一层视觉标记表示中编码的内容。在本文中,我们介绍了LatentLens,这是一种将潜在表示映射到自然语言描述的新方法。LatentLens通过编码大量文本语料库并为语料库中的每个标记存储上下文化标记表示来工作。然后将视觉标记表示与其上下文化的文本表示进行比较,前k个最近邻表示提供视觉标记的描述。我们在10种不同的VLM上评估了该方法,表明常用方法,如LogitLens,严重低估了视觉标记的可解释性。使用LatentLens后,所有研究模型和所有层中的大多数视觉标记都是可解释的。定性上,我们展示了LatentLens生成的描述具有语义意义,并且与单个标记相比,提供了更精细的解释。更广泛地说,我们的发现为视觉和语言表示之间的对齐提供了新的证据,为分析潜在表示开辟了新的方向。
Summary / 总结
LatentLens is a novel approach for revealing the interpretability of visual tokens in large language models (LLMs) by mapping latent representations to natural language descriptions. The authors evaluate the method on 10 different Vision-Language Models (VLMs) and show that commonly used methods such as LogitLens substantially underestimate the interpretability of visual tokens, whereas LatentLens reveals that the majority of visual tokens are interpretable across all studied models and layers. The descriptions generated by LatentLens are semantically meaningful and provide more fine-grained interpretations than individual tokens.
LatentLens 是一种将潜在的视觉令牌表示映射到自然语言描述的方法,旨在揭示大型语言模型(LLMs)中视觉令牌的可解释性。它通过编码大量文本语料库来存储上下文化的令牌表示,然后将这些表示与视觉令牌表示进行比较以提供描述。在10种不同的视觉语言模型(VLMs)上的评估表明,LatentLens 比现有方法如 LogitLens 更优,大多数视觉令牌在所有层中都是可解释的。LatentLens 生成的描述具有语义意义,并且相比单个令牌提供了更详细的解释。
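The nearest-neighbor lookup described in the entry above can be sketched as follows. This is a minimal illustration with toy two-dimensional vectors: cosine similarity is an assumed metric (the abstract only says "nearest neighbor"), and `text_bank` is a hypothetical in-memory store of contextualized text representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_descriptions(visual_vec, text_bank, k=3):
    """Return the k stored text snippets whose representations are closest
    to the visual token representation.

    text_bank: list of (description, vector) pairs (hypothetical store).
    """
    ranked = sorted(text_bank, key=lambda item: cosine(visual_vec, item[1]), reverse=True)
    return [description for description, _ in ranked[:k]]


# Toy bank and query, purely illustrative.
bank = [("a dog", [1.0, 0.0]), ("a red car", [0.0, 1.0]), ("a puppy", [0.9, 0.1])]
nearest = top_k_descriptions([1.0, 0.05], bank, k=2)
```

A real corpus would require an approximate nearest-neighbor index rather than a full sort, but the retrieval logic is the same.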
Measuring the Measurers: Quality Evaluation of Hallucination Benchmarks for Large Vision-Language Models
Authors: Bei Yan, Jie Zhang, Zheng Yuan, Shiguang Shan, Xilin Chen
First: 2024-06-24T20:08:07+00:00 · Latest: 2026-02-25T09:48:59+00:00
Abstract
Despite the outstanding performance in multimodal tasks, Large Vision-Language Models (LVLMs) have been plagued by the issue of hallucination, i.e., generating content that is inconsistent with the corresponding visual inputs. While previous works have proposed various benchmarks to evaluate this issue, the quality of these evaluations remains unverified. We observe that some of these benchmarks may produce inconsistent evaluation results across repeated tests or fail to align with human evaluation. To address this, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages specific indicators to assess both reliability and validity. Our empirical analysis using HQM reveals and pinpoints potential evaluation issues in existing benchmarks, exposing a critical gap in current hallucination evaluation. To bridge this gap, we propose HQH, a High-Quality Hallucination benchmark, which demonstrates superior reliability and validity under HQM, serving as a credible evaluation tool. Our large-scale evaluation of popular LVLMs on HQH reveals severe hallucination problems, which occur not only in the models' main answer to a question but also in additional analysis. This highlights the necessity for future model improvements to effectively mitigate hallucinations and reduce the associated security risks in real-world applications. Our benchmark is publicly available at https://github.com/HQHBench/HQHBench.
中文标题/摘要
标题:衡量测量者:大型视觉-语言模型幻觉基准的质量评估
尽管在多模态任务中表现出色,大型视觉-语言模型(LVLMs)一直受到幻觉问题的困扰,即生成与相应视觉输入不一致的内容。虽然之前的工作提出了各种基准来评估这一问题,但这些评估的质量尚未得到验证。我们观察到,一些基准可能在重复测试中产生不一致的评估结果,或者无法与人类评估对齐。为了解决这一问题,我们提出了一种幻觉基准质量测量框架(HQM),该框架利用特定指标来评估可靠性和有效性。使用HQM的实证分析揭示并指出了现有基准中的潜在评估问题,暴露出当前幻觉评估中的关键差距。为了弥合这一差距,我们提出了HQH,一种高质量幻觉基准,它在HQM下表现出更高的可靠性和有效性,可作为可信的评估工具。我们对流行的LVLMs在HQH上的大规模评估揭示了严重的幻觉问题,这些问题不仅出现在模型对问题的主要答案中,还出现在额外分析中。这突显了未来模型改进的必要性,以有效减轻幻觉并降低实际应用中的相关安全风险。我们的基准已公开发布在https://github.com/HQHBench/HQHBench。
Summary / 总结
The paper addresses the issue of hallucination in Large Vision-Language Models (LVLMs) by proposing a framework called HQM to evaluate the quality of existing benchmarks. HQM assesses both reliability and validity, revealing inconsistencies and alignment issues in current benchmarks. Based on HQM, a new benchmark HQH is proposed, which shows superior reliability and validity. The large-scale evaluation on HQH identifies severe hallucination problems in popular LVLMs, both in main answers and additional analyses, emphasizing the need for model improvements to mitigate these issues.
论文通过提出一个名为HQM的框架来评估现有幻觉基准的质量,解决了大型视觉-语言模型(LVLM)中的幻觉问题。作者发现一些基准会产生不一致的结果,并且与人类评估不一致。他们引入了HQH,一个高质量的幻觉基准,显示了更好的可靠性和有效性。对流行的LVLM在HQH上的大规模评估揭示了严重的幻觉问题,不仅存在于主要答案中,还存在于附加分析中,强调了未来模型改进以减轻这些问题的必要性。
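The two quality notions in the entry above, consistency across repeated tests and alignment with human evaluation, can be illustrated with a simple correlation check. The abstract does not name HQM's indicators, so Pearson correlation is a swapped-in stand-in here, and all scores are hypothetical per-model values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def benchmark_quality(run_a, run_b, human):
    """Reliability: agreement between two repeated benchmark runs.
    Validity: agreement between benchmark scores and human ratings."""
    return {
        "reliability": pearson(run_a, run_b),
        "validity": pearson(run_a, human),
    }
```

A benchmark whose repeated runs rank models differently, or whose scores move against human judgments, would show up as a low or negative value in the corresponding field.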
SigVLP: Sigmoid Volume-Language Pre-Training for Self-Supervised CT-Volume Adaptive Representation Learning
Authors: Jiayi Wang, Hadrien Reynaud, Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Bjoern Menze, Bernhard Kainz
First: 2026-02-25T09:44:27+00:00 · Latest: 2026-02-25T09:44:27+00:00
Abstract
Large-scale, volumetric medical imaging datasets typically aggregate scans from different vendors and devices, resulting in highly variable resolution, slice thicknesses, and numbers of slices per study. Consequently, training representation models usually requires cropping or interpolating along the z-axis to obtain fixed-size blocks, which inevitably causes information loss. We propose a new training approach to overcome this limitation. Instead of absolute position embeddings, we interpret volumes as sequences of 3D chunks and adopt Rotary Position Embeddings, allowing us to treat the z-axis as an unconstrained temporal dimension. Building on this idea, we introduce a new vision-language model: SigVLP. In SigVLP, we implement Rotary Position Embedding as the positional encoding method, which is applied directly within the attention operation, generating input-conditioned sine and cosine weights on the fly. This design ensures consistent alignment between query and key projections and adapts to any input size. To allow for variable input size during training, we sample Computed Tomography volumes in chunks and pair them with localized organ-wise textual observations. Compared to using entire reports for conditioning, chunkwise alignment provides finer-grained supervision, enabling the model to establish stronger correlations between the text and volume representations, thereby improving the precision of text-to-volume alignment. Our models are trained with the Muon optimizer and evaluated on a diverse set of downstream tasks, including zero-shot abnormality and organ classification, segmentation, and retrieval tasks.
中文标题/摘要
标题:SigVLP: Sigmoid体素-语言预训练用于自监督CT体数据自适应表示学习
大规模、体数据医学成像数据集通常会聚合来自不同供应商和设备的扫描,导致分辨率、切片厚度和每项研究的切片数量高度变化。因此,训练表示模型通常需要沿z轴裁剪或插值以获得固定大小的块,这不可避免地会导致信息丢失。我们提出了一种新的训练方法来克服这一限制。我们不使用绝对位置嵌入,而是将体数据解释为3D块序列,并采用旋转位置嵌入,允许我们将z轴视为不受约束的时间维度。在此基础上,我们引入了一种新的视觉-语言模型:SigVLP。在SigVLP中,我们实现旋转位置嵌入作为位置编码方法,直接应用于注意力操作中,生成输入条件下的正弦和余弦权重。此设计确保查询和键投影之间的一致对齐,并适应任何输入大小。为了在训练期间允许可变输入大小,我们以块的形式采样计算机断层扫描体数据,并与局部器官级文本观察配对。与使用整个报告进行条件相比,块对齐提供了更精细的监督,使模型能够建立更强的文本与体数据表示之间的关联,从而提高文本到体数据对齐的精度。我们的模型使用Muon优化器进行训练,并在一系列下游任务上进行评估,包括零样本异常和器官分类、分割和检索任务。
Summary / 总结
The research addresses the variable resolution, slice thickness, and slice count of volumetric medical imaging datasets, which usually forces cropping or interpolation along the z-axis and causes information loss. To overcome this, the authors propose SigVLP, a new vision-language model that uses Rotary Position Embeddings to treat the z-axis as an unconstrained temporal dimension. SigVLP is trained on chunked CT volumes paired with organ-wise textual observations, which provides finer-grained supervision than conditioning on entire reports and improves the precision of text-to-volume alignment. The model is trained with the Muon optimizer and evaluated on downstream tasks including zero-shot abnormality and organ classification, segmentation, and retrieval.
研究旨在解决在具有不同分辨率和切片数量的体视医学影像数据集上训练表示模型的挑战。提出了一种新的视觉语言模型SigVLP,该模型使用旋转位置嵌入将z轴视为不受约束的时间维度。这种方法使模型能够处理可变输入大小,而无需裁剪或插值,从而保留更多信息。该模型通过将CT体积片段与器官相关的文本观察配对进行训练,从而实现更好的文本到体积对齐,并在异常和器官分类、分割和检索等下游任务上表现出色。
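The reason rotary embeddings suit variable-length inputs can be seen in a small sketch of standard one-dimensional RoPE. How SigVLP applies it to 3D chunks is not specified in the abstract, so the chunk index is simply treated as the position here, and `base=10000.0` is the conventional default rather than a value from the paper.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (vec[i], vec[i+1]) by the angle
    position * base ** (-i / d), the standard RoPE frequency schedule."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even embedding dimension")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(u, v):
    """Plain dot product, used as the attention score before softmax."""
    return sum(a * b for a, b in zip(u, v))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Same relative offset (2) at very different absolute positions.
score_near = dot(rope(q, 3), rope(k, 1))
score_far = dot(rope(q, 102), rope(k, 100))
```

Because the query-key score depends only on the offset between positions, `score_near` and `score_far` agree up to floating-point error, so no maximum sequence length has to be fixed in advance.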