LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification
Authors: Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin
First: 2026-03-04T11:36:32+00:00 · Latest: 2026-03-05T18:19:21+00:00
Abstract
Code comment classification is a critical task for automated software documentation and analysis. In the context of the NLBSE'26 Tool Competition, we present LoRA-MME, a Multi-Model Ensemble architecture utilizing Parameter-Efficient Fine-Tuning (PEFT). Our approach addresses the multi-label classification challenge across Java, Python, and Pharo by combining the strengths of four distinct transformer encoders: UniXcoder, CodeBERT, GraphCodeBERT, and CodeBERTa. By independently fine-tuning these models using Low-Rank Adaptation(LoRA) and aggregating their predictions via a learned weighted ensemble strategy, we maximize classification performance without the memory overhead of full model fine-tuning. Our tool achieved an F1 Weighted score of 0.7906 and a Macro F1 of 0.6867 on the test set. However, the computational cost of the ensemble resulted in a final submission score of 41.20%, highlighting the trade-off between semantic accuracy and inference efficiency.
Summary / 总结
Code comment classification is a critical task for automated software documentation and analysis.
VietJobs: A Vietnamese Job Advertisement Dataset
Authors: Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj
Venue: Language Resources and Evaluation Conference (LREC) 2026
First: 2026-03-05T15:12:02+00:00 · Latest: 2026-03-05T15:12:02+00:00
Comments: 10 pages
Abstract
VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.
Summary / 总结
VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam.
Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning
Authors: Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Yuhua Li, Ruixuan Li
Venue: CVPR 2026
First: 2026-03-05T14:51:52+00:00 · Latest: 2026-03-05T14:51:52+00:00
Comments: CVPR 2026
Abstract
Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate CLIP's text encoder is more suitable for cross-domain tasks, however, we find that \textbf{removing certain middle layers of the text encoder can effectively improve performance in SF-CDFSL}, which we call the Lost Layers. In this paper, we delve into this phenomenon for a deeper understanding. We discover that instead of being harmful for the SF-CDFSL task, the information in these layers is actually beneficial, but visual gaps prevent this useful information from being fully utilized, making these layers seem redundant. Based on this understanding, unlike current works that simply remove these layers, we propose a method to teachs the model to \textbf{re-utilize} information in these lost layers at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-VtT.
Summary / 总结
Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks.
SRasP: Self-Reorientation Adversarial Style Perturbation for Cross-Domain Few-Shot Learning
Authors: Wenqian Li, Pengfei Fang, Hui Xue
First: 2026-03-05T13:03:35+00:00 · Latest: 2026-03-05T13:03:35+00:00
Abstract
Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models. Existing style-based perturbation methods mitigate domain shift but often suffer from gradient instability and convergence to sharp minima.To address these limitations, we propose a novel crop-global style perturbation network, termed Self-Reorientation Adversarial \underline{S}tyle \underline{P}erturbation (SRasP). Specifically, SRasP leverages global semantic guidance to identify incoherent crops, followed by reorienting and aggregating the style gradients of these crops with the global style gradients within one image. Furthermore, we propose a novel multi-objective optimization function to maximize visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features. Applying the stabilized perturbations during training encourages convergence toward flatter and more transferable solutions, improving generalization to unseen domains. Extensive experiments are conducted on multiple CD-FSL benchmarks, demonstrating consistent improvements over state-of-the-art methods.
Summary / 总结
Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models.
Lightweight and Scalable Transfer Learning Framework for Load Disaggregation
Authors: L. E. Garcia-Marrero, G. Petrone, E. Monmasson
First: 2026-03-05T09:43:48+00:00 · Latest: 2026-03-05T09:43:48+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Non-Intrusive Load Monitoring (NILM) aims to estimate appliance-level consumption from aggregate electrical signals recorded at a single measurement point. In recent years, the field has increasingly adopted deep learning approaches; however, cross-domain generalization remains a persistent challenge due to variations in appliance characteristics, usage patterns, and background loads across homes. Transfer learning provides a practical paradigm to adapt models with limited target data. However, existing methods often assume a fixed appliance set, lack flexibility for evolving real-world deployments, remain unsuitable for edge devices, or scale poorly for real-time operation. This paper proposes RefQuery, a scalable multi-appliance, multi-task NILM framework that conditions disaggregation on compact appliance fingerprints, allowing one shared model to serve many appliances without a fixed output set. RefQuery keeps a pretrained disaggregation network fully frozen and adapts to a target home by learning only a per-appliance embedding during a lightweight backpropagation stage. Experiments on three public datasets demonstrate that RefQuery delivers a strong accuracy-efficiency trade-off against single-appliance and multi-appliance baselines, including modern Transformer-based methods. These results support RefQuery as a practical path toward scalable, real-time NILM on resource-constrained edge devices.
Summary / 总结
Non-Intrusive Load Monitoring (NILM) aims to estimate appliance-level consumption from aggregate electrical signals recorded at a single measurement point.
In-Training Defenses against Emergent Misalignment in Language Models
Authors: David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai
First: 2025-08-08T12:10:28+00:00 · Latest: 2026-03-05T09:02:55+00:00
Comments: Under review
Abstract
Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EMA that are practical for providers who expose fine-tuning via an API: We evaluate whether they a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate four training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\mathcal{l}_2$ distance in feature space, (iii) preventative steering with an evil persona vector, and (iv) interleaving training examples from a general instruct-tuning dataset. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.
Summary / 总结
Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EMA): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain.
VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters
Authors: Jiaxin Fan, Wenpo Song
First: 2026-03-05T08:51:33+00:00 · Latest: 2026-03-05T08:51:33+00:00
Abstract
Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.
Summary / 总结
Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions.
AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis
Authors: Stavros Gazetas, Giorgos Filandrianos, Maria Lymperaiou, Paraskevi Tzouveli, Athanasios Voulodimos, Giorgos Stamou
First: 2026-03-05T08:30:59+00:00 · Latest: 2026-03-05T08:30:59+00:00
Abstract
In this paper, we present AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
Summary / 总结
In this paper, we present AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework.
Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO
Authors: Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao
First: 2026-02-05T05:27:11+00:00 · Latest: 2026-03-05T07:49:45+00:00
Comments: 22 pages, 12 figures
Abstract
Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce. Existing approaches either compress reasoning into single-step, losing the interpretability that makes CoT valuable. We present a three-stage curriculum learning framework that addresses this capacity mismatch through progressive skill acquisition. First, we establish structural understanding via masked shuffled reconstruction. Second, we apply Group Relative Policy Optimization (GRPO) on masked completion tasks, enabling the model to discover its own balance between accuracy and brevity. Third, we identify persistent failure cases and guide the student to internalize teacher knowledge through targeted rewriting, again optimized with GRPO. Experiments on GSM8K demonstrate that our approach enables Qwen2.5-3B-Base to achieve an 11.29 percent accuracy improvement while reducing output length by 27.4 percent, surpassing both instruction-tuned variants and prior distillation methods.
Summary / 总结
Distilling Chain-of-Thought (CoT) reasoning from large language models into compact student models presents a fundamental challenge: teacher rationales are often too verbose for smaller models to faithfully reproduce.
Osmosis Distillation: Model Hijacking with the Fewest Samples
Authors: Yuchen Shi, Huajie Chen, Heng Xu, Zhiquan Liu, Jialiang Shen, Chi Liu, Shuai Zhou, Tianqing Zhu, Wanlei Zhou
First: 2026-03-05T06:34:06+00:00 · Latest: 2026-03-05T06:34:06+00:00
Abstract
Transfer learning is devised to leverage knowledge from pre-trained models to solve new tasks with limited data and computational resources. Meanwhile, dataset distillation has emerged to synthesize a compact dataset that preserves critical information from the original large dataset. Therefore, a combination of transfer learning and dataset distillation offers promising performance in evaluations. However, a non-negligible security threat remains undiscovered in transfer learning using synthetic datasets generated by dataset distillation methods, where an adversary can perform a model hijacking attack with only a few poisoned samples in the synthetic dataset. To reveal this threat, we propose Osmosis Distillation (OD) attack, a novel model hijacking strategy that targets deep learning models using the fewest samples. Comprehensive evaluations on various datasets demonstrate that the OD attack attains high attack success rates in hidden tasks while preserving high model utility in original tasks. Furthermore, the distilled osmosis set enables model hijacking across diverse model architectures, allowing model hijacking in transfer learning with considerable attack performance and model utility. We argue that awareness of using third-party synthetic datasets in transfer learning must be raised.
Summary / 总结
Transfer learning is devised to leverage knowledge from pre-trained models to solve new tasks with limited data and computational resources.
Detection of Illicit Content on Online Marketplaces using Large Language Models
Authors: Quoc Khoa Tran, Thanh Thi Nguyen, Campbell Wilson
First: 2026-03-05T01:15:03+00:00 · Latest: 2026-03-05T01:15:03+00:00
Comments: Accepted for publication in the Proceedings of the 8th International Conference on Natural Language Processing (ICNLP 2026)
Abstract
Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta's Llama 3.2 and Google's Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models. These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.
Summary / 总结
Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes.
Steering Awareness: Models Can Be Trained to Detect Activation Steering
Authors: Joshua Fonseca Rivera, David Demitri Africa
First: 2025-11-26T13:49:43+00:00 · Latest: 2026-03-04T23:33:38+00:00
Comments: 26 pages, 11 figures, 16 tables
Abstract
Activation steering - adding a vector to a language model's residual stream - is widely used to elicit latent behaviors and to probe safety-relevant properties. Many steering-based evaluations implicitly assume that the model cannot tell when such an intervention has occurred. We test this assumption by fine-tuning models to report (i) whether a steering vector was injected and (ii) which concept was injected, a capability we call steering awareness. Across seven open-source instruction-tuned models, the best achieves 95.5% detection on held-out concepts and 71.2% concept identification, with no false positives on our clean controls. We find that such detection transfers to novel vectors extracted by methods that produce directions aligned with contrastive activation addition, but fail for geometrically dissimilar methods. Crucially, detection does not confer behavioral robustness; detection-trained models are consistently more susceptible to steering in realistic settings than their base counterparts. Mechanistically, steering awareness arises from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. These findings suggest that activation steering cannot be assumed to remain an undetectable intervention, with implications for the long-term reliability of steering-based safety evaluations and interpretability techniques more broadly.
Summary / 总结
Activation steering - adding a vector to a language model's residual stream - is widely used to elicit latent behaviors and to probe safety-relevant properties.
Improving the accuracy of physics-informed neural networks via last-layer retraining
Authors: Saad Qadeer, Panos Stinis
First: 2026-03-04T23:28:19+00:00 · Latest: 2026-03-04T23:28:19+00:00
Comments: Approved for release by Pacific Northwest National Laboratory
Abstract
Physics-informed neural networks (PINNs) are a versatile tool in the burgeoning field of scientific machine learning for solving partial differential equations (PDEs). However, determining suitable training strategies for them is not obvious, with the result that they typically yield moderately accurate solutions. In this article, we propose a method for improving the accuracy of PINNs by coupling them with a post-processing step that seeks the best approximation in a function space associated with the network. We find that our method yields errors four to five orders of magnitude lower than those of the parent PINNs across architectures and dimensions. Moreover, we can reuse the basis functions for the linear space in more complex settings, such as time-dependent and nonlinear problems, allowing for transfer learning. Out approach also provides a residual-based metric that allows us to optimally choose the number of basis functions employed.
Summary / 总结
Physics-informed neural networks (PINNs) are a versatile tool in the burgeoning field of scientific machine learning for solving partial differential equations (PDEs).
Robust Unscented Kalman Filtering via Recurrent Meta-Adaptation of Sigma-Point Weights
Authors: Kenan Majewski, Michał Modzelewski, Marcin Żugaj, Piotr Lichota
First: 2026-03-04T18:27:59+00:00 · Latest: 2026-03-04T18:27:59+00:00
Comments: 8 pages, 3 figures, Submitted to the 29th International Conference on Information Fusion (FUSION 2026)
Abstract
The Unscented Kalman Filter (UKF) is a ubiquitous tool for nonlinear state estimation; however, its performance is limited by the static parameterization of the Unscented Transform (UT). Conventional weighting schemes, governed by fixed scaling parameters, assume implicit Gaussianity and fail to adapt to time-varying dynamics or heavy-tailed measurement noise. This work introduces the Meta-Adaptive UKF (MA-UKF), a framework that reformulates sigma-point weight synthesis as a hyperparameter optimization problem addressed via memory-augmented meta-learning. Unlike standard adaptive filters that rely on instantaneous heuristic corrections, our approach employs a Recurrent Context Encoder to compress the history of measurement innovations into a compact latent embedding. This embedding informs a policy network that dynamically synthesizes the mean and covariance weights of the sigma points at each time step, effectively governing the filter's trust in the prediction versus the measurement. By optimizing the system end-to-end through the filter's recursive logic, the MA-UKF learns to maximize tracking accuracy while maintaining estimation consistency. Numerical benchmarks on maneuvering targets demonstrate that the MA-UKF significantly outperforms standard baselines, exhibiting superior robustness to non-Gaussian glint noise and effective generalization to out-of-distribution (OOD) dynamic regimes unseen during training.
Summary / 总结
The Unscented Kalman Filter (UKF) is a ubiquitous tool for nonlinear state estimation; however, its performance is limited by the static parameterization of the Unscented Transform (UT).
Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models
Authors: Liangwei Yang, Shiyu Wang, Haolin Chen, Rithesh Murthy, Ming Zhu, Jielin Qiu, Zixiang Chen, Juntao Tan, Jianguo Zhang, Zhiwei Liu, Wenting Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke
First: 2026-03-04T17:08:47+00:00 · Latest: 2026-03-04T17:08:47+00:00
Abstract
As large language models (LLMs) transition from research prototypes to real-world systems, customization has emerged as a central bottleneck. While text prompts can already customize LLM behavior, we argue that text-only prompting does not constitute a suitable control interface for scalable, stable, and inference-only customization. This position paper argues that model providers should expose \emph{vector prompt inputs} as part of the public interface for customizing LLMs. We support this position with diagnostic evidence showing that vector prompt tuning continues to improve with increasing supervision whereas text-based prompt optimization saturates early, and that vector prompts exhibit dense, global attention patterns indicative of a distinct control mechanism. We further discuss why inference-only customization is increasingly important under realistic deployment constraints, and why exposing vector prompts need not fundamentally increase model leakage risk under a standard black-box threat model. We conclude with a call to action for the community to rethink prompt interfaces as a core component of LLM customization.
Summary / 总结
As large language models (LLMs) transition from research prototypes to real-world systems, customization has emerged as a central bottleneck.
Traces of Social Competence in Large Language Models
Authors: Tom Kouwenhoven, Michiel van der Meer, Max van Duijn
First: 2026-03-04T15:19:27+00:00 · Latest: 2026-03-04T15:19:27+00:00
Abstract
The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al. 2023) using Bayesian Logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented finetuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.
Summary / 总结
The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies.
Dutch Metaphor Extraction from Cancer Patients' Interviews and Forum Data using LLMs and Human in the Loop
Authors: Lifeng Han, David Lindevelt, Sander Puts, Erik van Mulligen, Suzan Verberne
First: 2025-11-09T15:49:47+00:00 · Latest: 2026-03-04T15:12:08+00:00
Comments: Ongoing project report, on behalf of 4D PICTURE https://4dpicture.eu/
Abstract
Metaphors and metaphorical language (MLs) play an important role in healthcare communication between clinicians, patients, and patients' family members. In this work, we focus on Dutch language data from cancer patients. We extract metaphors used by patients using two data sources: (1) cancer patient storytelling interview data and (2) online forum data, including patients' posts, comments, and questions to professionals. We investigate how current state-of-the-art large language models (LLMs) perform on this task by exploring different prompting strategies such as chain of thought reasoning, few-shot learning, and self-prompting. With a human-in-the-loop setup, we verify the extracted metaphors and compile the outputs into a corpus named HealthQuote.NL. We believe the extracted metaphors can support better patient care, for example shared decision making, improved communication between patients and clinicians, and enhanced patient health literacy. They can also inform the design of personalized care pathways. We share prompts and related resources at https://github.com/4dpicture/HealthQuote.NL
Summary / 总结
Metaphors and metaphorical language (MLs) play an important role in healthcare communication between clinicians, patients, and patients' family members.
A Bayesian Framework for Active Tactile Object Recognition, Pose Estimation and Shape Transfer Learning
Authors: Haodong Zheng, Andrei Jalba, Raymond H. Cuijpers, Wijnand IJsselsteijn, Sanne Schoenmakers
First: 2024-09-10T23:35:30+00:00 · Latest: 2026-03-04T15:02:27+00:00
Abstract
As humans can explore and understand the world through active touch, similar capability is desired for robots. In this paper, we address the problem of active tactile object recognition, pose estimation and shape transfer learning, where a customized particle filter (PF) and Gaussian process implicit surface (GPIS) is combined in a unified Bayesian framework. Upon new tactile input, the customized PF updates the joint distribution of the object class and object pose while tracking the novelty of the object. Once a novel object is identified, its shape will be reconstructed using GPIS. By grounding the prior of the GPIS with the maximum-a-posteriori (MAP) estimation from the PF, the knowledge about known shapes can be transferred to learn novel shapes. An exploration procedure based on global shape estimation is proposed to guide active data acquisition and terminate the exploration upon sufficient information. Through experiments in simulation, the proposed framework demonstrated its effectiveness and efficiency in estimating object class and pose for known objects and learning novel shapes. Furthermore, it can recognize previously learned shapes reliably.
Summary / 总结
As humans can explore and understand the world through active touch, similar capability is desired for robots.
Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
Authors: Dongnuan Cai, Henghui Du, Chang Zhou, Xi Chen, Dan Guo, Hongyuan Zhang, Xuelong Li, Di Hu
First: 2026-03-04T14:43:57+00:00 · Latest: 2026-03-04T14:43:57+00:00
Abstract
Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence. While instruction tuning enables pre-trained models with multi-task abilities, we observe that conventional multi-task unification methods often suffer from severe negative transfer, where nearly 55% of tasks degrade compared to single-task training. We attribute this phenomenon to audio-visual task heterogeneity, characterized by disparate task granularity and divergent capability demands, which lead to negative interference under joint training. To tackle this, we present Crab$^{+}$, a scalable and unified audio-visual scene understanding model that addresses task heterogeneity through explicit cooperation from both data and model perspectives. On the data side, we introduce AV-UIE v2, a comprehensive Audio-Visual Unified Instruction-tuning dataset with Explicit reasoning processes. It contains approximately 222K samples spanning 17 datasets and 7 tasks, enabling the model to capture cross-task relationships at different levels of granularity. On the model side, we design a unified interface to align heterogeneous task formulations, and propose Interaction-aware LoRA (I-LoRA), which explicitly models inter-task relationships via dynamic routing to coordinate distinct audio-visual interaction patterns, mitigating parameter interference. Extensive experiments show Crab$^{+}$ covers broader tasks than existing unified models while outperforming specialized models on various benchmarks. We successfully reverse the negative transfer trend, achieving positive transfer where multi-task learning surpasses single-task baselines in nearly 88% of tasks. These results hold across diverse AV-LLM paradigms and are validated through in-depth visualization, positioning Crab$^{+}$ as a robust step towards holistic audio-visual scene understanding.
Summary / 总结
Developing Audio-Visual Large Language Models (AV-LLMs) for unified scene understanding is pivotal in multimodal intelligence.
Inference-Time Toxicity Mitigation in Protein Language Models
Authors: Manuel Fernández Burda, Santiago Aranguri, Iván Arcuschin Moreno, Enzo Ferrante
First: 2026-03-04T13:28:51+00:00 · Latest: 2026-03-04T13:28:51+00:00
Abstract
Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns. We show that domain adaptation to specific taxonomic groups can elicit toxic protein generation, even when toxicity is not the training objective. To address this, we adapt Logit Diff Amplification (LDA) as an inference-time control mechanism for PLMs. LDA modifies token probabilities by amplifying the logit difference between a baseline model and a toxicity-finetuned model, requiring no retraining. Across four taxonomic groups, LDA consistently reduces predicted toxicity rate (measured via ToxDL2) below the taxon-finetuned baseline while preserving biological plausibility. We evaluate quality using Fréchet ESM Distance and predicted foldability (pLDDT), finding that LDA maintains distributional similarity to natural proteins and structural viability (unlike activation-based steering methods that tend to degrade sequence properties). Our results demonstrate that LDA provides a practical safety knob for protein generators that mitigates elicited toxicity while retaining generative quality.
Summary / 总结
Protein language models (PLMs) are becoming practical tools for de novo protein design, yet their dual-use potential raises safety concerns.
Black Box Meta-Learning Intrinsic Rewards
Authors: Octavio Pappalardo, Rodrigo Ramele, Juan Miguel Santos
First: 2024-07-31T12:09:33+00:00 · Latest: 2026-03-04T11:49:12+00:00
Comments: Improved exposition; no technical changes to content
Abstract
The broader application of reinforcement learning (RL) is limited by challenges including data efficiency, generalization capability, and ability to learn in sparse-reward environments. Meta-learning has emerged as a promising approach to address these issues by optimizing components of the learning algorithm to meet desired characteristics. Additionally, a different line of work has extensively studied the use of intrinsic rewards to enhance the exploration capabilities of algorithms. This work investigates how meta-learning can improve the training signal received by RL agents. We introduce a method to learn intrinsic rewards within a reinforcement learning framework that bypasses the typical computation of meta-gradients through an optimization process by treating policy updates as black boxes. We validate our approach against training with extrinsic rewards, demonstrating its effectiveness, and additionally compare it to the use of a meta-learned advantage function. Experiments are carried out on distributions of continuous control tasks with both parametric and non-parametric variations. Furthermore, only sparse rewards are used during evaluation. Code is available at: https: //github.com/Octavio-Pappalardo/Meta-learning-rewards
Summary / 总结
The broader application of reinforcement learning (RL) is limited by challenges including data efficiency, generalization capability, and ability to learn in sparse-reward environments.
Dictionary Based Pattern Entropy for Causal Direction Discovery
Authors: Harikrishnan N B, Shubham Bhilare, Aditi Kathpalia, Nithin Nagaraj
First: 2026-03-04T09:53:39+00:00 · Latest: 2026-03-04T09:53:39+00:00
Comments: 13 pages
Abstract
Discovering causal direction from temporal observational data is particularly challenging for symbolic sequences, where functional models and noise assumptions are often unavailable. We propose a novel \emph{Dictionary Based Pattern Entropy ($DPE$)} framework that infers both the direction of causation and the specific subpatterns driving changes in the effect variable. The framework integrates \emph{Algorithmic Information Theory} (AIT) and \emph{Shannon Information Theory}. Causation is interpreted as the emergence of compact, rule based patterns in the candidate cause that systematically constrain the effect. $DPE$ constructs direction-specific dictionaries and quantifies their influence using entropy-based measures, enabling a principled link between deterministic pattern structure and stochastic variability. Causal direction is inferred via a minimum-uncertainty criterion, selecting the direction exhibiting stronger and more consistent pattern-driven organization. As summarized in Table 7, $DPE$ consistently achieves reliable performance across diverse synthetic systems, including delayed bit-flip perturbations, AR(1) coupling, 1D skew-tent maps, and sparse processes, outperforming or matching competing AIT-based methods ($ETC_E$, $ETC_P$, $LZ_P$). In biological and ecological datasets, performance is competitive, while alternative methods show advantages in specific genomic settings. Overall, the results demonstrate that minimizing pattern level uncertainty yields a robust, interpretable, and broadly applicable framework for causal discovery.
Summary / 总结
Discovering causal direction from temporal observational data is particularly challenging for symbolic sequences, where functional models and noise assumptions are often unavailable.
Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect
Authors: Minh Duc Bui, Manuel Mager, Peter Herbert Kann, Katharina von der Wense
First: 2026-02-18T20:29:02+00:00 · Latest: 2026-03-04T06:52:31+00:00
Comments: Accepted at LREC 2026
Abstract
Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out-a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary-an NLP-ready dataset derived from an existing resource (Schramm, 1966)-to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.
Summary / 总结
Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany.
Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment
Authors: Shravani Hariprasad
First: 2026-03-01T04:37:48+00:00 · Latest: 2026-03-04T06:10:32+00:00
Comments: 30 pages, 7 figures, 2 tables
Abstract
Small open-source language models are gaining attention for healthcare applications in low-resource settings where cloud infrastructure and GPU hardware may be unavailable. However, their reliability under different prompt phrasings remains poorly understood. We evaluate five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B, a domain-pretrained model without instruction tuning) across three clinical question answering datasets (MedQA, MedMCQA, and PubMedQA) using five prompt styles: original, formal, simplified, roleplay, and direct. Model behavior is evaluated using consistency scores, accuracy, and instruction-following failure rates. All experiments were conducted locally on consumer CPU hardware without fine-tuning.
Consistency and accuracy were largely independent across models. Gemma 2 achieved the highest consistency (0.845-0.888) but the lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) alongside the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), indicating that domain pretraining alone is insufficient for structured clinical QA.
These findings show that high consistency does not imply correctness: models can be reliably wrong, a dangerous failure mode in clinical AI. Llama 3.2 demonstrated the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
Summary / 总结
Small open-source language models are gaining attention for healthcare applications in low-resource settings where cloud infrastructure and GPU hardware may be unavailable.
Succeeding at Scale: Automated Dataset Construction and Query-Side Adaptation for Multi-Tenant Search
Authors: Prateek Jain, Shabari S Nair, Ritesh Goru, Prakhar Agarwal, Ajay Yadav, Yoga Sri Varshan Varadharajan, Constantine Caramanis
First: 2026-01-08T06:44:40+00:00 · Latest: 2026-03-03T21:42:18+00:00
Abstract
Large-scale multi-tenant retrieval systems generate extensive query logs but lack curated relevance labels for effective domain adaptation, resulting in substantial underutilized "dark data". This challenge is compounded by the high cost of model updates, as jointly fine-tuning query and document encoders requires full corpus re-indexing, which is impractical in multi-tenant settings with thousands of isolated indices. We introduce DevRev-Search, a passage retrieval benchmark for technical customer support built via a fully automated pipeline. Candidate generation uses fusion across diverse sparse and dense retrievers, followed by an LLM-as-a-Judge for consistency filtering and relevance labeling. We further propose an Index-Preserving Adaptation strategy that fine-tunes only the query encoder, achieving strong performance gains while keeping document indices fixed. Experiments on DevRev-Search, SciFact, and FiQA-2018 show that Parameter-Efficient Fine-Tuning (PEFT) of the query encoder delivers a remarkable quality-efficiency trade-off, enabling scalable and practical enterprise search adaptation.
Summary / 总结
Large-scale multi-tenant retrieval systems generate extensive query logs but lack curated relevance labels for effective domain adaptation, resulting in substantial underutilized "dark data".
Test-Time Meta-Adaptation with Self-Synthesis
Authors: Zeyneb N. Kaya, Nick Rui
Venue: ICLR 2026
First: 2026-03-03T21:16:18+00:00 · Latest: 2026-03-03T21:16:18+00:00
Comments: 5 pages, 2 figures, 1 table. Accepted to ICLR 2026 LIT and Data-FM Workshops
Abstract
As strong general reasoners, large language models (LLMs) encounter diverse domains and tasks, where the ability to adapt and self-improve at test time is valuable. We introduce MASS, a meta-learning framework that enables LLMs to self-adapt by generating problem-specific synthetic training data and performing targeted self-updates optimized for downstream performance at inference time. We train this behavior end-to-end via bilevel optimization: an inner loop adapts on self-generated examples while an outer loop meta-learns data-attribution signals and rewards post-update task performance. The synthetic data is optimized with scalable meta-gradients, backpropagating the downstream loss through the inner updates to reward useful generations. Experiments on mathematical reasoning show that MASS learns to synthesize per-instance curricula that yield effective, data-efficient test-time adaptation.
Summary / 总结
As strong general reasoners, large language models (LLMs) encounter diverse domains and tasks, where the ability to adapt and self-improve at test time is valuable.
Guiding Sparse Neural Networks with Neurobiological Principles to Elicit Biologically Plausible Representations
Authors: Patrick Inoue, Florian Röhrbein, Andreas Knoblauch
First: 2026-03-03T18:27:37+00:00 · Latest: 2026-03-03T18:27:37+00:00
Abstract
While deep neural networks (DNNs) have achieved remarkable performance in tasks such as image recognition, they often struggle with generalization, learning from few examples, and continuous adaptation - abilities inherent in biological neural systems. These challenges arise due to DNNs' failure to emulate the efficient, adaptive learning mechanisms of biological networks. To address these issues, we explore the integration of neurobiologically inspired assumptions in neural network learning. This study introduces a biologically inspired learning rule that naturally integrates neurobiological principles, including sparsity, lognormal weight distributions, and adherence to Dale's law, without requiring explicit enforcement. By aligning with these core neurobiological principles, our model enhances robustness against adversarial attacks and demonstrates superior generalization, particularly in few-shot learning scenarios. Notably, integrating these constraints leads to the emergence of biologically plausible neural representations, underscoring the efficacy of incorporating neurobiological assumptions into neural network design. Preliminary results suggest that this approach could extend from feature-specific to task-specific encoding, potentially offering insights into neural resource allocation for complex tasks.
Summary / 总结
While deep neural networks (DNNs) have achieved remarkable performance in tasks such as image recognition, they often struggle with generalization, learning from few examples, and continuous adaptation - abilities inherent in biological neural systems.
No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models
Authors: Omer Sela
First: 2026-03-03T17:55:24+00:00 · Latest: 2026-03-03T17:55:24+00:00
Comments: 8 pages main text, 5 pages appendix, 9 figures, 7 tables. Code available at https://github.com/Sela-Omer/Contamination-Detection-Small-LM
Abstract
CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model's sampled outputs. We study the conditions under which this approach succeeds and fails on small language models ranging from 70M to 410M parameters. Using controlled contamination experiments on GSM8K, HumanEval, and MATH, we find that CDD's effectiveness depends critically on whether fine-tuning produces verbatim memorization. With low-rank adaptation, models can learn from contaminated data without memorizing it, and CDD performs at chance level even when the data is verifiably contaminated. Only when fine-tuning capacity is sufficient to induce memorization does CDD recover strong detection accuracy. Our results characterize a memorization threshold that governs detectability and highlight a practical consideration: parameter-efficient fine-tuning can produce contamination that output-distribution methods do not detect. Our code is available at https://github.com/Sela-Omer/Contamination-Detection-Small-LM
Summary / 总结
CDD, or Contamination Detection via output Distribution, identifies data contamination by measuring the peakedness of a model's sampled outputs.
CoPeP: Benchmarking Continual Pretraining for Protein Language Models
Authors: Darshan Patil, Pranshu Malviya, Mathieu Reymond, Quentin Fournier, Sarath Chandar
First: 2026-02-27T19:04:32+00:00 · Latest: 2026-03-03T17:50:01+00:00
Comments: 29 pages, 25 figures
Abstract
Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery. These models learn from large protein databases that are continuously updated by the biology community and whose dynamic nature motivates the application of continual learning, not only to keep up with the ever-growing data, but also as an opportunity to take advantage of the temporal meta-information that is created during this process. As a result, we introduce the Continual Pretraining of Protein Language Models (CoPeP) benchmark, a novel benchmark for evaluating continual learning approaches on pLMs. Specifically, we curate a sequence of protein datasets derived from the UniProt Knowledgebase spanning a decade and define metrics to assess pLM performance across 31 protein understanding tasks. We evaluate several methods from the continual learning literature, including replay, unlearning, and plasticity-based methods, some of which have never been applied to models and data of this scale. Our findings reveal that incorporating temporal meta-information improves perplexity by up to 7% even when compared to training on data from all tasks jointly. Moreover, even at scale, several continual learning methods outperform naive continual pretraining. The CoPeP benchmark offers an exciting opportunity to study these methods at scale in an impactful real-world application.
Summary / 总结
Protein language models (pLMs) have recently gained significant attention for their ability to uncover relationships between sequence, structure, and function from evolutionary statistics, thereby accelerating therapeutic drug discovery.
Compact Prompting in Instruction-tuned LLMs for Joint Argumentative Component Detection
Authors: Sofiane Elguendouze, Erwan Hain, Elena Cabrio, Serena Villata
First: 2026-03-03T15:32:58+00:00 · Latest: 2026-03-03T15:32:58+00:00
Comments: Under Review (COLM 2026)
Abstract
Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises. While research on this subtask remains relatively limited compared to other AM tasks, most existing approaches formulate it as a simplified sequence labeling problem, component classification, or a pipeline of component segmentation followed by classification. In this paper, we propose a novel approach based on instruction-tuned Large Language Models (LLMs) using compact instruction-based prompts, and reframe ACD as a language generation task, enabling arguments to be identified directly from plain text without relying on pre-segmented components. Experiments on standard benchmarks show that our approach achieves higher performance compared to state-of-the-art systems. To the best of our knowledge, this is one of the first attempts to fully model ACD as a generative task, highlighting the potential of instruction tuning for complex AM problems.
Summary / 总结
Argumentative component detection (ACD) is a core subtask of Argument(ation) Mining (AM) and one of its most challenging aspects, as it requires jointly delimiting argumentative spans and classifying them into components such as claims and premises.