Real Images, Worse Judgments: Evaluating Vision-Language Models on Concreteness and Imagery
Authors: Yifan Jiang, Ruoxi Ning, Sheng Yao, Freda Shi
First: 2026-05-26T17:24:59+00:00 · Latest: 2026-05-26T17:24:59+00:00
Abstract
Visual inputs are often assumed to improve language understanding in multimodal models. We examine this assumption by asking whether vision-language models (VLMs) can distinguish useful visual evidence from incidental image context in lexical judgments. We use human concreteness and imagery ratings because they span words with varying expected visual relevance, from abstract and low-imagery words to concrete and high-imagery words. We find that real-image contexts do not yield consistent gains and often hurt alignment with human ratings, most sharply when visual evidence is least relevant. Through probing and canonical correlation analysis, complemented by an attribution case study, we find that real-image contexts are associated with representational shifts and greater sensitivity to spurious visual cues, coinciding with weaker recoverability of the targeted lexical properties. We further show that instructing models to focus solely on textual content at inference time can reduce this degradation, with the clearest gains on these vulnerable subsets. Our findings suggest that current instruction-tuned VLMs need better calibration of when visual context should inform lexical judgments.
Summary / 总结
Visual inputs are often assumed to improve language understanding in multimodal models.
Transfer Learning using 66 Diseases for Disease Forecasting Applications
Authors: Lauren J Beesley, Alexander C Murph, Dave Osthus, Lauren A Castro
First: 2026-05-26T16:45:21+00:00 · Latest: 2026-05-26T16:45:21+00:00
Abstract
Disease forecasting models typically rely on a single data stream, making models brittle when histories are short or noisy. Recent top-performing models have shown that synthesizing multiple reporting systems for the same disease improves performance. Other recent work takes this idea a step further, using transfer learning to train a forecasting model for one disease using data from a different disease. We expand upon each of these approaches greatly, training machine learning models on data that span 66 infectious diseases and several data streams. We investigate the value of incorporating different data streams for forecasting 20 different disease data streams. We find that incorporating other data streams improves forecasting in the vast majority (84.9%) of time series and model structures considered. However, our work highlights that the quality of the added data matters, where adding data extremely different from the target data stream can sometimes degrade forecast performance. A major contribution of this work is in compiling a publicly-available database of data for use by the infectious disease forecasting community.
Summary / 总结
Disease forecasting models typically rely on a single data stream, making models brittle when histories are short or noisy.
Mechanistic Interpretability of Antibody Language Models Using SAEs
Authors: Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane
First: 2025-12-05T15:18:50+00:00 · Latest: 2026-05-26T15:50:47+00:00
Comments: v3: 15 pages; corrected author list and affiliations in the main text; minor text changes; updated steering results following minor code changes; conclusions and findings remain unchanged; included link to data and code in the Data Availability section
Abstract
Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models, and steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.
Summary / 总结
Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within large protein language models.
Using reasoning LLMs to extract SDOH events from clinical notes
Authors: Ertan Dogan, Kunyu Yu, Yifan Peng
First: 2026-04-15T05:40:24+00:00 · Latest: 2026-05-26T15:36:36+00:00
Abstract
Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.
Summary / 总结
Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age.
BhashaSetu: A Data-Centric Approach to Low-Resource Machine Translation
Authors: Param Thakkar, Anushka Yadav, Michael Tiemann, Abhi Mehta, Akshita Bhasin, Shrinivas Khedkar
First: 2026-05-26T14:03:20+00:00 · Latest: 2026-05-26T14:03:20+00:00
Abstract
We present BhashaSetu, a linguistically enriched English--Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT). Marathi, spoken by over 95 million people, remains underrepresented in high-quality parallel corpora across diverse domains. Our dataset comprises 2.78 million sentence pairs from heterogeneous sources including news, politics, healthcare, literature, and culture, with stemmed and lemmatized representations to support morphology-aware analysis. We benchmark multiple state-of-the-art translation models using BLEU, spBLEU, chrF++, and TER metrics, and conduct parameter-efficient fine-tuning of NLLB-200-distilled-600M using LoRA. A key finding from our ablation: corpus-level deduplication is the single largest preprocessing contributor to downstream quality (removing it reduces performance by 1.17 BLEU and 2.21 chrF++), demonstrating that disciplined cross-source corpus hygiene is a low-cost, high-impact intervention for low-resource, morphologically rich languages. The dataset is publicly released to promote reproducible and linguistically informed low-resource NMT research.
Summary / 总结
We present BhashaSetu, a linguistically enriched English--Marathi parallel dataset addressing persistent data limitations in low-resource neural machine translation (NMT).
On the Sensitivity of Instruction-tuned LLMs to Harmful Sentences in Long Inputs
Authors: Faeze Ghorbanpour, Alexander Fraser
First: 2025-10-07T12:33:21+00:00 · Latest: 2026-05-26T13:54:15+00:00
Abstract
Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood. We present a sensitivity analysis that probes how LLMs extract harmful sentences embedded in long inputs. We construct long inputs by combining neutral and harmful sentences, and systematically vary four factors: input length (600--30,000 tokens), the proportion of harmful sentences (0.01--0.50), harm realization (explicit vs. implicit), and the position of harmful sentences within the input (beginning, middle, end), enabling a controlled stress-test evaluation. Experiments across toxic, offensive, and hate content, and across LLaMA-3.1, Qwen-2.5, and Mistral, reveal consistent patterns: sensitivity is non-monotonic with respect to harmful prevalence, peaking at moderate levels; sensitivity degrades as input length increases; harmful sentences placed earlier in the input are more strongly prioritized; and explicit harm is more reliably identified than implicit harm. These findings provide a systematic view of how LLMs prioritize harmful sentences in long input under controlled stress conditions, highlighting both emerging strengths and remaining challenges for safety-related use.
Summary / 总结
Large language models (LLMs) increasingly operate on long inputs, yet their behavior when harmful sentences are sparsely embedded within such inputs remains poorly understood.
Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes
Authors: Chen Ling, Tongwei Zhang, Hanqian Li, Nai Ding
First: 2026-01-12T17:15:45+00:00 · Latest: 2026-05-26T12:27:32+00:00
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested. To address this gap, we introduce CAIT, a benchmark comprising 400 high-fidelity synthetic scenes focused on counter-intuitive visual actions, such as ``a rabbit is chasing a tiger'', where visual evidence explicitly contradicts common-sense expectations. We evaluate human, leading proprietary models (e.g., Claude and Gemini), and 14 representative open-source MLLMs. Humans achieve near-perfect performance (around 0.95 accuracy) and proprietary models demonstrate robust understanding (achieving up to 0.88 accuracy), standard open-source instruction-tuned models perform at the chance level. Further analysis demonstrates that this failure is driven by a strong language prior: rather than trusting the visual input, they automatically override the anomalous visual signals with statistically common text descriptions. Although introducing Chain-of-Thought reasoning mechanisms can improve accuracy, it significantly slows down the response and generates a new failure mode: models overthink the scenario and refuse to accept the actual visual content simply because it violates real-world physical laws. Finally, we demonstrate that targeted fine-tuning and structured prompting can effectively mitigate this reliance on language priors, enabling open-source models to accurately ground their reasoning in actual visual evidence.
Summary / 总结
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested.
Telenor Nordics Customer Service self-help corpus
Authors: Mike Riess
First: 2026-05-26T11:52:55+00:00 · Latest: 2026-05-26T11:52:55+00:00
Comments: 8 pages, 2 figures, 5 tables. Submitted to Nordic Machine Intelligence. Dataset: https://zenodo.org/records/19493152
Abstract
This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens. The documents have been sourced from the public self-help pages of four Nordic telecommunications operators and subsequently filtered for person-identifiable information and relevance through a combined LLM and human annotation pipeline. Domain-specific datasets for Nordic languages remain scarce, particularly in customer service: a domain of growing importance for retrieval-augmented generation, cross-lingual transfer learning, and emerging agent-based service architectures. An analysis of the corpus reveals substantial variation in document length and structure across operators, reflecting distinct editorial strategies, as well as broad topical coverage spanning network hardware, mobile services, TV and streaming, billing, and account management. The dataset is publicly available under a CC-BY-NC-SA-4.0 license at https://zenodo.org/records/19493152, intended to support reproducible research in Nordic NLP and information retrieval.
Summary / 总结
This paper presents a multilingual customer service self-help corpus comprising 1,122 manually validated documents in Finnish, Danish, Norwegian, and Swedish, totaling over one million tokens.
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
Authors: Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp
First: 2026-05-26T09:41:15+00:00 · Latest: 2026-05-26T09:41:15+00:00
Abstract
Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in large reasoning models (LRMs) additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and CoT. This joint activation makes LRM more robust against activation-level interventions alone, but exposes CoT to a possible alternative surface attack.
Summary / 总结
Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal.
TFD: A Comprehensive Structured Tibetan Foundation Dataset for Low-Resource Language Processing and Large-Scale Modeling
Authors: Cheng Huang, Fan Gao, Nyima Tashi, Yutong Liu, Yadi Liu, Wenbin Wei, Xiangxiang Wang, Yongbin Yu
First: 2025-03-24T02:17:41+00:00 · Latest: 2026-05-26T08:47:50+00:00
Abstract
Large Language Models (LLMs) have achieved remarkable success in high-resource languages, yet progress in Tibetan remains severely constrained. While recent efforts have begun to address pre-training data scarcity for Tibetan, a more fundamental gap persists: no existing resource supports the complete LLM development pipeline, spanning pre-training, instruction tuning, safety alignment, preference optimization, and reasoning supervision. We introduce the Tibetan Foundation Dataset (TFD), the first structured, large-scale, and expert-curated dataset covering all key stages of Tibetan large language modeling. TFD comprises TIBSTC, a unified corpus of over 11 billion tokens with curated sub-datasets for instruction tuning, safety alignment, and preference optimization, and TIBSTC-CoT, the first large-scale Tibetan chain-of-thought dataset. We demonstrate its utility by training the Sun-Shine family of Tibetan LLMs, achieving substantial improvements over strong baselines on understanding, safety, reasoning, and generation benchmarks. These results underscore that advancing low-resource language modeling requires not only scale, but a structurally complete data ecosystem. We release TFD to facilitate reproducible research and the development of robust, culturally aligned Tibetan LLMs. Code and data are available at https://github.com/Vicentvankor/sun-shine.
Summary / 总结
Large Language Models (LLMs) have achieved remarkable success in high-resource languages, yet progress in Tibetan remains severely constrained.
Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets
Authors: Ashima Khanna, Dominik Grimm
First: 2026-05-26T08:29:36+00:00 · Latest: 2026-05-26T08:29:36+00:00
Abstract
Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git
Summary / 总结
Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative.
Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models
Authors: Xuan Lin, Long Chen, Yile Wang, Yangyang Chen, Xiangxiang Zeng
First: 2024-12-24T01:48:07+00:00 · Latest: 2026-05-26T07:33:20+00:00
Comments: 9
Abstract
Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manual annotation for biochemical properties, the performance for molecule generation tasks is still limited, especially for tasks involving multi-properties constraints. In this work, we present a two-step framework PEIT (\textbf{P}roperty \textbf{E}nhanced \textbf{I}nstruction \textbf{T}uning) to improve LLMs for molecular-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, by aligning multi-modal representations to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data, the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks. Experimental results show that our pre-trained PEIT-GEN outperforms MolT5, BioT5, MolCA and Text+Chem-T5 in molecule captioning, demonstrating modalities align well between textual descriptions, structures, and biochemical properties. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, demonstrating the effectiveness of the PEIT framework for molecular tasks. The code and appendix are available at https://github.com/chenlong164/PEIT.
Summary / 总结
Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation.
LATTE: Forecasting Peer Anchored Preference Trajectories for Personalized LLM Generation
Authors: Jinze Li, Xiaoyan Yang, Shuo Yang, Jinfeng Xu, Yue Shen, Jian Wang, Jinjie Gu, Edith Cheuk-Han Ngai
First: 2026-05-26T06:50:45+00:00 · Latest: 2026-05-26T06:50:45+00:00
Comments: Under review
Abstract
Personalized generation with frozen large language models requires a conditioning signal that is both compact and current. Existing personalization methods typically retrieve or summarize user histories in text, or compress them into static latent profiles and soft prompts. These approaches are efficient, but they treat a user's past behavior as an aggregate profile and therefore mix stable identity, recent drift, and item content in the same representation. We propose LAtent Trajectory Tracking and Extrapolation (LATTE), a framework that represents personalization as forecasting a peer anchored relative preference state. For each historical session, LATTE subtracts a time masked baseline formed from comparable users who responded to the same item, producing a state that measures how the target user differs from peers under a shared item context. A lightweight sequence predictor then forecasts the next state in this trajectory, and a State to Token Bridge injects the forecast into a frozen instruction tuned LLM through a single anchored soft token. We provide a latent factor analysis showing when peer anchoring cancels shared item variation and why temporal forecasting trades off stale averages against noisy recent states. Experiments on Amazon Reviews 2023 and MemoryCD show that LATTE consistently outperforms retrieval, summary memory, static latent profiles, difference aware latent profiles, and soft prompt compression baselines. On Amazon Reviews 2023, LATTE improves average ROUGE-L from 0.219 for a static latent profile and 0.245 for the strongest added latent compression baseline to 0.259. Additional pairwise comparisons and diagnostic analyses suggest that the improvement is mainly due to forecasting user-specific trajectory information, rather than merely adding a soft prompt interface.
Summary / 总结
Personalized generation with frozen large language models requires a conditioning signal that is both compact and current.
How Human-Like Are Large Language Models? A Register-Aware Linguistic Evaluation Framework
Authors: Björn Nieth, Marianna Gracheva, Michaela Mahlberg, Bjoern Eskofier, Emmanuelle Salin
First: 2026-05-22T14:04:25+00:00 · Latest: 2026-05-26T06:10:05+00:00
Comments: 8.5 pages (main) + 31 pages appendix, 29 figures, 10 tables. Code and data: https://github.com/BjoernNieth/Register_Aware_LLMs
Abstract
While factual correctness and task-performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human-like generated texts are on a linguistic level has been underexplored. From a corpus-linguistic perspective, language production is inherently context-dependent, with distinct communicative contexts giving rise to differences in frequencies and co-occurrence patterns of linguistic features. A text failing to adhere to these patterns can be content-wise correct, but still be unfavorable to human readers. In this work, we propose a context-aware evaluation framework in which human-likeness is assessed using a two-sample problem between the linguistic feature distribution of a human reference corpus for a given register and a corresponding LLM-generated corpus. We implement this framework using the Maximum Mean Discrepancy (MMD) and the 67 lexico-grammatical features introduced by Biber, which are commonly applied in corpus linguistics. In our experiments, we compare seven instruction-tuned, open-source models across five English-language datasets spanning distinct registers against a human baseline. While across all tested setups, LLMs deviate from the human baseline, which models are closest to human language depends on the register and is not dictated by model size.
Summary / 总结
While factual correctness and task-performance have been in focus of Large Language Model (LLM) research for a long time, the fundamental question of how human-like generated texts are on a linguistic level has been underexplored.
Model Unlearning Objectives Vary for Distinct Language Functions
Authors: Berk Atil, Vipul Gupta, Rebecca J. Passonneau
First: 2026-05-26T02:10:55+00:00 · Latest: 2026-05-26T02:10:55+00:00
Abstract
Large language models (LLMs) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation. Just as post-training uses different objectives to shape different behaviors, we argue that unlearning methods should be designed for the language function at issue. To study this, we consider two mechanistically distinct unlearning goals, dangerous-knowledge unlearning and toxicity unlearning. For dangerous knowledge, we introduce a cosine-based, meta-learned variant of RMU. For toxicity, we propose a multi-layer objective based on layer-specific probe directions. Across four open-source 7-8B models, our methods achieve strong results, based on distinct training objectives for the two types of unlearning. Overall, our results suggest that unlearning should be studied as a family of problems, analogous to the multiple types of LLM post-training.
Summary / 总结
Large language models (LLMs) learn undesirable properties during pretraining, including dangerous knowledge and toxic text generation.
Internalizing Tool Knowledge in Small Language Models via QLoRA Fine-Tuning
Authors: Yuval Shemla, Ayal Yakobe, Tanmay Agarwal, Dhaval Patel, Kaoutar El Maghraoui
First: 2026-05-18T02:48:46+00:00 · Latest: 2026-05-26T01:46:08+00:00
Abstract
Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models. This paper investigates whether tool-use knowledge can be internalized into small language models through parameter-efficient fine-tuning, enabling structured planning without explicit tool descriptions at inference time. Using AssetOpsBench as the primary benchmark, we fine-tune Gemma 4 E4B and Qwen3-4B with 8-bit QLoRA on approximately 1,700 tool-use examples spanning tool knowledge, question-to-plan mappings, and execution-style traces. We evaluate the resulting models under description-free inference, where the prompt omits the tool catalog entirely. The fine-tuned models outperform an informed unfine-tuned baseline that receives full tool descriptions, reducing input length by 82.6\% while improving structural and LLM-judge planning scores. In the best Gemma run, the model achieves an AT-F1 of 0.65 and an overall judge score of 3.88, compared with 0.47 and 2.88 for the informed baseline. Qwen3-4B achieves a strong overall judge score of 3.78 while using 62\% less memory and running 2.5$\times$ faster than Gemma, though it also exhibits greater catastrophic forgetting on general multiple-choice benchmarks. Additional ablations show that LoRA rank controls a quality--retention trade-off, with $r=32$ maximizing planning quality and smaller ranks preserving more general knowledge. These results suggest that, for fixed tool catalogs, QLoRA fine-tuning can shift tool knowledge from prompt context into model weights, substantially reducing inference overhead while maintaining or improving tool-planning quality.
Summary / 总结
Large language models are increasingly used as planning components in agentic systems, but current tool-use pipelines often require full tool schemas to be included in every prompt, creating substantial token overhead and limiting the practicality of smaller models.
Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization
Authors: Weixin Liu, Bowen Qu, Juming Xiong, Congning Ni, Bradley A. Malin, Zhijun Yin
First: 2026-05-26T01:38:13+00:00 · Latest: 2026-05-26T01:38:13+00:00
Comments: 30 pages, 2 figures; preprint
Abstract
Large language model (LLM) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows. Even when source documents remain access-restricted, derived vectors may be handled under different access controls and still support sensitive-information inference, creating a residual information-disclosure risk. We study this issue in clinical discharge-summary generation as a high-stakes case study, using electronic health record (EHR)-recorded race as a controlled sensitive-label audit. We audit two artifacts that a system might retain or expose to downstream components: the final prompt-token hidden state and the mean-pooled prompt representation. Our results show that reducing recoverability of the case-study sensitive label from one exported artifact does not necessarily reduce recoverability from another. As a mitigation case study, we introduce SurfaceLoRA, an exported-vector-targeted parameter-efficient fine-tuning method that uses a gradient-reversal discriminator attached to a designated exported vector. Under a balanced five-way probing protocol, SurfaceLoRA reduces EHR-recorded race recoverability from the targeted final-token artifact toward chance while preserving summarization utility, yet recoverability remains substantially higher from untargeted pooled artifacts. These findings show that privacy auditing and mitigation should be performed on the exact vector artifact retained or exposed to downstream components.
Summary / 总结
Large language model (LLM) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows.
Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery
Authors: Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly
First: 2026-02-19T18:30:18+00:00 · Latest: 2026-05-25T19:54:05+00:00
Abstract
In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true for cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, where discussions with domain experts and environmental organizations highlight the need to strategically identify high-risk, under-observed regions under tight sampling budgets. More broadly, similar challenges arise in disaster response and public health settings, where dynamic environments make it essential to efficiently uncover hidden targets from limited ground truth. Yet sparse and biased geospatial labels limit the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, capturing how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance from readily available concepts such as land cover and source proximity; and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. We evaluate our framework on PFAS contamination discovery as a real-world inspired environmental monitoring task, demonstrating robust target discovery under limited data and changing conditions.
Summary / 总结
In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs.
Tool Calling is Linearly Readable and Steerable in Language Models
Authors: Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz
Venue: ACL
First: 2026-05-08T16:47:08+00:00 · Latest: 2026-05-25T19:38:06+00:00
Comments: 24 pages. ACL ARR May 2026 submission (EMNLP 2026 preferred venue); v2 reflects revised manuscript
Abstract
When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark $τ$-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough. The same per-tool directions also flag likely errors before they happen: queries where the model is unsure between two tools fail 21x more often than queries where it is not (Gemma 3 27B). This is not just topic injection: random vectors at the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call at top-1 61-89% across five 4B-14B models. Even base models already carry the right tool internally before they can emit it: reading the chosen tool off the model's internal state (cosine readout) recovers 61-82% accuracy on BFCL while base generation lands at 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; on multi-turn agent loops the same intervention is less stable (matched-baseline gain or loss of up to 30 percentage points with no consistent direction).
Summary / 总结
When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed.
Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning
Authors: Taha Koleilat, Hassan Rivaz, Yiming Xiao
Venue: MICCAI 2026
First: 2026-05-25T19:30:22+00:00 · Latest: 2026-05-25T19:30:22+00:00
Comments: MICCAI 2026 Early Accept; Project Page: https://tahakoleilat.github.io/Evi-Steer
Abstract
Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment. This limitation is particularly critical in the clinic, where models should remain robust in low-data regimes and domain shifts. We present Evi-Steer, an evidential cross-modal low-dimensional steering framework for BiomedCLIP that enables uncertainty-aware parameter-efficient fine-tuning while updating only 0.11% of total model parameters. Our approach performs lightweight low-dimensional token updates in both vision and text encoders while simultaneously estimating epistemic uncertainty. These uncertainty estimates update gate residuals, allowing the model to adapt conservatively when evidence is weak. Furthermore, we introduce cross-modal confidence fusion based on Dempster-Shafer theory, enabling visual adaptation to be conditioned on textual confidence and suppressing conflicting or uncertain cross-modal updates. We conduct a comprehensive evaluation on 15 biomedical imaging datasets spanning 8 organs and 8 imaging modalities under few-shot learning and domain generalization settings. Evi-Steer consistently outperforms state-of-the-art methods under few-shot learning and domain shift settings, demonstrating a practical and robust pathway for deploying vision-language models in real-world clinical settings. Code is available at https://github.com/HealthX-Lab/Evi-Steer.
Summary / 总结
Parameter-efficient adaptation of vision-language foundation models is crucial for precise multimodal understanding of biomedical images, yet existing methods remain deterministic and often struggle under domain shift or ambiguous image-text alignment.
Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning
Authors: Jun-Tao Tang, Yu-Cheng Shi, Zhen-Hao Xie, Da-Wei Zhou
First: 2026-05-25T17:59:28+00:00 · Latest: 2026-05-25T17:59:28+00:00
Comments: Code is available at https://github.com/LAMDA-CL/Prism
Abstract
Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT). Despite its growing importance, current MCIT research is hindered by severe engineering bottlenecks. Existing methods are typically implemented by directly modifying the base MLLM codebase, which imposes substantial implementation overhead and yields method-specific architectures that severely limit code reuse and fair comparison. To address this, we introduce Prism, a plug-in reproducible codebase specifically designed for scalable MCIT research. It separates algorithmic development from the backbone implementation via a lightweight plugin registration mechanism, enabling new strategies to be integrated as independent plugins without modifying the underlying MLLM codebase, thereby eliminating structural fragmentation and accelerating method development. Prism natively supports widely used large-scale training pipeline, thereby enabling reproducible and scalable MCIT experimentation. Code is available at https://github.com/LAMDA-CL/Prism.
Summary / 总结
Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning.
Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches
Authors: Amirhossein Yousefiramandi, Ciaran Cooney
First: 2025-12-14T13:02:06+00:00 · Latest: 2026-05-25T16:57:18+00:00
Comments: 20 pages, 5 figures
Abstract
We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pretrained causal LLM and fine-tuning it on the task, using the LLM's final-token embedding as a sequence representation, and (2) instruction-tuning the LLM in a prompt-to-response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two patent benchmarks, a 5-class single-label internal corpus and the public WIPO-Alpha multi-label dataset with 14 categories, show that the embedding-head approach matches or exceeds fine-tuned BERT baselines on single-label classification while training 10-30x fewer parameters. Instruction-tuning is competitive only in the multi-label regime, and only with substantially larger trainable budgets of at least 100M parameters. These results demonstrate that directly leveraging the internal representations of causal LLMs, together with efficient fine-tuning techniques, yields strong classification performance under limited computational resources. We discuss the advantages of each approach and outline practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
Summary / 总结
We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints.
MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models
Authors: Shristi Das Biswas, Kaushik Roy
First: 2026-05-25T16:22:09+00:00 · Latest: 2026-05-25T16:22:09+00:00
Abstract
Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful subsets for multimodal instruction tuning. MAGIC is built on three intrinsic signals extracted from a pretrained VLM: Multimodal Gain, which measures the likelihood improvement obtained from visual input; Bridging Relevance, which captures the sharpness of answer-token grounding over visual tokens; and Skill-Neuron Signatures, which characterize the functional computation elicited by each sample via top-activated feed-forward neurons. MAGIC combines these signals in a three-stage pipeline: filtering low-gain examples, ranking candidates by a normalized quality objective, and performing bucket-wise budget allocation over discrete neuron signatures to preserve latent multimodal skill coverage. This formulation avoids backpropagation, auxiliary selector training, and expensive clustering in continuous activation spaces, while remaining efficient and easily deployable in existing VLMs. Across LLaVA-665K and Vision-Flan datasets, and transfer settings to large target models, LLaVA-1.5-7B and -13B, MAGIC consistently improves over strong baselines under matched 20% budgets: it achieves 100.3% relative performance to full finetuning on LLaVA-665K and 101.6% relative performance on Vision-Flan-186K, while yielding a 73.7% reduction in wall-clock run time.
Summary / 总结
Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors.
On the Communication Complexity of Decentralized Stochastic Bilevel Optimization
Authors: Yihan Zhang, My T. Thai, Jie Wu, Hongchang Gao
First: 2023-11-19T14:56:26+00:00 · Latest: 2026-05-25T16:06:04+00:00
Abstract
Stochastic bilevel optimization finds widespread applications in machine learning, including meta-learning, hyperparameter optimization, and neural architecture search. To extend stochastic bilevel optimization to distributed data, several decentralized stochastic bilevel optimization algorithms have been developed. However, existing methods often suffer from slow convergence rates and high communication costs in heterogeneous settings, limiting their applicability to real-world tasks. To address these issues, we propose two novel decentralized stochastic bilevel gradient descent algorithms based on \textit{simultaneous} and \textit{alternating} update strategies. Our algorithms can achieve faster convergence rates and lower communication costs than existing methods. Importantly, our convergence analyses do not rely on strong assumptions regarding heterogeneity. More importantly, our theoretical analyses clearly disclose how the computation and communication regarding the Hessian-inverse-vector product under the heterogeneous setting affects the convergence rate. To the best of our knowledge, this is the first time such favorable theoretical results have been achieved with mild assumptions in the heterogeneous setting. Furthermore, we demonstrate how to establish the convergence rate for the alternating update strategy when combined with the variance-reduced gradient. Finally, experimental results confirm the efficacy of our algorithms.
Summary / 总结
Stochastic bilevel optimization finds widespread applications in machine learning, including meta-learning, hyperparameter optimization, and neural architecture search.
Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data
Authors: Nicolas Ricka, Gauthier Pellegrin, Denis A. Fompeyrine, Thomas Rohaly, Leah Enders, Heather Roy
First: 2026-05-25T15:13:45+00:00 · Latest: 2026-05-25T15:13:45+00:00
Comments: Submitted to a peer-reviewed journal, comments welcome
Abstract
Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts. Current clinical assessments of PTSD often rely on subjective evaluations, which can be time-consuming, costly, and prone to human bias. This study proposes a machine learning (ML) approach based on multivariate kernel density estimation (MKDE) technique for the objective evaluation of PTSD severity. We collected heart rate (HR) and galvanic skin response (GSR) signals as well as PTSD Checklist - Military Version (PCL-M) labels from 21 participants during an immersive simulation. A fear-response model was trained on a public arachnophobia dataset, and predictive features of PTSD were extracted from the fear-response curves estimated on the military dataset. The model achieved an accuracy of 86\% in classifying PTSD status, effectively distinguishing participants with and without PTSD (PCL-M threshold of 36). The average mean absolute error (MAE) of the models is 5.6, and it estimated a clinical PTSD severity scale with a mean absolute percentage error of 17\%. Our algorithm demonstrates promising potential for enhancing estimation of PTSD severity and followup by offering an objective and low-effort evaluation approach using physiology. These findings suggest clinical utility in both screening and follow-up settings.
Summary / 总结
Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts.
Co-folding model guided by structural proteomics
Authors: Alon Shtrikman, Nitzan Simchi, Michal Ran Shchory, Sagie Brodsky, Eran Seger, Kirill Pevzner
First: 2026-05-25T14:54:08+00:00 · Latest: 2026-05-25T14:54:08+00:00
Abstract
Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, critical for protein design and induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross-Linking Mass Spectrometry (XL-MS) and Hydrogen-Deuterium Exchange (HDX-MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS-Fold, an inference-time guided-diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL-MS spatial restraints and HDX-MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS-Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state-of-the-art models like Boltz-2. This establishes our framework as a powerful, integrative computational approach for the structure based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.
Summary / 总结
Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, critical for protein design and induced proximity modalities such as antibodies and PROTACs.
Transformer-based few-shot learning for modeling Electricity Consumption Profiles with minimal data across thousands of domains
Authors: Weijie Xia, Gao Peng, Chenguang Wang, Peter Palensky, Eric Pauwels, Pedro P. Vergara
Venue: International Journal of Electrical Power & Energy Systems, Volume/Issue (February 2026), Article 111575
First: 2024-08-15T19:54:53+00:00 · Latest: 2026-05-25T14:30:10+00:00
Abstract
Electricity Consumption Profiles (ECPs) are crucial for operating and planning power distribution systems, especially with the increasing number of low-carbon technologies such as solar panels and electric vehicles. Traditional ECP modeling methods typically assume the availability of sufficient ECP data. However, in practice, the accessibility of ECP data is limited due to privacy issues or the absence of metering devices. Few-shot learning (FSL) has emerged as a promising solution for ECP modeling in data-scarce scenarios. Nevertheless, standard FSL methods, such as those used for images, are unsuitable for ECP modeling because (1) these methods usually assume several source domains with sufficient data and several target domains. However, in the context of ECP modeling, there may be thousands of source domains, e.g., households with a moderate amount of data, and thousands of target domains, e.g., households that ECP are required to be modeled. (2) Standard FSL methods usually involve cumbersome knowledge transfer mechanisms, such as pre-training and fine-tuning. To address these limitations, this paper proposes a novel FSL framework that integrates Transformers with Gaussian Mixture Models (GMMs) for ECP modeling. The proposed approach is fine-tuning-free, computationally efficient, and robust even with extremely limited data. Results show that our method can accurately restore the complex ECP distribution with a minimal amount of ECP data (e.g., only 1.6% of the complete domain dataset) and outperforms state-of-the-art time series modeling methods in the context of ECP modeling.
Summary / 总结
Electricity Consumption Profiles (ECPs) are crucial for operating and planning power distribution systems, especially with the increasing number of low-carbon technologies such as solar panels and electric vehicles.
When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills
Authors: Yunfei Wang, Xiaohao Xu, Yang Li, Xiaonan Huang
First: 2026-05-25T13:29:45+00:00 · Latest: 2026-05-25T13:29:45+00:00
Comments: 20 pages, 8 figures
Abstract
Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills morphology-search traces into an explicit natural-language skill library. Each skill stores a structural archetype, evidence-grounded positive and negative rules, and the evaluated designs that support them, making design memory inspectable rather than implicit in a population. During search, the agent retrieves skills to condition LLM edits of elite bodies while retaining a Genetic Algorithm (GA) mutation path for exploration; after evaluation, it updates the library through Add, Diagnose, and Merge. Across seven EvoGym tasks spanning locomotion, traversal, and object interaction, Auto-Robotist improves cold-start 5x5 search and transfers learned skills to 10x10 design spaces, where reference-conditioned transfer outperforms GA on every task. These results suggest that LLM agents can convert expensive physical evaluations into reusable, auditable design principles. Our code will be released upon acceptance.
Summary / 总结
Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge.
RotMoLE: Enhancing Mixture of Low-Rank Experts through Rotational Gating Mechanism
Authors: Mengyang Sun, Maochuan Dou, Tao Feng, Dan Zhang, Yihao Wang, Junpeng Liu, Yifan Zhu, Jie Tang
First: 2026-05-25T08:18:36+00:00 · Latest: 2026-05-25T08:18:36+00:00
Abstract
While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging. Meanwhile, Mixture-of-Experts (MoE) architecture has risen as a crucial paradigm for training LLMs, and some recent works have also incorporated MoE into Parameter-Efficient Fine-Tuning (PEFT) to propose the Mixture of Low-rank Experts (MoE-LoRA), to enhance the power of low-rank adapters for learning complicated knowledge. However, conventional gating mechanisms in MoE typically apply only a scalar reweighing to selected experts, thereby limiting their underlying capacity of representation and generalization. Motivated and enabled by the low-rank structures in MoE-LoRA, we propose RotMoLE, a specialized MoE framework for low-rank experts featuring an additional rotation gate. Beyond simple scaling, RotMoLE implements a rotation mechanism for each selected expert, enabling superior expert exploitation and specialization for learning diverse data, especially when expert candidates are limited. Empirical results on complex multi-task and multilingual training scenarios validate our effectiveness.
Summary / 总结
While Large Language Models (LLMs) are commonly fine-tuned to handle domain-specific tasks before being applied to vertical applications, adapting them to complex scenarios with diverse specialized knowledge remains challenging.
SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models
Authors: Khalid Yusuf Dahir
First: 2026-05-25T04:45:44+00:00 · Latest: 2026-05-25T04:45:44+00:00
Comments: 12 pages, 3 figures, 4 tables. Code: https://github.com/khaledyusuf44/somalibench_eval Dataset: https://huggingface.co/datasets/khaledyusuf44/somalibench-v0
Abstract
Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.
Summary / 总结
Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally.