Edge-specific signal propagation on mature chromophore-region 3D mechanism graphs for fluorescent protein quantum-yield prediction
Authors: Yuchen Xiong, Swee Keong Yeap, Steven Aw Yoong Kit
First: 2026-05-07T17:51:41+00:00 · Latest: 2026-05-11T16:18:20+00:00
Comments: Includes appendix; source code, processed feature tables and evaluation scripts are available from the first author upon reasonable request
Abstract
Fluorescent protein quantum yield (QY) is governed by the mature chromophore and its three-dimensional microenvironment rather than sequence identity alone. Protein language models and emission-band averages capture global trends, but do not model how local physical signals act on specific chromophore regions.
We present a chromophore-centred mechanism graph algorithm for QY prediction. Each PDB structure is converted into a typed 3D residue graph, registered to a mature-CRO state, partitioned into phenolate, bridge and imidazolinone regions, and transformed by channel-signal-region propagation. The representation contains 121 enrichment features; after removing identity shortcuts, 52 non-identity features are used for band-specific ExtraTrees regression. Because each feature encodes a contact channel, seed signal and target CRO region, interpretation is intrinsic rather than post hoc. On a 531-protein benchmark, the method achieved the best random-CV performance among model-based baselines (R = 0.772 +/- 0.008, MAE = 0.131 +/- 0.002), exceeding Band mean (R = 0.632), ESM-C (R = 0.734) and SaProt (R = 0.731), and ranked first in bright screening (Bright P@5 = 0.704). Under homology control, the advantage was clearest in the remote bucket (<50% similarity; R = 0.697 versus 0.633, 0.575 and 0.408), with the strongest overall bright/dark Top-K screening. Stable selected features recovered band-specific mechanisms: aromatic packing and clamp asymmetry in GFP-like proteins, charge/clamp balance in Red proteins, and flexibility-risk/bulky-contact features in Far-red proteins.
Source code, feature tables and evaluation scripts are available from the first author upon request. Contact: yuchenak05@gmail.com
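The Bright P@5 figure reported above is a top-K screening metric. As a minimal sketch of how such a precision-at-K score could be computed, the function below ranks proteins by predicted QY and counts how many of the top K are truly bright. The function name and the 0.6 brightness cutoff are illustrative assumptions, not the authors' definition.

```python
def precision_at_k(predicted, measured, k=5, bright_cutoff=0.6):
    """Fraction of the k highest-predicted proteins that are truly bright.

    bright_cutoff is a hypothetical threshold chosen for illustration only.
    """
    if k <= 0 or not predicted or len(predicted) != len(measured):
        raise ValueError("k must be positive and inputs must be non-empty and aligned")
    # Rank protein indices by predicted QY, highest first.
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    top = ranked[:k]
    # Count how many of the selected proteins meet the measured-QY cutoff.
    hits = sum(1 for i in top if measured[i] >= bright_cutoff)
    return hits / len(top)
```

A dark-screening variant would, under the same assumptions, rank in ascending order and count proteins below a darkness cutoff.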
Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
Authors: Jeonghye Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
First: 2026-05-11T16:16:00+00:00 · Latest: 2026-05-11T16:16:00+00:00
Abstract
Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition
Authors: David F. Ramirez, Tim L. Overman, Kristen Jaskie, Marv Kleine, Andreas Spanias
Venue: Proc. SPIE 13463, Automatic Target Recognition XXXV, 134630D (29 May 2025)
First: 2026-05-11T16:05:58+00:00 · Latest: 2026-05-11T16:05:58+00:00
Comments: Accepted to SPIE Defense + Commercial Sensing, Automatic Target Recognition XXXV
Abstract
Large language-vision models (LLVM), such as OpenAI's ChatGPT and GPT-4, have gained prominence as powerful tools for analyzing text and imagery. The merging of these data domains represents a significant paradigm shift with far-reaching implications for automatic target recognition (ATR). Recent transformer-based LLVM research has shown substantial improvements for geospatial perception tasks. Our study examines the application of LLVM to remote sensing image captioning and visual question-answering (VQA), with a specific focus on synthetic aperture radar (SAR) imagery. We examine newly published LLVM methods, including CLIP and LLaVA neural network transformer architectures. We have developed a work-in-progress SAR training and evaluation benchmark derived from the MSTAR Public Dataset. This has been extended to include descriptive text captions and question-answer pairs for VQA tasks. This challenge dataset is designed to push the boundaries of an LLVM in identifying nuanced ATR details in SAR imagery. Utilizing parameter-efficient fine-tuning, we train an LLVM method to identify fine-grained target qualities at 98% accuracy. We detail our data setup and experiments, addressing potential pitfalls that could lead to misleading conclusions. Accurately identifying and differentiating military vehicle types in SAR data poses a critical challenge, especially under complex environmental conditions. Mastering this target recognition skill may require a human analyst months of training and years of practice. This research represents a unique effort to apply LLVM to SAR applications, advancing machine-assisted remote sensing ATR for military and intelligence contexts.
Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
Authors: Tao Hu, Da-Wei Zhou
First: 2026-05-11T15:59:06+00:00 · Latest: 2026-05-11T15:59:06+00:00
Abstract
Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.
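The task-label-free generator selection described above can be pictured as nearest-prototype routing. The sketch below assumes one stored prototype embedding per task and picks the task whose prototype has the highest cosine similarity to the input embedding. Plain Python lists stand in for CLIP features, and the function names are hypothetical; the abstract does not specify the actual routing rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route(embedding, prototypes):
    """Return the task id whose prototype is most similar to the embedding.

    prototypes: dict mapping a task id to its prototype vector (assumed layout).
    """
    return max(prototypes, key=lambda task: cosine(embedding, prototypes[task]))
```

Under this assumption, the selected task id would then index the prompt generator to use for that query-image pair.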
bViT: Investigating Single-Block Recurrence in Vision Transformers for Image Recognition
Authors: Michal Byra, Pawel Olszowiec, Grzegorz Stefanski, Grzegorz Gruszczynski, Alberto Presta
First: 2026-05-11T14:43:36+00:00 · Latest: 2026-05-11T14:43:36+00:00
Comments: 31 pages, 16 figures
Abstract
Vision Transformers (ViTs) are built by stacking independently parameterized blocks, but it remains unclear how much of this depth requires layer-specific transformations and how much can be realized through recurrent computation. We study this question with bViT, a single-block recurrent ViT in which one transformer block is applied repeatedly to process an image. This architecture preserves the iterative structure of a deep ViT while removing layer-specific block parameterization, providing a controlled setting for studying recurrence in vision. On ImageNet-1K, a 12-step bViT-B achieves accuracy comparable to standard ViT-B under the same training recipe and computational budget, while using an order of magnitude fewer parameters. We observe that recurrent performance improves with representation width, with wider bViTs recovering much more of the performance of standard ViTs than narrow variants. We interpret this behavior as implicit depth multiplexing, where a shared block expresses multiple step-dependent computations through the evolving hidden state. Beyond ImageNet classification, bViT transfers competitively to downstream tasks and enables parameter-efficient fine-tuning. Mechanistic analyses of activations, attention and step-specific pruning show that the shared block changes its effective behavior across recurrent steps rather than simply repeating the same computation. Our results suggest that a large fraction of ViT depth can be implemented through recurrent reuse, provided that the representation space is sufficiently wide.
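The difference between stacking blocks and reusing one block can be illustrated with a toy example. The sketch below is not a transformer: `make_block` is a hypothetical elementwise residual block with two scalar parameters, used only to show that the recurrent forward pass applies the same parameters at every step while the stacked one holds a separate set per layer.

```python
def make_block(weight, bias):
    """Toy residual block: x + relu(weight * x + bias), applied elementwise."""
    def block(x):
        return [v + max(0.0, weight * v + bias) for v in x]
    return block

def stacked_forward(x, blocks):
    # Standard design: each layer has its own block and its own parameters.
    for block in blocks:
        x = block(x)
    return x

def recurrent_forward(x, block, steps):
    # Single-block recurrence: the same block is reused at every step.
    for _ in range(steps):
        x = block(x)
    return x
```

With 12 steps, the stacked variant stores 12 parameter sets and the recurrent variant stores one, which is where the parameter saving in the block stack comes from in this toy setting.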
Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
Authors: Andreas Waldis, Yotam Perlitz, Leshem Choshen, Yufang Hou, Iryna Gurevych
First: 2024-04-29T17:58:36+00:00 · Latest: 2026-05-11T14:37:50+00:00
Abstract
We introduce Holmes, a new benchmark designed to assess the linguistic competence of language models (LMs), that is, their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version that reduces the computation load while maintaining high-ranking precision.
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Authors: Krishak Aneja, Manas Mittal, Anmol Goel, Ponnurangam Kumaraguru, Vamshi Krishna Bonagiri
First: 2026-05-11T14:21:57+00:00 · Latest: 2026-05-11T14:21:57+00:00
Comments: 20 pages, 9 figures including appendix
Abstract
Fine-tuning Large Language Models (LLMs) on benign narrow data can sometimes induce broad harmful behaviors, a vulnerability termed emergent misalignment (EM). While prior work links these failures to specific directions in the activation space, their relationship to the model's broader persona remains unexplored. We map the latent personality space of LLMs through established psychometric profiles like the Big Five, Dark Triad, and LLM-specific behaviors (e.g. evil, sycophancy), and show that the semantic geometry is highly stable across aligned models and their corrupted fine-tunes. Through causal interventions, we find that directions isolating social valence, such as the 'Evil' persona vector, and a Semantic Valence Vector (SVV) that we introduce, function as intrinsic guardrails: ablating them drives the misalignment rates above 40%, while amplifying them suppresses the failure mode to less than 3%. Leveraging the structural stability of the personality space, we show that vectors extracted a priori from an instruct-tuned model transfer zero-shot to successfully regulate EM in corrupted fine-tunes. Overall, our findings suggest that harmful fine-tuning does not overwrite a model's internal representation of personality, allowing conserved representations to serve as robust, cross-distribution guardrails.
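Ablating or amplifying a direction in activation space is commonly done with a projection and an additive shift. The sketch below shows that generic form on plain Python lists; the abstract does not state the exact intervention used, so the function names, the unit-normalization step and the `alpha` scale are assumptions made for illustration.

```python
import math

def _unit(direction):
    """Normalize a direction vector to unit length."""
    norm = math.sqrt(sum(a * a for a in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in direction]

def ablate(h, direction):
    """Remove the component of activation h that lies along direction."""
    u = _unit(direction)
    coeff = sum(a * b for a, b in zip(h, u))
    return [a - coeff * b for a, b in zip(h, u)]

def amplify(h, direction, alpha):
    """Shift activation h by alpha units along direction."""
    u = _unit(direction)
    return [a + alpha * b for a, b in zip(h, u)]
```

After `ablate`, the activation has zero dot product with the direction, which is the property a causal ablation test relies on.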
VISTA: A Generative Egocentric Video Framework for Daily Assistance
Authors: Yu-Hsiang Liu, Yu-Chien Tang, An-Zi Yen
First: 2026-05-11T13:50:47+00:00 · Latest: 2026-05-11T13:50:47+00:00
Comments: pre-print
Abstract
Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes into explicit and implicit types. In explicit proactive scenarios, the user is aware of needing help but does not directly address the agent. In implicit proactive scenarios, the agent intervenes before the user even realizes that help is needed. VISTA allows users to customize and refine scenarios to generate video benchmarks for daily tasks, offering a scalable and controllable alternative to real-world data collection for training and evaluating AI agents in realistic environments.
Prompt Estimation from Prototypes for Federated Prompt Tuning of Vision Transformers
Authors: M Yashwanth, Sharannya Ghosh, Aditay Tripathi, Anirban Chakraborty
First: 2025-10-29T10:42:56+00:00 · Latest: 2026-05-11T12:40:44+00:00
Comments: Accepted to TMLR 2026
Abstract
Visual Prompt Tuning (VPT) of pre-trained Vision Transformers (ViTs) has proven highly effective as a parameter-efficient fine-tuning technique for adapting large models to downstream tasks with limited data. Its parameter efficiency makes it particularly suitable for Federated Learning (FL), where both communication and computation budgets are often constrained. However, global prompt tuning struggles to generalize across heterogeneous clients, while personalized tuning overfits to local data and lacks generalization. We propose PEP-FedPT (Prompt Estimation from Prototypes for Federated Prompt Tuning), a unified framework designed to achieve both generalization and personalization in federated prompt tuning of ViTs. Within this framework, we introduce the novel Class-Contextualized Mixed Prompt (CCMP), based on class-specific prompts maintained alongside a globally shared prompt. For each input, CCMP adaptively combines class-specific prompts using weights derived from global class prototypes and client class priors. This approach enables per-sample prompt personalization without storing client-dependent trainable parameters. The prompts are collaboratively optimized via the traditional federated averaging technique. Comprehensive evaluations on CIFAR-100, TinyImageNet, DomainNet, and iNaturalist datasets demonstrate that PEP-FedPT consistently surpasses the state-of-the-art baselines under diverse data heterogeneity scenarios, establishing a strong foundation for efficient and generalizable federated prompt tuning of Vision Transformers.
Learning to Learn the Macroscopic Fundamental Diagram using Physics-Informed and meta Machine Learning techniques
Authors: Amalie Roark, Serio Agriesti, Francisco Camara Pereira, Guido Cantelmo
Venue: Learning to learn the macroscopic fundamental diagram using physics-informed and model agnostic machine learning. Transportation Research Part C: Emerging Technologies, 2026, 189, 105707
First: 2025-08-19T12:59:58+00:00 · Latest: 2026-05-11T11:59:47+00:00
Comments: Version accepted for publication in Transportation Research Part C (before proof-reading)
Abstract
The Macroscopic Fundamental Diagram (MFD) is a popular tool used to describe traffic dynamics in an aggregated way, with applications ranging from traffic control to incident analysis. However, estimating the MFD for a given network requires large numbers of loop detectors, which are not always available in practice. This article proposes a framework to alleviate the data scarcity challenge by harnessing Meta-Learning, a subcategory of Machine Learning that trains models to understand and adapt to new tasks on their own. We use Meta-Learning to identify and exploit transferable patterns from data-rich cities to cities where not enough data is available to estimate the MFD. The developed model is trained and tested by leveraging data from multiple cities and exploiting it to model the MFD of other cities with different shares of detectors and topological structures. The proposed Meta-Learning framework is applied to an ad-hoc Multi-Task Physics-Informed Neural Network, specifically designed to estimate the MFD. Results show an average MAE improvement in flow prediction of around 50% across cities (depending on the subset of loop detectors tested). The Meta-Learning framework thus successfully generalises across diverse urban settings and improves performance on cities with limited data, demonstrating the potential of using Meta-Learning when a limited number of detectors is available. We directly test this assumption by applying the Meta-Learning outputs to unseen cities to simulate a real-life application scenario and the wide applicability of the proposed methodology. Finally, the proposed framework is validated against traditional Transfer Learning approaches and tested with FitFun, a model for FD estimation from the literature, to prove its transferability.
EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant
Authors: Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao, Junxi Wang, Xuyang Liu, Linfeng Zhang
First: 2026-05-11T10:43:59+00:00 · Latest: 2026-05-11T10:43:59+00:00
Comments: 33 pages, 9 figures
Abstract
Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observed that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories without external supervision. With only 1,000 self-generated samples (139× fewer than the leading streaming instruction-tuning approach) and no architectural changes, EvoStreaming consistently improves the overall RealStreamEval score by up to 10.8 points across five open VideoLLM backbones (Qwen2/2.5/3-VL, InternVL-3.5, MiniCPM-V4.5) while largely preserving offline video performance. These results suggest that data-efficient interaction tuning is a practical path for adapting existing VideoLLMs to streaming assistants.
LPT: Less-overfitting Prompt Tuning for Vision-Language Model
Authors: Chenhao Ding, Xinyuan Gao, Songlin Dong, Jizhou Han, Qiang Wang, Zhengdong Zhou, Yuhang He, Yihong Gong
First: 2024-10-14T08:06:21+00:00 · Latest: 2026-05-11T10:40:10+00:00
Abstract
Vision-language models (VLMs) have demonstrated exceptional generalization capabilities for downstream tasks. Owing to its efficiency, prompt learning has gradually become a preferred method for transferring VLMs to downstream tasks, surpassing traditional finetuning methods. However, during the transfer process, these models are prone to severe overfitting, leading to a significant decline in generalization ability. To address this issue, we propose a framework named LPT, specifically designed for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that may lead to overfitting, thereby guiding the prompts with basic visual concepts. Additionally, to further mitigate overfitting, we have developed a Structural Preservation (SP) constraint at the feature level, which aligns the model's overall feature space structure with the frozen CLIP, endowing the feature space with overall plasticity and enabling effective reshaping of the feature space during optimization. Moreover, we employ a Hierarchical Logit (HL) constraint at the output layer to constrain the overall class information in the output, complementing the role of SP at the output end. Extensive experiments across various benchmarks (from base-to-novel, cross-dataset transfer, and domain generalization) demonstrate that our approach significantly improves generalization capability and effectively alleviates overfitting compared to state-of-the-art methods.
MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning
Authors: Guanglong Sun, Hongwei Yan, Liyuan Wang, Zhiqi Kang, Shuang Cui, Hang Su, Jun Zhu, Yi Zhong
First: 2026-02-08T12:15:35+00:00 · Latest: 2026-05-11T10:23:54+00:00
Abstract
To cope with uncertain changes of the external world, intelligent systems must continually learn from complex, evolving environments and respond in real time. This ability, collectively known as general continual learning (GCL), encapsulates practical challenges such as online datastreams and blurry task boundaries. Although leveraging pretrained models (PTMs) has greatly advanced conventional continual learning (CL), these methods remain limited in reconciling the diverse and temporally mixed information along a single pass, resulting in sub-optimal GCL performance. Inspired by meta-plasticity and reconstructive memory in neuroscience, we introduce here an innovative approach named Meta Post-Refinement (MePo) for PTMs-based GCL. This approach constructs pseudo task sequences from pretraining data and develops a bi-level meta-learning paradigm to refine the pretrained backbone, which serves as a prolonged pretraining phase but greatly facilitates rapid adaptation of representation learning to downstream GCL tasks. MePo further initializes a meta covariance matrix as the reference geometry of pretrained representation space, enabling GCL to exploit second-order statistics for robust output alignment. MePo serves as a plug-in strategy that achieves significant performance gains across a variety of GCL benchmarks and pretrained checkpoints in a rehearsal-free manner (e.g., 15.10%, 13.36%, and 12.56% on CIFAR-100, ImageNet-R, and CUB-200 under Sup-21/1K). Our source code is available at https://github.com/SunGL001/MePo
Extending Confidence-Based Text2Cypher with Grammar and Schema Aware Filtering
Authors: Makbule Gulcin Ozsoy
First: 2026-05-11T10:18:13+00:00 · Latest: 2026-05-11T10:18:13+00:00
Abstract
Large language models (LLMs) allow users to query databases using natural language by translating questions into executable queries. Despite strong progress on tasks such as Text2SQL, Text2SPARQL, and Text2Cypher, most existing methods focus on better prompting, fine-tuning, or iterative refinement. However, they often do not explicitly enforce structural constraints, such as syntactic validity and schema consistency. This can reduce reliability, since generated queries must satisfy both syntax rules and database schema constraints to be executable. In this work, we study how structured constraints can be used in test-time inference for Text2Cypher. We focus on post-generation validation to improve query correctness. We extend a confidence-based inference framework with a sequential filtering process that combines confidence scoring, grammar validation, and schema constraints before final aggregation. This lets us analyze how different constraint types affect generated queries. Our experiments with two instruction-tuned models show that grammar-based filtering improves syntactic validity. Schema-aware filtering further improves execution quality by enforcing consistency with the database structure. However, stronger filtering also increases the number of empty predictions and reduces execution coverage. Overall, we show that adding simple structural checks at test time improves the reliability of Text2Cypher generation, and we provide a clearer view of how syntax and schema constraints contribute differently.
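The sequential filtering described above (confidence scoring, then grammar validation, then schema constraints, then aggregation) can be sketched as a short pipeline. Everything concrete here is a hypothetical stand-in: the regex is not a Cypher grammar, the label set is not a real schema, the 0.5 threshold is arbitrary, and the confidence-weighted vote is only one possible aggregation rule.

```python
import re
from collections import Counter

# Hypothetical stand-ins: a real system would use a Cypher parser and the
# actual database schema instead of this regex and a plain label set.
_MATCH_RETURN = re.compile(r"^\s*MATCH\b.+\bRETURN\b.+", re.IGNORECASE | re.DOTALL)
_NODE_LABEL = re.compile(r"\(\s*\w*\s*:\s*(\w+)")

def grammar_ok(query):
    """Toy syntax check: the query must look like MATCH ... RETURN ..."""
    return bool(_MATCH_RETURN.match(query))

def schema_ok(query, known_labels):
    """Toy schema check: every node label used must exist in the schema."""
    return all(label in known_labels for label in _NODE_LABEL.findall(query))

def select_query(candidates, known_labels, min_confidence=0.5):
    """candidates: list of (query, confidence). Returns a query or None."""
    kept = [(q, c) for q, c in candidates if c >= min_confidence]
    kept = [(q, c) for q, c in kept if grammar_ok(q)]
    kept = [(q, c) for q, c in kept if schema_ok(q, known_labels)]
    if not kept:
        return None  # empty prediction: every candidate was filtered out
    votes = Counter()
    for q, c in kept:
        votes[q] += c  # confidence-weighted vote over identical queries
    return votes.most_common(1)[0][0]
```

The `None` branch mirrors the trade-off noted in the abstract: each additional filter can only shrink the candidate set, so stricter checks produce more empty predictions.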
ProteinOPD: Towards Effective and Efficient Preference Alignment for Protein Design
Authors: Yulin Zhang, He Cao, Zihao Jiang, Chenyi Zi, Zhipeng Zhou, Zijing Liu, Yu Li, Jia Li, Ziqi Gao
First: 2026-05-11T08:38:51+00:00 · Latest: 2026-05-11T08:38:51+00:00
Abstract
Designing proteins with desired functions or properties represents a core goal in synthetic biology and drug discovery. Recent advances in protein language models (PLMs) have enabled the generation of highly designable protein sequences, while preference alignment provides a promising way to steer designs toward desired functions and properties. Nevertheless, they often trigger catastrophic forgetting of pretrained knowledge, degrading basic designability and failing to balance multiple competing objectives. To address these issues, we draw inspiration from On-Policy Distillation (OPD), an advanced post-training method renowned for mitigating catastrophic forgetting through its mode-seeking nature. In this work, we propose ProteinOPD, a multi-objective preference alignment framework that can effectively balance multiple preference objectives while maintaining the inherent designability of PLMs. ProteinOPD adapts a pretrained PLM into preference-specific teachers and distills their knowledge into a shared student via token-level OPD on the student's own trajectories. During this process, the student is aligned to a unique normalized geometric consensus of weighted teachers while ensuring bounded optimization under conflicts. This bridges the gap for OPD in multi-objective/teacher alignment. Extensive experiments show that ProteinOPD achieves substantial gains on target preference objectives without compromising the designability, with an 8x training speedup over RL-based alignment competitors.
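One plausible reading of a "normalized geometric consensus of weighted teachers" is a weighted geometric mean of the teachers' token distributions, renormalized to sum to one. The abstract does not give the formula, so the sketch below is an assumption: it computes a target proportional to the product of each teacher's probability raised to its weight, working in log space for numerical stability.

```python
import math

def geometric_consensus(teacher_probs, weights):
    """Normalized weighted geometric mean of teacher distributions.

    teacher_probs: list of probability lists over the same vocabulary.
    weights: one non-negative weight per teacher (assumed interface).
    """
    if not teacher_probs or len(teacher_probs) != len(weights):
        raise ValueError("need one weight per teacher distribution")
    vocab_size = len(teacher_probs[0])
    log_scores = []
    for token in range(vocab_size):
        score = 0.0
        for probs, weight in zip(teacher_probs, weights):
            if weight == 0:
                continue  # a zero-weight teacher has no say
            if probs[token] <= 0:
                score = float("-inf")  # any weighted teacher can veto a token
                break
            score += weight * math.log(probs[token])
        log_scores.append(score)
    peak = max(log_scores)
    if peak == float("-inf"):
        raise ValueError("teachers share no token with non-zero probability")
    # Subtract the peak before exponentiating to avoid underflow.
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]
```

Under this form, two equally weighted teachers that disagree symmetrically cancel out, while raising one teacher's weight pulls the consensus toward that teacher.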
A Scalable Entity-Based Framework for Auditing Bias in LLMs
Authors: Akram Elbouanani, Aboubacar Tuo, Adrian Popescu
First: 2026-01-18T12:07:31+00:00 · Latest: 2026-05-11T08:27:39+00:00
Abstract
Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying either on artificial prompts that poorly reflect real-world use or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework that uses named entities as controlled probes to measure systematic disparities in model behavior. Synthetic data enables us to construct diverse, controlled inputs, and we show that it reliably reproduces bias patterns observed in natural text, supporting its use for large-scale analysis. Using this framework, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. We find consistent patterns: models penalize right-wing politicians and favor left-wing politicians, prefer Western and wealthier countries over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not mitigate Western-aligned preferences. These findings highlight the need for systematic bias auditing before deploying LLMs in high-stakes applications. Our framework is extensible to other domains and tasks, and we make it publicly available to support future work.
SMOG: Scalable Meta-Learning for Multi-Objective Bayesian Optimization
Authors: Leonard Papenmeier, Petru Tighineanu
First: 2026-01-29T18:51:58+00:00 · Latest: 2026-05-11T07:32:08+00:00
Comments: 29 pages, 18 figures
Abstract
Multi-objective optimization aims to solve problems with competing objectives. Evaluating such problems is often slow or expensive, limiting the budget of evaluations. In many applications, historical data from related optimization tasks is available and can be leveraged via meta-learning to accelerate optimization. Bayesian optimization, as a promising technique for expensive black-box problems, has been extended independently to meta-learning and multi-objective optimization, but methods that simultaneously address both settings remain largely unexplored. We propose SMOG, a scalable and modular meta-learning model based on a multi-output Gaussian process that explicitly learns correlations between objectives. SMOG builds a structured joint Gaussian process prior across meta- and target tasks and, after conditioning on metadata, yields a closed-form prior for the target task. This construction propagates metadata uncertainty into the target surrogate in a principled way. SMOG supports hierarchical, parallel training, achieving linear scaling with the number of meta-tasks. The resulting surrogate integrates seamlessly with standard multi-objective Bayesian optimization acquisition functions. We demonstrate that our method is consistently competitive, delivering strong data efficiency across representative benchmarks and applications.
Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning
Authors: Yuna Haseyama, Tomoki Ito, Hiroki Sakaji, Itsuki Noda
First: 2026-05-11T05:49:09+00:00 · Latest: 2026-05-11T05:49:09+00:00
Abstract
In high-stakes domains such as healthcare, the reliability of Large Language Models (LLMs) is critical, particularly when generating clinical insights from incident reports. This study proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from details of the medical incidents. For our experiments, we use the Japanese Medical Incident Dataset (JMID), a structured dataset of 3,884 real-world medical accident and near-miss reports. These reports are variably annotated with a wide range of tags--some include descriptive information (e.g., "medications," "blood transfusion therapy"). We compare three few-shot example selection strategies--random sampling, cosine similarity-based selection, and our proposed tag-based method--using GPT-4o and LLaMA 3.3. Results show that the tag-based approach achieves the highest precision and most stable generation behavior, while similarity-based selection often leads to unintended outputs and safety filter activation. These findings suggest that selecting examples based on human-interpretable dataset tags can improve generation precision and stability in clinical LLM applications.
Selective LoRA for Visual Tokens and Attention Heads
Authors: Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee
First: 2025-12-22T10:02:10+00:00 · Latest: 2026-05-11T05:06:06+00:00
Abstract
Low-rank adaptation (LoRA) is widely used for parameter-efficient fine-tuning, but its standard all-token, all-head design ignores the heterogeneous structure of vision language model (VLM) inputs. We introduce Image-LoRA, a vision-oriented PEFT recipe that views LoRA as a token-level residual update and applies this update only to visual tokens. Image-LoRA further restricts adaptation to the value path of a compact subset of attention heads, selected using a one-pass influence estimate from a rank-1 visual-token-only probe. This token-, head-, and value-selective design reduces trainable parameters and adapter-only training FLOPs while leaving the pure-text forward pass of the frozen backbone unchanged when no visual tokens are present. Across visual localization benchmarks with controlled text:image token ratios, Image-LoRA matches or closely approaches standard LoRA, while showing especially favorable trade-offs in image-token-heavy regimes. We further validate its generality on TextVQA and VideoQA, verify pure-text preservation on GSM8K, and show on ViLP that a stronger information bottleneck can yield gains over standard LoRA.
Representative Action Selection for Large Action Space Bandit Families
Authors: Quan Zhou, Mark Kozdoba, Shie Mannor
First: 2025-05-23T18:08:57+00:00 · Latest: 2026-05-11T02:35:33+00:00
Abstract
We study the problem of selecting a subset from a large action space shared by a family of bandits. In many natural situations, while the nominal set of actions is large, actions are highly correlated: many yield similar rewards across environments, making it wasteful to maintain the full set. Our aim is to understand whether it is possible -- and how -- to select a smaller set of representative actions that performs nearly as well as the full action space.
Our main contribution is a surprisingly simple algorithm: repeatedly sample a bandit instance at random, solve it, and collect the optimal action. This algorithm can significantly reduce the action space when such correlations are present, without the need to know a-priori the correlation structure. We provide theoretical guarantees on the performance of the algorithm and demonstrate its practical effectiveness through empirical comparisons with Combinatorial Bandit, Meta Learning Bandit and Zooming baselines.
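The sample-solve-collect loop described above can be sketched in a few lines. This is only an illustration under strong assumptions, not the authors' implementation: each sampled instance is assumed to expose its true mean rewards, so "solving" it reduces to an argmax, and the toy family (`make_clustered_family`, with reward peaks at three fixed positions) is hypothetical.

```python
import random


def make_clustered_family(num_actions, peaks):
    """Hypothetical bandit family: each instance's reward peaks at one of `peaks`."""
    def sample_instance(rng):
        peak = rng.choice(peaks)
        # Mean reward of action a falls off with its distance from the peak.
        return [-abs(a - peak) for a in range(num_actions)]
    return sample_instance


def select_representative_actions(sample_instance, num_samples, rng):
    """Repeatedly sample an instance, solve it, and collect its optimal action."""
    representatives = set()
    for _ in range(num_samples):
        mean_rewards = sample_instance(rng)
        # Assumption: mean rewards are known, so solving is a plain argmax.
        best_action = max(range(len(mean_rewards)), key=mean_rewards.__getitem__)
        representatives.add(best_action)
    return representatives


rng = random.Random(0)
family = make_clustered_family(num_actions=100, peaks=[10, 50, 90])
representatives = select_representative_actions(family, num_samples=50, rng=rng)
print(sorted(representatives))
```

In this toy family the collected set can never contain more than the three peak actions, however large the nominal action space is, which is the kind of reduction the abstract attributes to correlated actions.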
Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations
Authors: Su-Hyeon Kim, Yo-Sub Han
First: 2026-05-11T02:01:18+00:00 · Latest: 2026-05-11T02:01:18+00:00
Abstract
Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
Authors: Yi-Siang Wang, Kuan-Yu Chen, Yu-Chen Den, Darby Tien-Hao Chang
First: 2026-05-07T12:27:18+00:00 · Latest: 2026-05-11T01:17:23+00:00
Comments: 19 pages, 4 figures
Abstract
Large language models (LLMs) have recently been adapted to tabular prediction by serializing structured features into natural language, but their performance in low-data regimes remains limited compared to gradient-boosted decision trees (GBDTs). In this work, we revisit the boosting paradigm, traditionally associated with tree ensembles, and ask whether it can be applied as a general training principle for LLM fine-tuning. We propose BoostLLM, a framework that transforms parameter-efficient fine-tuning into a multi-round residual optimization process by training sequential PEFT adapters as weak learners. To incorporate tabular inductive bias, BoostLLM integrates decision-tree paths as a second input view alongside raw features; analysis reveals that the path view acts as a structured teacher in early training steps before the model shifts toward feature-driven representations. Empirically, BoostLLM achieves consistent improvements over standard fine-tuning across multiple LLM backbones and datasets, matching or surpassing XGBoost across a wide range of shot counts and outperforming GPT-4o-based methods with a 4B model. We further show that the framework scales: pairing with stronger tree models and extended boosting horizons yields additional gains under appropriate stabilization. These results suggest that boosting can serve as a general training principle for LLM fine-tuning, particularly in low-data regimes for structured data.
Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness
Authors: Pratik Jayarao, Himanshu Gupta, Neeraj Varshney, Chaitanya Dwivedi
Venue: NeurIPS
First: 2025-09-09T18:36:02+00:00 · Latest: 2026-05-10T19:19:49+00:00
Comments: Accepted in 2025 NeurIPS Foundations of Reasoning in Language Models Workshop
Abstract
As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of "thinking" and "non-thinking" LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Thinking models achieve approximately 10 percentage points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (>8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting, and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our findings provide systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm, not only in accuracy and efficiency but also in robustness.
Learning Multi-Indicator Weights for Data Selection: A Joint Task-Model Adaptation Framework with Efficient Proxies
Authors: Jingze Song, Zihao Chen, Wenqing Chen, Zibin Zheng
Venue: IJCAI 2026
First: 2026-05-10T17:30:16+00:00 · Latest: 2026-05-10T17:30:16+00:00
Comments: This work has been accepted at IJCAI 2026
Abstract
Data selection is a key component of efficient instruction tuning for large language models, as recent work has shown that data quality often matters more than data quantity. Accordingly, prior studies have introduced various multi-dimensional heuristics to evaluate and filter instruction data. However, most existing methods rely on static task-agnostic and model-agnostic weighting schemes, which overlook the varying requirements of specific downstream tasks and the differing pre-existing capabilities of models. In this paper, we propose a framework for learning multi-indicator weights that jointly adapts data selection to both the downstream task and the specific model. Our method identifies optimal weight configurations without full-scale fine-tuning by utilizing in-context learning (ICL) signals on compact tiny-validation sets. These signals serve as efficient performance proxies that ensure high-fidelity evaluation at minimal computational cost. Experiments across multiple benchmarks and model families, including Mistral, Qwen, and Llama, show that the approach achieves performance comparable to or exceeding full-dataset tuning while using only 30% of the training samples on GSM8K. Furthermore, our analysis reveals a trade-off between semantic diversity and logical complexity in reasoning tasks, highlighting the necessity of joint task-model adaptation.
K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs
Authors: Hao Liang, Qihan Lin, Zhaoyang Han, Xiaochen Ma, Zhen Hao Wong, Meiyi Qiang, Linzhuang Sun, Wentao Zhang
First: 2026-05-10T16:24:26+00:00 · Latest: 2026-05-10T16:24:26+00:00
Abstract
Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.
LLM-Driven Performance-Space Augmentation for Meta-Learning-Based Algorithm Selection
Authors: Darren Zhu, Daren Ler
First: 2026-05-10T13:00:43+00:00 · Latest: 2026-05-10T13:00:43+00:00
Abstract
Meta-learning for algorithm selection relies on a meta-dataset in which each row corresponds to a supervised learning dataset described by meta-features and labelled with a target value that is associated with algorithm choice (typically, some function of algorithm performance). A persistent limitation is that the number of curated real-world datasets is small, resulting in sparse meta-datasets that constrain meta-learner generalisation. In this paper, we address this problem by augmenting the meta-dataset with synthetic regression datasets produced via a large language model (LLM), with generation steered toward target regions of a low-dimensionality performance space. In our experiments, we adopt a two-dimensional geometric setting defined by the cross-validated $R^2$ scores of two anchor algorithms, known as landmarkers. We compare two augmentation strategies: (1) uniform sampling, which distributes synthetic datasets across the performance space; and (2) margin-based sampling, which concentrates them near the decision boundary where landmarker preference is most ambiguous. Across 42 real-world UCI regression datasets and 730 synthetic datasets, both strategies substantially improve meta-learner performance over the unaugmented baseline under regression and multi-label evaluation formulations. However, uniform augmentation consistently outperforms margin-based augmentation, achieving a 17.47% relative reduction in Hamming loss, a 100.41% relative improvement in subset accuracy, and a +6.09% relative gain in pooled out-of-fold $R^2$. These results lead us to postulate a central thesis: the performance of algorithms resides on a low-dimensional performance manifold, whose reconstruction bias may be minimised by user-guided LLMs that seek to maximise uniform $ε$-cover, and consequently, lead to improved meta-learning for algorithm selection.
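The contrast between the two augmentation strategies can be made concrete with a small sketch of how target points in the two-dimensional performance space might be drawn before being handed to the generator. Everything here is an assumption for illustration: the abstract does not specify the sampling procedure, the $R^2$ range is taken to be [0, 1], the decision boundary is taken to be the diagonal where both landmarkers score equally, and the function names and the `margin` parameter are hypothetical.

```python
import random


def uniform_targets(n, rng):
    """Spread target (R2_a, R2_b) points evenly over the assumed [0, 1]^2 space."""
    return [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)) for _ in range(n)]


def margin_targets(n, margin, rng):
    """Concentrate targets near the diagonal R2_a == R2_b (assumed boundary)."""
    targets = []
    for _ in range(n):
        # Midpoint along the diagonal, kept far enough from the corners
        # that both coordinates stay inside [0, 1].
        mid = rng.uniform(margin / 2, 1.0 - margin / 2)
        # Signed gap between the two landmarker scores, at most `margin`.
        gap = rng.uniform(-margin, margin)
        targets.append((mid + gap / 2, mid - gap / 2))
    return targets


rng = random.Random(0)
uniform = uniform_targets(5, rng)
near_boundary = margin_targets(5, margin=0.1, rng=rng)
print(uniform)
print(near_boundary)
```

Uniform targets cover regions where one landmarker clearly wins as well as ambiguous ones, whereas margin targets all have a score gap of at most `margin`; the abstract reports that the former worked better in the authors' experiments.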
SEMASIA: A Large-Scale Dataset of Semantically Structured Latent Representations
Authors: Mario Edoardo Pandolfo, Enrico Grimaldi, Lorenzo Marinucci, Leonardo Di Nino, Simone Fiorellino, Sergio Barbarossa, Paolo Di Lorenzo
First: 2026-05-10T11:42:36+00:00 · Latest: 2026-05-10T11:42:36+00:00
Abstract
Latent representations learned by neural networks often exhibit semantic structure, where concept similarity is reflected by geometric proximity in embedding space. However, comparing such spaces across models remains difficult: changes in architecture, pretraining data, objective, or random seed can yield embeddings with similar content but incompatible geometry. This latent space alignment problem is central to interpretability, transfer and multimodal learning, federated systems, and semantic communication; however, progress remains limited by the lack of large-scale, model-diverse, and metadata-rich benchmarks. To address this gap, we introduce SEMASIA, a large-scale collection of latent representations extracted from approximately 1,700 pretrained vision models across eight standard image-classification benchmarks. SEMASIA pairs embeddings with structured metadata describing architectures, training regimes, pretraining sources, and model scale. We demonstrate three applications of the resource. First, we analyze the conceptual organization of individual latent spaces, showing consistent prototype-like clustering and hierarchical semantic neighborhoods across models and datasets. Second, we benchmark supervised alignment mappings between latent spaces using reconstruction error and downstream task performance. Third, we perform a large-scale regression analysis of how pretraining-data complexity, specialization, transfer learning, augmentation, and model scale relate to geometric and probing properties of embeddings. By coupling representational scale with standardized metadata, SEMASIA provides a reproducible foundation for studying latent geometry, evaluating alignment methods, and developing next-generation heterogeneous and interoperable AI systems.
Let the Target Select for Itself: Data Selection via Target-Aligned Paths
Authors: Huitao Yang, Hengzhi He, Guang Cheng
First: 2026-05-10T08:07:09+00:00 · Latest: 2026-05-10T08:07:09+00:00
Abstract
Targeted data selection aims to identify training samples from a large candidate pool that improve performance on a specific downstream task. Many recent methods estimate candidate utility by aggregating local attribution scores along a trajectory induced by the candidate pool. When the pool is heterogeneous, however, this reference trajectory may be misaligned with the dynamics of a target-aligned selected subset, creating what we call reference path bias. We propose an alternative reference path: a validation-induced flow obtained from a short, capacity-limited warmup on the available target validation proxy. Along this path, candidates are scored by a normalized endpoint loss drop, yielding a simple zero-order selection rule that requires no candidate gradients or Hessian approximations. Across controlled logistic, vision, and instruction-tuning experiments, this score is competitive with strong dynamic attribution baselines while substantially reducing warmup and storage cost. Moreover, since the reference trajectory is decoupled from any specific candidate pool, the same compact warmup can be reused across additional pools without recomputing the trajectory.
Supervised Dimensionality Reduction Revisited: Why LDA on Frozen CNN Features Deserves a Second Look
Authors: Indar Kumar, Girish Karhana, Sai Krishna Jasti, Ankit Hemant Lade
First: 2026-04-05T01:39:28+00:00 · Latest: 2026-05-10T07:49:17+00:00
Comments: 11 pages, 5 figures, 5 tables. Code available at https://github.com/IndarKarhana/lda-image-classification
Abstract
Frozen pretrained image representations are widely used for transfer learning: a backbone is kept fixed, feature vectors are extracted, and a lightweight classifier is trained on top. This pipeline usually feeds the full feature vector to the classifier, even when the target task has far fewer classes than the pretraining task. We revisit a classical alternative: supervised dimensionality reduction with Linear Discriminant Analysis (LDA) before linear probing.
We evaluate ten dimensionality-reduction strategies on frozen features from six backbones -- ResNet-18, ResNet-50, MobileNetV3-Small, EfficientNet-B0, ViT-B/16, and DINOv2-ViT-S/14 -- across CIFAR-100, Tiny ImageNet, and CUB-200-2011. Under a fixed logistic-regression protocol, LDA improves accuracy over full features in 11 of 12 coarse-grained configurations, with gains up to 4.5 percentage points while reducing feature dimensionality by 48-87%. The same projection consistently hurts on fine-grained CUB-200, where full features win across all six backbones. This establishes a practical boundary condition: LDA is useful when class-level structure is coarse enough to be captured by mean-separating directions, but it can discard subtle cues needed for fine-grained recognition.
We also compare LDA with PCA, PCA+LDA, regularized LDA, Local Fisher Discriminant Analysis, Neighbourhood Components Analysis, and three lightweight LDA extensions. The results show that plain LDA offers the best accuracy-cost tradeoff for most coarse-grained settings, while more complex supervised reduction methods rarely justify their additional cost. Overall, the study provides concrete guidance for when post-hoc supervised projection should, and should not, be inserted into frozen-feature image classification pipelines.
From Regression to Inference: Meta-Learning Predictors for Neural Architecture Search
Authors: Liping Deng, MingQing Xiao
First: 2026-05-10T03:34:39+00:00 · Latest: 2026-05-10T03:34:39+00:00
Abstract
Prediction-based approaches are widely used in neural architecture search (NAS), where a predictor estimates the performance of candidate architectures to guide selection. However, existing predictors are typically trained via supervised regression on limited samples, leading to overfitting and poor generalization to unseen architectures. In this work, we propose a fundamentally different formulation that models performance prediction as a conditional function inference problem using a Convolutional Neural Process (ConvNP) with meta-learning capabilities. Instead of fitting a fixed mapping to limited samples, our approach meta-learns to infer performance from partial observations by training with context-target splits across a group of synthesized tasks, explicitly optimizing for generalization under data scarcity and aligning the training procedure with the deployment setting in NAS. We further design simple yet effective meta-features for cell-based architectures and evaluate our method on NAS-Bench-101 and NAS-Bench-201. Extensive experiments show that our approach consistently improves top-K ranking quality and achieves the state-of-the-art architecture selection using limited samples.