Towards A Generative Protein Evolution Machine with DPLM-Evo
Authors: Xinyou Wang, Liang Hong, Jiasheng Ye, Zaixiang Zheng, Yu Li, Shujian Huang, Quanquan Gu
Venue: ICML 2026
First: 2026-04-30T19:59:07+00:00 · Latest: 2026-05-13T17:52:23+00:00
Comments: A peer-reviewed version was accepted to ICML 2026
Abstract
Proteins are shaped by gradual evolution under biophysical and functional constraints. Protein language models learn rich evolutionary constraints from large-scale sequences, and discrete diffusion-based protein language models~(\eg, DPLMs) are promising for both understanding and generation. However, existing DPLMs typically rely on masking-based absorbing diffusion that contradicts a simple biological intuition: proteins evolve through accumulated edits, not by emerging from masks. Consequently, these frameworks lack explicit pretraining objectives for substitution and insertion/deletion (indel) operations, limiting both optimization-style post-editing and flexible guided generation. To address these limitations, we present DPLM-Evo, an evolutionary discrete diffusion framework that explicitly predicts substitution, insertion, and deletion operations during denoising. DPLM-Evo decouples an upsampled-length latent alignment space from the variable-length observed sequence space, which makes indel-aware generation tractable and enables adaptive scaffold growth throughout the process with negligible computational overhead. To better align substitutions with real evolution, we further introduce a contextualized evolutionary noising kernel that produces biologically informed, context-dependent mutation patterns. Across tasks, DPLM-Evo improves sequence understanding and achieves state-of-the-art mutation effect prediction performance on ProteinGym in the single-sequence setting. It also enables variable-length simulated evolution, and post-editing/optimization of existing proteins via explicit edit trajectories.
Summary / 总结
Proteins are shaped by gradual evolution under biophysical and functional constraints.
ENSEMBITS: an alphabet of protein conformational ensembles
Authors: Kaiwen Shi, Carlos Oliver
First: 2026-05-13T17:08:41+00:00 · Latest: 2026-05-13T17:08:41+00:00
Abstract
Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs only capture local geometry of static structures, and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits address challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariance encoding of variable-size ensembles, and conquering sparsity in dynamics data. Trained with a Residual VQ-VAE using a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction, and is the strongest standalone structural tokenizer on an token-conditioned ANOVA test on per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics token from one single predicted structure, which alleviates dynamics data sparsity. As the field moves from static structure prediction toward ensemble generation, Ensembits offer the discrete vocabulary needed to bring dynamics into protein language modeling and design.
Summary / 总结
Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis.
Understanding Catastrophic Forgetting In LoRA via Mean-Field Attention Dynamics
Authors: Hugo Koubbi, Louis Hernandez, Matthieu Boussard
Venue: ICML 2026
First: 2024-02-23T16:26:01+00:00 · Latest: 2026-05-13T16:26:02+00:00
Comments: New version accepted at ICML 2026, with new results and without previous results
Abstract
Low-Rank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting. We study forgetting through a tractable _mean-field self-attention_ toy model, where tokens evolve as an interacting particle system and LoRA acts as a low-rank perturbation. Using tools from partial differential equations and dynamical systems, we characterize regimes suggesting a phase transition between forgetting and non-forgetting behavior. We show that one phase transition appears with respect to the norm of the perturbation, and the other with respect to the depth of the Transformers. We further bound the time-to-deviation in terms of the perturbation size and spectral quantities, and corroborate the predicted trends with experiments and exploratory analyses on real models under LoRA fine-tuning.
Summary / 总结
Low-Rank Adaptation (LoRA) is the dominant parameter-efficient fine-tuning method due to its favorable compute-performance trade-off, yet it suffers from catastrophic forgetting.
RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
Authors: Andrea Morandi
First: 2026-05-13T15:48:16+00:00 · Latest: 2026-05-13T15:48:16+00:00
Abstract
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold porting the Feynman Learning Technique (study $\to$ teach $\to$ find gaps $\to$ simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. Stage 3 acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise accuracy climbs from 64.6% (single-shot vanilla prompt) to 78.6% (RTLC critique-of-10) -- an absolute 14.0-percentage-point gain. RTLC also beats N=10 self-consistency majority voting (77.7%) and a zero-shot first candidate (74.0%). A clean three-step ablation attributes +9.4 pp to the Teach-to-Learn scaffold, +3.7 pp to N=10 marginalisation, and +0.9 pp to explicit critique. We discuss the cost-accuracy frontier (RTLC sits above self-consistency at every working point), the error-budget breakdown across the four JudgeBench categories (knowledge, reasoning, math, coding), and how RTLC composes orthogonally with post-hoc judge-score calibration, with the two interventions compounding multiplicatively in practice.
Summary / 总结
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items.
Few-shot Multi-Task Learning of Linear Invariant Features with Meta Subspace Pursuit
Authors: Chaozhi Zhang, Lin Liu, Xiaoqun Zhang
Venue: CSIAM Transactions on Applied Mathematics (2026)
First: 2024-09-04T13:44:22+00:00 · Latest: 2026-05-13T15:34:31+00:00
Abstract
Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets. One effective strategy to mitigate the issue of insufficient data is to first harness information from other data sources possessing certain similarities in the study design stage, and then employ the multi-task or meta learning framework in the analysis stage. In this paper, we focus on multi-task (or multi-source) linear models whose coefficients across tasks share an invariant low-rank component, a popular structural assumption considered in the recent multi-task or meta learning literature. Under this assumption, we propose a new algorithm, called Meta Subspace Pursuit (abbreviated as Meta-SP), that provably learns this invariant subspace shared by different tasks. Under this stylized setup for multi-task or meta learning, we establish both the algorithmic and statistical guarantees of the proposed method. Extensive numerical experiments are conducted, comparing Meta-SP against several competing methods, including popular, off-the-shelf model-agnostic meta learning algorithms such as ANIL. These experiments demonstrate that Meta-SP achieves superior performance over the competing methods in various aspects.
Summary / 总结
Data scarcity poses a serious threat to modern machine learning and artificial intelligence, as their practical success typically relies on the availability of big datasets.
From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
Authors: Lei Yang, Leiyu Pan, Bojian Xiong, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Yuqi Ren, Deyi Xiong
First: 2025-07-12T08:54:05+00:00 · Latest: 2026-05-13T12:42:40+00:00
Abstract
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in the follow-up.
Summary / 总结
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages.
Continual Learning with Multilingual Foundation Model
Authors: Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen
First: 2026-05-13T12:10:47+00:00 · Latest: 2026-05-13T12:10:47+00:00
Comments: Final Workshop of the 9th evaluation campaign EVALITA 2026
Abstract
This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges like data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for the evaluation purposes where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 with masked language modeling pre-training, RUN 3 and RUN 4 are previous predictions refined via language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.
Summary / 总结
This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse.
Teaching and Learning under Deductive Errors
Authors: Jan Arne Telle, Brigt Håvardstun, Jose Hernandez-Orallo
Venue: neurips
First: 2026-05-13T11:43:01+00:00 · Latest: 2026-05-13T11:43:01+00:00
Comments: 15 pages, preprint neurips
Abstract
Most models of machine teaching and learning assume the learner makes no errors in its internal deductive inference. However, humans and large language models in few-shot learning regimes are two important examples of learners where this does not hold. They fail on some consistency checks, and they can fail stochastically. In this paper we introduce a teaching and learning framework that takes these deductive errors into account. We specifically study the case of machine teaching, as different characterizations of the teacher can account for both machine teaching and learning. In an overhauled Probably Approximately Correct (PAC) setting, we study theoretically that, for some estimated error level, the teacher must find a PAC teaching set that with high probability will lead the learner to guess a hypothesis that is approximately correct. We study the computational complexity of six different problems related to computing optimal PAC teaching sets. We give XP algorithms parametrized by size of teaching set, with tight runtime bounds under standard complexity assumptions like ETH. These results are complemented with a small experimental study of which teaching and learning protocols can best represent the observed behavior in some LLM teaching sessions.
Summary / 总结
Most models of machine teaching and learning assume the learner makes no errors in its internal deductive inference.
Query-Conditioned Test-Time Self-Training for Large Language Models
Authors: Chaehee Song, Minseok Seo, Yeeun Seong, Doyi Kim, Changick Kim
First: 2026-05-13T11:27:40+00:00 · Latest: 2026-05-13T11:27:40+00:00
Comments: 17 pages, 4 figures
Abstract
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time. While such test-time scaling can be effective, it cannot correct model misconceptions or adapt the model to the specific structure of an individual query. Test-time optimization addresses this limitation by enabling parameter updates during inference, but existing approaches either rely on external data or optimize generic self-supervised objectives that lack query-specific alignment. In this work, we propose Query-Conditioned Test-Time Self-Training (QueST), a framework that adapts model parameters during inference using supervision derived directly from the input query. Our key insight is that the input query itself encodes latent signals sufficient for constructing structurally related problem--solution pairs. Based on this, QueST generates such query-conditioned pairs and uses them as supervision for parameter-efficient fine-tuning at test time. The adapted model is then used to produce the final answer, enabling query-specific adaptation without any external data. Across seven mathematical reasoning benchmarks and the GPQA-Diamond scientific reasoning benchmark, QueST consistently outperforms strong test-time optimization baselines. These results demonstrate that query-conditioned self-training is an effective and practical paradigm for test-time adaptation in LLMs.
Summary / 总结
Large language models (LLMs) are typically deployed with fixed parameters, and their performance is often improved by allocating more computation at inference time.
Sample-Efficient Optimisation over the Outputs of Generative Models
Authors: Samuel Willis, Paul Duckworth, Jack Simons, Aleksandra Kalisz, Krisztina Sinkovics, Noam Ghenassia, Shikha Surana, Henry T. Oldroyd, Alexandru I. Stere, Dragos D Margineantu, Carl Henrik Ek, Henry Moss, Erik Bodin
First: 2025-09-28T10:50:06+00:00 · Latest: 2026-05-13T11:18:10+00:00
Abstract
Modern generative AI models, such as diffusion and flow matching models, can sample from rich data distributions. However, many applications, especially in science and engineering, require more than drawing samples from the model distribution: they require searching within this distribution for samples that optimise task-specific criteria. In this work, we propose O3 (Optimisation Over the Outputs of Generative Models), a method for sample-efficient black-box optimisation over continuous-variable diffusion and flow-matching models. O3 is built around surrogate latent spaces: low-dimensional Euclidean embeddings that can be extracted from a generative model without additional training. The resulting representations have controllable dimensionality and support the direct application of standard optimisation algorithms. We show, on image and protein design tasks, that surrogate-space optimisation finds substantially higher-scoring samples than standard sampling or optimisation in the original latent space. Our method is model- and optimiser-agnostic, incurs negligible additional cost over standard generation, and requires no retraining or fine-tuning of the generative model.
Summary / 总结
Modern generative AI models, such as diffusion and flow matching models, can sample from rich data distributions.
SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
Authors: Camile Lendering, Erkut Akdag, Egor Bondarev
Venue: CVPR 2026
First: 2026-02-26T13:52:57+00:00 · Latest: 2026-05-13T09:15:01+00:00
Comments: Accepted to CVPR 2026. Revised version with corrected AU-PRO evaluation and recomputed metrics
Abstract
Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category. Recent few-shot methods achieve strong results employing foundation-model features, but typically rely on memory banks, auxiliary datasets, or multi-modal tuning of vision-language models. We therefore question whether such complexity is necessary given the feature representations of vision foundation models. To answer this question, we introduce SubspaceAD, a training-free method, that operates in two simple stages. First, patch-level features are extracted from a small set of normal images by a frozen DINOv2 backbone. Second, a Principal Component Analysis (PCA) model is fit to these features to estimate the low-dimensional subspace of normal variations. At inference, anomalies are detected via the reconstruction residual with respect to this subspace, producing interpretable and statistically grounded anomaly scores. Despite its simplicity, SubspaceAD achieves state-of-the-art performance across one-shot and few-shot settings without training, prompt tuning, or memory banks. In the one-shot anomaly detection setting, SubspaceAD achieves image-level and pixel-level AUROC of 97.1% and 97.5% on the MVTec-AD dataset, and 93.2% and 98.2% on the VisA dataset, respectively, surpassing prior state-of-the-art results. Code and demo are available at https://github.com/CLendering/SubspaceAD.
Summary / 总结
Detecting visual anomalies in industrial inspection often requires training with only a few normal images per category.
Instructions Shape Production of Language, not Processing
Authors: Andreas Waldis, Leshem Choshen, Yufang Hou, Yotam Perlitz
First: 2026-05-11T20:21:04+00:00 · Latest: 2026-05-13T08:57:14+00:00
Abstract
Instructions trigger a production-centered mechanism in language models. Through a cognitively inspired lens that separates language processing and production, we reveal this mechanism as an asymmetry between the two stages by probing task-specific information layer-wise across five binary judgment tasks. Specifically, we measure how instruction tokens shape information both when sample tokens, the input under evaluation, are processed and when output tokens are produced. Across prompting variations, task-specific information in sample tokens remains largely stable and correlates only weakly with behavior, whereas the same information in output tokens varies substantially and correlates strongly with behavior. Attention-based interventions confirm this pattern causally: blocking instruction flow to all subsequent tokens reduces both behavior and information in output tokens, whereas blocking it only to sample tokens has minimal effect on either. The asymmetry generalizes across model families and tasks, and becomes sharper with model scale and instruction-tuning, both of which disproportionately affect the production stage. Our findings suggest that understanding model capabilities requires jointly assessing internals and behavior, while decomposing the internal perspective by token position to distinguish the processing of input tokens from the production of output tokens.
Summary / 总结
Instructions trigger a production-centered mechanism in language models.
PanoWorld: Towards Spatial Supersensing in 360$^\circ$ Panorama World
Authors: Changpeng Wang, Xin Lin, Junhan Liu, Yuheng Liu, Zhen Wang, Donglian Qi, Yunfeng Yan, Xi Chen
First: 2026-05-13T08:31:22+00:00 · Latest: 2026-05-13T08:31:22+00:00
Abstract
Multimodal large laboratory models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception. For navigation, robotic search, and 3D scene understanding, 360-degree panoramic sensing offers a form of supersensing by capturing the entire surrounding environment at once. However, existing MLLM pipelines typically decompose panoramas into multiple perspective views, leaving the spherical structure of equirectangular projection (ERP) largely implicit. In this paper, we study pano-native understanding, which requires an MLLM to reason over an ERP panorama as a continuous, observer-centered space. To this end, we first define the key abilities for pano-native understanding, including semantic anchoring, spherical localization, reference-frame transformation, and depth-aware 3D spatial reasoning. We then build a large-scale metadata construction pipeline that converts mixed-source ERP panoramas into geometry-aware, language-grounded, and depth-aware supervision, and instantiate these signals as capability-aligned instruction tuning data. On the model side, we introduce PanoWorld with Spherical Spatial Cross-Attention, which injects spherical geometry into the visual stream. We further construct PanoSpace-Bench, a diagnostic benchmark for evaluating ERP-native spatial reasoning. Experiments show that PanoWorld substantially outperforms both proprietary and open-source baselines on PanoSpace-Bench, H* Bench, and R2R-CE Val-Unseen benchmarks. These results demonstrate that robust panoramic reasoning requires dedicated pano-native supervision and geometry-aware model adaptation. All source code and proposed data will be publicly released.
Summary / 总结
Multimodal large laboratory models (MLLMs) still struggle with spatial understanding under the dominant perspective-image paradigm, which inherits the narrow field of view of human-like perception.
Continual Fine-Tuning of Large Language Models via Program Memory
Authors: Hung Le, Svetha Venkatesh
First: 2026-05-13T08:26:37+00:00 · Latest: 2026-05-13T08:26:37+00:00
Comments: 18 page, preprint
Abstract
Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA), has become a standard approach for adapting Large Language Models (LLMs) under limited compute. However, in continual settings where models are updated sequentially with small datasets, conventional LoRA updates struggle to balance rapid adaptation and knowledge retention. Existing methods typically treat the low-rank space as a homogeneous update region, lacking mechanisms to regulate how short-term updates are consolidated over time. We propose a continual LoRA framework with \textbf{Pro}gram memory, inspired by \textbf{C}omplementary \textbf{L}earning Systems in neuroscience. Our approach, dubbed \textbf{ProCL}, organizes LoRA adapters into structured program memory slots that are dynamically retrieved through input-conditioned attention. This enables rapid and localized adaptation, encouraging similar inputs to reuse shared adapter regions while reserving unused capacity for future data. The slots are then combined with the underlying adapter, which maintains a distributed representation that gradually accumulates knowledge across tasks to balance plasticity and stability. Our method operates entirely within the LoRA parameterization and incurs no additional inference cost. Experiments on diverse benchmarks demonstrate improved retention and reduced catastrophic forgetting over other continual LoRA strategies.
Summary / 总结
Parameter-Efficient Fine-Tuning (PEFT), particularly Low-Rank Adaptation (LoRA), has become a standard approach for adapting Large Language Models (LLMs) under limited compute.
A$_3$B$_2$: Adaptive Asymmetric Adapter for Alleviating Branch Bias in Vision-Language Image Classification with Few-Shot Learning
Authors: Yiyun Zhou, Zhonghua Jiang, Wenkang Han, Kunxi Li, Mingjing Xu, Chang Yao, Jingyuan Chen
Venue: IJCAI 2026
First: 2026-05-13T08:24:55+00:00 · Latest: 2026-05-13T08:24:55+00:00
Comments: Accepted by IJCAI 2026
Abstract
Efficient transfer learning methods for large-scale vision-language models ($e.g.$, CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification. Through extensive analysis, we reveal a Branch Bias issue in vision-language image classification: adapting the image encoder does not always improve performance under out-of-distribution settings. Motivated by this observation, we propose A$_3$B$_2$, an Adaptive Asymmetric Adapter that alleviates Branch Bias in few-shot learning. A$_3$B$_2$ introduces Uncertainty-Aware Adapter Dampening (UAAD), which automatically suppresses image-branch adaptation when prediction uncertainty is high, enabling soft and data-driven control without manual intervention. Architecturally, A$_3$B$_2$ adopts a lightweight asymmetric design inspired by mixture-of-experts with Load Balancing Regularization. Extensive experiments on three few-shot image classification tasks across 11 datasets demonstrate that A$_3$B$_2$ consistently outperforms 11 competitive prompt- and adapter-based baselines.
Summary / 总结
Efficient transfer learning methods for large-scale vision-language models ($e.g.$, CLIP) enable strong few-shot transfer, yet existing adaptation methods follow a fixed fine-tuning paradigm that implicitly assumes a uniform importance of the image and text branches, which has not been systematically studied in image classification.
When to Transfer: Adaptive Source Selection for Positive Transfer in Linear Models
Authors: Hamza Cherkaoui, Hélène Halconruy, Yohan Petetin
First: 2025-10-19T20:03:48+00:00 · Latest: 2026-05-13T08:20:57+00:00
Abstract
In many business settings, task-specific labeled data are scarce or costly to obtain, limiting supervised learning on a target task. A classical response is transfer learning (TL). Many TL works study how to transfer information from related sources. We study, for linear regression and classification, when to transfer via sample sharing: in a multi-source setting, we greedily decide from which sources and how many samples to incorporate into the target dataset. Our method uses an accept/reject rule based on a data-dependent estimate of the transfer gain, i.e the marginal decrease in target predictive error, computed conditionally on the observed target samples. We analyze our approach and show that how the derived statistical test enforces positive transfer with high probability. Under additional standard conditions, we also study the transfer gain itself and characterize when transfer is beneficial. Experiments on synthetic and real data show consistent gains over classical and recent strong baselines while avoiding negative transfer.
Summary / 总结
In many business settings, task-specific labeled data are scarce or costly to obtain, limiting supervised learning on a target task.
GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives
Authors: Alexandre Le Mercier, Chris Develder, Thomas Demeester
First: 2026-05-09T16:07:23+00:00 · Latest: 2026-05-13T07:49:42+00:00
Comments: 46 pages, 16 figures
Abstract
In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.
Summary / 总结
In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses.
Preserve-Then-Quantize: Balancing Rank Budgets for Quantization Error Reconstruction in LLMs
Authors: Yoonjun Cho, Dongjae Jeon, Soeun Kim, Moongyu Jeon, Albert No
Venue: ICML 2026
First: 2026-02-02T12:02:21+00:00 · Latest: 2026-05-13T07:15:58+00:00
Comments: Accepted at ICML 2026. Project page: https://ai-isl.github.io/srr
Abstract
Quantization Error Reconstruction (QER) reduces accuracy loss in Post-Training Quantization (PTQ) by approximating weights as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$, using a rank-$r$ correction to reconstruct quantization error. Prior methods devote the full rank budget to error reconstruction, which is suboptimal when $\mathbf{W}$ has intrinsic low-rank structure and quantization corrupts dominant directions. We propose Structured Residual Reconstruction (SRR), a rank-allocation framework that preserves the top-$k$ singular subspace of the activation-scaled weight before quantization, quantizes only the residual, and uses the remaining rank $r-k$ for error reconstruction. We derive a theory-guided criterion for selecting $k$ by balancing quantization-exposed energy and unrecoverable error under rank constraints. We further show that resulting $\mathbf{Q} + \mathbf{L}\mathbf{R}$ parameterization naturally supports Quantized Parameter-Efficient Fine-Tuning (QPEFT), and stabilizes fine-tuning via gradient scaling along preserved directions. Experiments demonstrate consistent perplexity reductions across diverse models and quantization settings in PTQ, along with a 5.9 percentage-point average gain on GLUE under 2-bit QPEFT. The project page is available at https://ai-isl.github.io/srr.
Summary / 总结
Quantization Error Reconstruction (QER) reduces accuracy loss in Post-Training Quantization (PTQ) by approximating weights as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$, using a rank-$r$ correction to reconstruct quantization error.
Does language matter for spoken word classification? A multilingual generative meta-learning approach
Authors: Batsirayi Mupamhi Ziki, Louise Beyers, Ruan van der Merwe
First: 2026-05-13T06:54:51+00:00 · Latest: 2026-05-13T06:54:51+00:00
Abstract
Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification. However, the meta-learning approach remains under-explored in multilingual spoken word classification. In this paper, we apply the Generative Meta-Continual Learning algorithm to spoken word classification. The generative nature of this algorithm makes it viable for use in application, and the meta-learning aspect promotes generalisation, which is crucial in a multilingual setting. We train monolingual models on English, German, French, and Catalan, a bilingual model on English and German, and a multilingual model on all four languages. We find that although the multilingual model performs best, the differences between model performance is unexpectedly low. We also find that the hours of unique data seen during training seems to be a stronger performance indicator than the number of languages included in the training data.
Summary / 总结
Meta-learning has been shown to have better performance than supervised learning for few-shot monolingual spoken word classification.
Controlling Logical Collapse in LLMs via Algebraic Ontology Projection over F2
Authors: Hisashi Miyashita, Mgnite Inc
First: 2026-05-13T04:01:29+00:00 · Latest: 2026-05-13T04:01:29+00:00
Abstract
Do large language models internally encode ontological relations in a formally verifiable algebraic structure? We introduce Algebraic Ontology Projection (AOP), which projects LLM hidden states into the Galois Field F2 under Liskov Substitution Principle constraints, using only 42 relational pairs as algebraic keys. AOP achieves up to 93.33% zero-shot inclusion accuracy on unseen concept pairs (Gemma-2 Instruct with optimized prompt), with consistent 86.67% accuracy observed across multiple model families -- with no model tuning, but through prompt alone.
This algebraic structure is strongly layer-dependent. We introduce Semantic Crystallisation (SC), a metric that quantifies F2 constraint satisfaction relative to a random baseline and predicts zero-shot accuracy without held-out data. System prompts act as algebraic boundary conditions: only their combination with instruction tuning prevents Late-layer Collapse -- a systematic degradation of logical consistency in the final layers, observed in 7 of 10 conditions. These findings reframe forward computation as an iterative process of algebraic organisation, and open a path toward LLMs whose logical structure is not merely approximated, but formally accessible.
Summary / 总结
Do large language models internally encode ontological relations in a formally verifiable algebraic structure?
Protein Circuit Tracing via Cross-layer Transcoders
Authors: Darin Tsui, Kunal Talreja, Daniel Saeedi, Amirali Aghazadeh
Venue: ICML 2026
First: 2026-02-12T14:57:57+00:00 · Latest: 2026-05-13T01:22:30+00:00
Comments: Accepted into ICML 2026. 32 pages, 17 figures
Abstract
Protein language models (pLMs) have emerged as powerful predictors of protein structure and function. However, the computational circuits underlying their predictions remain poorly understood. Recent mechanistic interpretability methods decompose pLM representations into interpretable features, but they treat each layer independently and thus fail to capture cross-layer computation, limiting their ability to approximate the full model. We introduce ProtoMech, a framework for discovering computational circuits in pLMs using cross-layer transcoders that learn sparse latent representations jointly across layers to capture the model's full computational circuitry. Applied to the pLM ESM2, ProtoMech recovers 82-89% of the original performance on protein family classification and function prediction tasks. ProtoMech then identifies compressed circuits that use <1% of the latent space while retaining up to 79% of model accuracy, revealing correspondence with structural and functional motifs, including binding, signaling, and stability. Steering along these circuits enables high-fitness protein design, surpassing baseline methods in more than 70% of cases. These results establish ProtoMech as a principled framework for protein circuit tracing.
Summary / 总结
Protein language models (pLMs) have emerged as powerful predictors of protein structure and function.
Small Area Estimation of Case Growths for Timely COVID-19 Outbreak Detection
Authors: Zhaowei She, Zilong Wang, Jagpreet Chhatwal, Turgay Ayer
First: 2023-12-07T07:53:00+00:00 · Latest: 2026-05-13T00:34:41+00:00
Comments: Equal contributions by co-first authors Zhaowei She, Zilong Wang (in alphabetical order)
Abstract
The COVID-19 pandemic has exerted a profound impact on the global economy and continues to exact a significant toll on human lives. The COVID-19 case growth rate stands as a key epidemiological parameter to estimate and monitor for effective detection and containment of the resurgence of outbreaks. A fundamental challenge in growth rate estimation and hence outbreak detection is balancing the accuracy-speed tradeoff, where accuracy typically degrades with shorter fitting windows. In this paper, we provide a transfer learning framework, which we call Transfer Learning Random Forest (TLRF), for an effective implementation of the random forests algorithm that balances this accuracy-speed tradeoff. Specifically, we develop an identification strategy that converts the growth rate estimation problem into a regression task, which enables effective transfer learning across space and time through random forests' adaptive weighting mechanism. As such, through adaptively choosing fitting window sizes based on relevant day-level and county-level features affecting the disease spread, TLRF can accurately estimate case growth rates for counties with small sample sizes. Out-of-sample prediction analysis shows that TLRF outperforms established growth rate estimation methods. Furthermore, we conducted a case study based on outbreak case data from the state of Colorado and showed that TLRF could improve timely detections of outbreaks up to 224% when compared to the decisions made by Colorado's Department of Health and Environment (CDPHE). To demonstrate practical implementation, we developed a publicly available outbreak detection tool that operated from September 2020 through March 2023, receiving substantial attention from policymakers across all 50 states.
Summary / 总结
The COVID-19 pandemic has exerted a profound impact on the global economy and continues to exact a significant toll on human lives.
Test-Time Training with KV Binding Is Secretly Linear Attention
Authors: Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li
Venue: ICML 2026
First: 2026-02-24T18:59:30+00:00 · Latest: 2026-05-12T22:37:21+00:00
Comments: ICML 2026, Webpage: https://research.nvidia.com/labs/sil/projects/tttla/
Abstract
Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity. Project page: https://research.nvidia.com/labs/sil/projects/tttla/.
Summary / 总结
Test-time training (TTT) with KV binding as sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time.
Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Authors: Shravan Doda
First: 2026-05-12T20:30:24+00:00 · Latest: 2026-05-12T20:30:24+00:00
Comments: 8 pages, 2 figures, 7 tables
Abstract
Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout. We study this prefill-time failure mode using SafeSwitch-style probes trained only on clean harmful and benign prompts across three instruction-tuned LLMs. The probes achieve high recall on clean harmful prompts, but miss many jailbreaks and can produce false positives on safety-adjacent benign prompts. Subspace analyses suggest that missed jailbreaks differ from clean benign prompts along directions that are poorly captured by the probe's representational subspace, and increasing probe bottleneck width does not reliably resolve this mismatch. Token-level prefill analyses reveal that probe-visible unsafe evidence often appears earlier in the sequence but is not exposed at the final-token readout, while naive max-pooling over token positions overfires on safe prompts. A simple PCA-HMM trajectory model, trained only on the same clean split, recovers many final-token misses from user-content prefill trajectories without the catastrophic false-positive behavior of naive token pooling, motivating trajectory-aware hidden-state analyses as diagnostic complements to final-token probes
Summary / 总结
Final-token safety probes monitor a single hidden state after prompt prefill, but jailbreak prompts can contain probe-visible unsafe evidence distributed across earlier user-token representations that is missed by this readout.
Is SAM3 ready for pathology segmentation?
Authors: Qiuyu Kong, Shakiba Sharifi, Yiming Wang, Marco Cristani, Zanxi Ruan
First: 2026-04-20T13:10:07+00:00 · Latest: 2026-05-12T20:12:53+00:00
Comments: accept to icip2026
Abstract
Is Segment Anything Model 3 (SAM3) capable in segmenting Any Pathology Images? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. With this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings including zero-shot, few-shot, and supervised with varying prompting strategies. Our extensive evaluation on pathological datasets including NuInsSeg, PanNuke and GlaS, reveals that: 1.text-only prompts poorly activate nuclear concepts; 2.performance is highly sensitive to visual prompt types and budgets; 3.few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and 4.a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.
Summary / 总结
Is Segment Anything Model 3 (SAM3) capable in segmenting Any Pathology Images?
Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
Authors: Robert Belanec, Simon Ostermann, Ivan Srba, Maria Bielikova
First: 2024-08-02T09:00:03+00:00 · Latest: 2026-05-12T18:55:05+00:00
Abstract
Prompt tuning is an efficient solution for training large language models (LLMs). However, current soft-prompt-based methods often sacrifice multi-task modularity, requiring the training process to be fully or partially repeated for each newly added task. While recent work on task vectors applied arithmetic operations on full model weights to achieve the desired multi-task performance, a similar approach for soft-prompts is still missing. To this end, we introduce Task Prompt Vectors, created by element-wise difference between weights of tuned soft-prompts and their random initialization. Experimental results on 12 NLU datasets show that task prompt vectors can be used in low-resource settings to effectively initialize prompt tuning on similar tasks. In addition, we show that task prompt vectors are independent of the random initialization of prompt tuning on 2 different language model architectures. This allows prompt arithmetics with the pre-trained vectors from different tasks. In this way, we provide a competitive alternative to state-of-the-art baselines by arithmetic addition of task prompt vectors from multiple tasks.
Summary / 总结
Prompt tuning is an efficient solution for training large language models (LLMs).
PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark
Authors: Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova
First: 2025-11-26T11:18:06+00:00 · Latest: 2026-05-12T18:52:26+00:00
Abstract
Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-Efficient Fine-Tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the advances in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 7 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Cost Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.
Summary / 总结
Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility.
PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models
Authors: Robert Belanec, Ivan Srba, Maria Bielikova
First: 2025-12-02T13:44:41+00:00 · Latest: 2026-05-12T18:50:51+00:00
Abstract
Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs). Currently, many newly introduced PEFT methods are challenging to replicate, deploy, or compare with one another. To address this, we introduce PEFT-Factory, a unified framework for efficient fine-tuning LLMs using both off-the-shelf and custom PEFT methods. While its modular design supports extensibility, it natively provides a representative set of 19 PEFT methods, 27 classification and text generation datasets addressing 12 tasks, and both standard and PEFT-specific evaluation metrics. As a result, PEFT-Factory provides a ready-to-use, controlled, and stable environment, improving replicability and benchmarking of PEFT methods. PEFT-Factory is a downstream framework that originates from the popular LLaMA-Factory, and is publicly available at https://github.com/kinit-sk/PEFT-Factory.
Summary / 总结
Parameter-Efficient Fine-Tuning (PEFT) methods address the increasing size of Large Language Models (LLMs).
Multi-Stream LLMs: Unblocking Language Models with Parallel Streams of Thoughts, Inputs and Outputs
Authors: Guinan Su, Yanwu Yang, Xueyan Li, Jonas Geiping
First: 2026-05-12T17:47:41+00:00 · Latest: 2026-05-12T17:47:41+00:00
Comments: Preprint, 37 pages. Code at https://github.com/seal-rg/streaming/
Abstract
The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications. However, the core of these systems has not changed much since early instruction-tuned models like ChatGPT. Even advanced AI agents function on message exchange formats, successively exchanging messages with users, systems, with itself (i.e. chain-of-thought) and tools in a single stream of computation. This bottleneck to a single stream in chat models leads to a number of limitations: the agent cannot act (generate output) while reading, and in reverse, cannot react to new information while writing. Similarly, the agent cannot act while thinking and cannot think while reading or acting on information.
In this work, we show that models can be unblocked by switching from instruction-tuning for sequential message formats to instruction-tuning for multiple, parallel streams of computation, splitting each role into a separate stream. Every forward pass of the language model then simultaneously reads from multiple input streams and generates tokens in multiple output streams, all of which causally depend on earlier timesteps. We argue that this data-driven change remedies a number of usability limitations as outlined above, improves model efficiency through parallelization, improves model security through better separation of concerns and can further improve model monitorability.
Summary / 总结
The continued improvements in language model capability have unlocked their widespread use as drivers of autonomous agents, for example in coding or computer use applications.
MetaColloc: Optimization-Free PDE Solving via Meta-Learned Basis Functions
Authors: Zichuan Yang
First: 2026-05-12T16:36:12+00:00 · Latest: 2026-05-12T16:36:12+00:00
Comments: 21 pages, 5 figures, 6 tables
Abstract
Solving partial differential equations (PDEs) with machine learning typically requires training a new neural network for every new equation. This optimization is slow. We introduce MetaColloc. It is an optimization-free and data-free framework that removes this bottleneck completely. We decouple basis discovery from the solving process. We meta-train a dual-branch neural network on diverse Gaussian Random Fields. This offline process creates a universal dictionary of neural basis functions. At test time, we freeze the network. We solve the PDE by assembling a collocation matrix. We find the solution through a single linear least squares step. For non-linear PDEs, we apply the Newton-Raphson method to achieve fast quadratic convergence. Our experiments across six 2D and 3D PDEs show massive improvements. MetaColloc reaches state-of-the-art accuracy on smooth and non-linear problems. It also reduces test-time computation by several orders of magnitude. Finally, we provide a detailed frequency sweep analysis. This analysis reveals a critical mismatch between function approximation and operator stability at extremely high frequencies. This profound finding opens a clear path toward future operator-aware meta-learning.
Summary / 总结
Solving partial differential equations (PDEs) with machine learning typically requires training a new neural network for every new equation.