Structural Causal Bottleneck Models
Authors: Simon Bing, Jonas Wahl, Jakob Runge
First: 2026-03-09T17:50:10+00:00 · Latest: 2026-03-09T17:50:10+00:00
Abstract
We introduce structural causal bottleneck models (SCBMs), a novel class of structural causal models. At the core of SCBMs lies the assumption that causal effects between high-dimensional variables only depend on low-dimensional summary statistics, or bottlenecks, of the causes. SCBMs provide a flexible framework for task-specific dimension reduction while being estimable via standard, simple learning algorithms in practice. We analyse identifiability in SCBMs, connect them to information bottlenecks in the sense of Tishby & Zaslavsky (2015), and illustrate how to estimate them experimentally. We also demonstrate the benefit of bottlenecks for effect estimation in low-sample transfer learning settings. We argue that SCBMs provide an alternative to existing causal dimension reduction frameworks like causal representation learning or causal abstraction learning.
Summary / 总结
We introduce structural causal bottleneck models (SCBMs), a novel class of structural causal models.
Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? A Study of Hierarchical Gating and Calibration
Authors: Víctor Yeste, Paolo Rosso
First: 2026-01-31T21:50:35+00:00 · Latest: 2026-03-09T17:41:24+00:00
Comments: Code: https://github.com/VictorMYeste/human-value-detection, models: https://huggingface.co/papers/2602.00913, 27 pages, 4 figures
Abstract
Human value detection from single sentences is a sparse, imbalanced multi-label task. We study whether Schwartz higher-order (HO) categories help this setting on ValueEval'24 / ValuesML (74K English sentences) under a compute-frugal budget. Rather than proposing a new architecture, we compare direct supervised transformers, hard HO$\rightarrow$values pipelines, Presence$\rightarrow$HO$\rightarrow$values cascades, compact instruction-tuned large language models (LLMs), QLoRA, and low-cost upgrades such as threshold tuning and small ensembles. HO categories are learnable: the easiest bipolar pair, Growth vs. Self-Protection, reaches Macro-$F_1=0.58$. The most reliable gains come from calibration and ensembling: threshold tuning improves Social Focus vs. Personal Focus from $0.41$ to $0.57$ ($+0.16$), transformer soft voting lifts Growth from $0.286$ to $0.303$, and a Transformer+LLM hybrid reaches $0.353$ on Self-Protection. In contrast, hard hierarchical gating does not consistently improve the end task. Compact LLMs also underperform supervised encoders as stand-alone systems, although they sometimes add useful diversity in hybrid ensembles. Under this benchmark, the HO structure is more useful as an inductive bias than as a rigid routing rule.
Summary / 总结
Human value detection from single sentences is a sparse, imbalanced multi-label task.
Exploring Embedding Priors in Prompt-Tuning for Improved Interpretability and Control
Authors: Sergey Sedov, Sumanth Bharadwaj Hachalli Karanam, Venu Gopal Kadamba
First: 2024-12-24T18:18:52+00:00 · Latest: 2026-03-09T17:23:47+00:00
Abstract
Prompt-Tuning is an efficient method for adapting pre-trained language models to new tasks with minimal computational overhead by modifying prompt embeddings. In this work, we investigate how crucial the phenomenon of embedding collapse, frequently observed in Prompt-Tuning, is for the final performance of the model. To address this question, we designed embedding priors and compared them with posteriors of the converged Soft and Deep Prompt-Tuning methods. Our findings suggest that priors strongly affect the position of the tuned embeddings, and models can effectively work with embeddings from different parts of activation spaces, including completely new regions. As the final Prompt-Tuning capabilities are limited, we hypothesize that controllable Prompt-Tuning posteriors may serve as a good starting point for tasks such as chain-of-thought (COT) distillation. Our experiments also show that generated trajectories are not localized in the activation space of the models. However, there are distinct clusters of activations for distant tasks (e.g., NLP and arithmetic), while activations between NLP tasks (e.g., Question-Answering and MLM) lie in the same cluster. These observations raise questions about the importance of a single activation cluster for the generalization abilities of large language models.
Summary / 总结
Prompt-Tuning is an efficient method for adapting pre-trained language models to new tasks with minimal computational overhead by modifying prompt embeddings.
PostTrainBench: Can LLM Agents Automate LLM Post-Training?
Authors: Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko
First: 2026-03-09T17:18:00+00:00 · Latest: 2026-03-09T17:18:00+00:00
Abstract
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.
Summary / 总结
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities.
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers
Authors: Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert
First: 2025-05-26T14:19:29+00:00 · Latest: 2026-03-09T14:16:16+00:00
Abstract
Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
Summary / 总结
Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force.
Meta-RL with Shared Representations Enables Fast Adaptation in Energy Systems
Authors: Théo Zangato, Aomar Osmani, Pegah Alizadeh
First: 2026-03-09T14:15:51+00:00 · Latest: 2026-03-09T14:15:51+00:00
Comments: accepted at PAKDD 2026, Hong Kong
Abstract
Meta-Reinforcement Learning addresses the critical limitations of conventional Reinforcement Learning in multi-task and non-stationary environments by enabling fast policy adaptation and improved generalization. We introduce a novel Meta-RL framework that integrates a bi-level optimization scheme with a hybrid actor-critic architecture specially designed to enhance sample efficiency and inter-task adaptability. To improve knowledge transfer, we meta-learn a shared state feature extractor jointly optimized across actor and critic networks, providing efficient representation learning and limiting overfitting to individual tasks or dominant profiles. Additionally, we propose a parameter-sharing mechanism between the outer- and inner-loop actor networks, to reduce redundant learning and accelerate adaptation during task revisitation. The approach is validated on a real-world Building Energy Management Systems dataset covering nearly a decade of temporal and structural variability, for which we propose a task preparation method to promote generalization. Experiments demonstrate effective task adaptation and better performance compared to conventional RL and Meta-RL methods.
Summary / 总结
Meta-Reinforcement Learning addresses the critical limitations of conventional Reinforcement Learning in multi-task and non-stationary environments by enabling fast policy adaptation and improved generalization.
SPOT: An Annotated French Corpus and Benchmark for Detecting Critical Interventions in Online Conversations
Authors: Manon Berriche, Célia Nouri, Chloée Clavel, Jean-Philippe Cointet
First: 2025-11-10T18:54:40+00:00 · Latest: 2026-03-09T10:54:27+00:00
Abstract
We introduce SPOT (Stopping Points in Online Threads), the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task. Stopping points are ordinary critical interventions that pause or redirect online discussions through a range of forms (irony, subtle doubt or fragmentary arguments) that frameworks like counterspeech or social correction often overlook. We operationalize this concept as a binary classification task and provide reliable annotation guidelines. The corpus contains 43,305 manually annotated French Facebook comments linked to URLs flagged as false information by social media users, enriched with contextual metadata (article, post, parent comment, page or group, and source). We benchmark fine-tuned encoder models (CamemBERT) and instruction-tuned LLMs under various prompting strategies. Results show that fine-tuned encoders outperform prompted LLMs in F1 score by more than 10 percentage points, confirming the importance of supervised learning for emerging non-English social media tasks. Incorporating contextual metadata further improves encoder models F1 scores from 0.75 to 0.78. We release the anonymized dataset, along with the annotation guidelines and code in our code repository, to foster transparency and reproducible research.
Summary / 总结
We introduce SPOT (Stopping Points in Online Threads), the first annotated corpus translating the sociological concept of stopping point into a reproducible NLP task.
HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination
Authors: Tingting Chen, Beibei Lin, Zifeng Yuan, Qiran Zou, Hongyu He, Anirudh Goyal, Yew-Soon Ong, Dianbo Liu
First: 2025-10-17T13:00:32+00:00 · Latest: 2026-03-09T08:44:04+00:00
Abstract
As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations-not just a single correct answer-becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We introduce HypoSpace, a diagnostic suite that treats LLMs as samplers of finite hypothesis sets and measures three complementary indicators: Validity (precision of proposals consistent with observations), Uniqueness (non-redundancy among proposals), and Recovery (coverage of the enumerated admissible set). We instantiate HypoSpace in three structured domains with deterministic validators and exactly enumerated hypothesis spaces: (i) causal graphs from perturbations, (ii) gravity-constrained 3D voxel reconstruction from top-down projections, and (iii) Boolean genetic interactions. Across instruction-tuned and reasoning-focused models, Validity often remains high while Uniqueness and Recovery degrade as the admissible space grows, revealing mode collapse that is invisible to correctness-only metrics. HypoSpace offers a controlled probe-rather than a leaderboard-for methods that explicitly explore and cover admissible explanation spaces. Code is available at: https://github.com/CTT-Pavilion/_HypoSpace.
Summary / 总结
As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations-not just a single correct answer-becomes critical.
Stabilized Fine-Tuning with LoRA in Federated Learning: Mitigating the Side Effect of Client Size and Rank via the Scaling Factor
Authors: Jiayu Huang, Xiaohu Wu, Tiantian He, Qicheng Lao
First: 2026-03-09T07:49:56+00:00 · Latest: 2026-03-09T07:49:56+00:00
Abstract
Large Language Models (LLMs) are pivotal in natural language processing. The impracticality of full fine-tuning has prompted Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA), optimizing low-rank matrices A and B. In distributed scenarios where privacy constraints necessitate Federated Learning (FL), however, the integration of LoRA is often unstable. Specifically, we identify that aggregating updates from multiple clients introduces statistical variance that scales with the client count, causing gradient collapse when using high-rank adapters. Existing scaling factor candidates, such as the one used by Rank-Stabilized LoRA, ignore the interaction caused by the aggregation process. To bridge this gap, this paper introduces Stabilized Federated LoRA (SFed-LoRA), a framework that theoretically characterizes the interaction between adapter rank and federated aggregation. We derive an optimal scaling factor designed to effectively mitigate the aggregation error accumulating across N clients. By correcting the scaling mismatch inherent in previous approaches, SFed-LoRA restores the efficacy of high-rank adaptation without altering the original model architecture or increasing inference latency. Extensive experiments in diverse tasks, model architectures, and heterogeneous data distributions are conducted to validate our results. We demonstrate that SFed-LoRA prevents high-rank collapse, and achieves significantly improved stability and faster convergence compared with state-of-the-art baselines for high-rank adaptation.
Summary / 总结
Large Language Models (LLMs) are pivotal in natural language processing.
CeRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Authors: Hung-Hsuan Chen
First: 2026-02-26T11:55:25+00:00 · Latest: 2026-03-09T06:40:14+00:00
Abstract
Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical ``linear ceiling'' in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce CeRA (Capacity-enhanced Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, CeRA breaks this linear barrier: at rank 64 (PPL 3.89), it outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to mathematical reasoning, where CeRA achieves a perplexity of 1.97 on MathInstruct, significantly surpassing LoRA's saturation point of 2.07. Mechanism analysis via Singular Value Decomposition (SVD) confirms that CeRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.
Summary / 总结
Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT).
RadDiff: Retrieval-Augmented Denoising Diffusion for Protein Inverse Folding
Authors: Jin Han, Tianfan Fu, Wu-Jun Li
First: 2025-11-28T07:32:15+00:00 · Latest: 2026-03-09T04:52:28+00:00
Abstract
Protein inverse folding, the design of an amino acid sequence based on a target protein structure, is a fundamental problem of computational protein engineering. Existing methods either generate sequences without leveraging external knowledge or relying on protein language models~(PLMs). The former omits the knowledge stored in natural protein data, while the latter is parameter-inefficient and inflexible to adapt to ever-growing protein data. To overcome the above drawbacks, in this paper we propose a novel method, called $\underline{\text{r}}$etrieval-$\underline{\text{a}}$ugmented $\underline{\text{d}}$enoising $\underline{\text{diff}}$usion~($\mbox{RadDiff}$), for protein inverse folding. In RadDiff, a novel retrieval-augmentation mechanism is designed to capture the up-to-date protein knowledge. We further design a knowledge-aware diffusion model that integrates this protein knowledge into the diffusion process via a lightweight module. Experimental results on the CATH, TS50, and PDB2022 datasets show that $\mbox{RadDiff}$ consistently outperforms existing methods, improving sequence recovery rate by up to 19\%. Experimental results also demonstrate that RadDiff generates highly foldable sequences and scales effectively with database size.
Summary / 总结
Protein inverse folding, the design of an amino acid sequence based on a target protein structure, is a fundamental problem of computational protein engineering.
UniCast: A Unified Framework for Instance-Conditioned Multimodal Time-Series Forecasting
Authors: Sehyuk Park, Soyeon Caren Han, Eduard Hovy
First: 2025-08-16T07:33:27+00:00 · Latest: 2026-03-09T04:33:40+00:00
Abstract
Time series forecasting underpins applications in finance, healthcare, and environmental monitoring. Despite the success of Time Series Foundation Models (TSFMs), existing approaches operate in a unimodal setting and rely on static prompts or fixed fusion schemes, limiting their ability to exploit multimodal context and adapt to instance-level variation. We propose UniCast, a parameter-efficient multimodal framework that extends TSFMs through instance conditioned prompting and dynamic modality routing. UniCast infers a conditional prompt from time series, vision, and text inputs via a Transformer-based contextual distiller, enabling input-specific adaptation without updating the forecasting backbone. To regulate how auxiliary modalities influence predictions, UniCast employs Modality Routing, a cross-attention mechanism that estimates modality relevance given the current temporal state and selectively amplifies informative signals while suppressing noise. Integrated with a frozen TSFM via soft prompt tuning, UniCast preserves foundation-level generalization while enabling effective multimodal control. Extensive experiments across diverse forecasting benchmarks show that UniCast consistently outperforms all existing TSFM baselines, demonstrating that instance-conditioned multimodal control is critical for next-generation time series forecasting.
Summary / 总结
Time series forecasting underpins applications in finance, healthcare, and environmental monitoring.
Robust Transfer Learning with Side Information
Authors: Akram S. Awad, Shihab Ahmed, Yue Wang, George K. Atia
First: 2026-03-09T03:29:44+00:00 · Latest: 2026-03-09T03:29:44+00:00
Abstract
Robust Markov Decision Processes (MDPs) address environmental shift through distributionally robust optimization (DRO) by finding an optimal worst-case policy within an uncertainty set of transition kernels. However, standard DRO approaches require enlarging the uncertainty set under large shifts, which leads to overly conservative and pessimistic policies.
In this paper, we propose a framework for transfer under environment shift that derives a robust target-domain policy via estimate-centered uncertainty sets, constructed through constrained estimation that integrates limited target samples with side information about the source-target dynamics. The side information includes bounds on feature moments, distributional distances, and density ratios, yielding improved kernel estimates and tighter uncertainty sets.
The side information includes bounds on feature moments, distributional distances, and density ratios, yielding improved kernel estimates and tighter uncertainty sets.
Error bounds and convergence results are established for both robust and non-robust value functions. Moreover, we provide a finite-sample guarantee on the learned robust policy and analyze the robust sub-optimality gap. Under mild low-dimensional structure on the transition model, the side information reduces this gap and improves sample efficiency. We assess the performance of our approach across OpenAI Gym environments and classic control problems, consistently demonstrating superior target-domain performance over state-of-the-art robust and non-robust baselines.
Summary / 总结
Robust Markov Decision Processes (MDPs) address environmental shift through distributionally robust optimization (DRO) by finding an optimal worst-case policy within an uncertainty set of transition kernels.
LoFT: Low-Rank Adaptation That Behaves Like Full Fine-Tuning
Authors: Nurbek Tastan, Stefanos Laskaridis, Martin Takac, Karthik Nandakumar, Samuel Horvath
Venue: ICLR 2026
First: 2025-05-27T14:54:24+00:00 · Latest: 2026-03-09T00:00:10+00:00
Comments: Accepted to ICLR 2026
Abstract
Large pre-trained models are commonly adapted to downstream tasks using parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA), which injects small trainable low-rank matrices instead of updating all weights. While LoRA dramatically reduces trainable parameters with little overhead, it can still underperform full fine-tuning in accuracy and often converges more slowly. We introduce LoFT, a novel low-rank adaptation method that behaves like full fine-tuning by aligning the optimizer's internal dynamics with those of updating all model weights. LoFT not only learns weight updates in a low-rank subspace (like LoRA) but also properly projects the optimizer's first and second moments (Adam's momentum and variance) into the same subspace, mirroring full-model updates. By aligning the low-rank update itself with the full update, LoFT eliminates the need for tuning extra hyperparameters, e.g., the LoRA scaling factor $α$. Empirically, this approach substantially narrows the performance gap between adapter-based tuning and full fine-tuning and consistently outperforms standard LoRA-style methods, all without increasing inference cost.
Summary / 总结
Large pre-trained models are commonly adapted to downstream tasks using parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA), which injects small trainable low-rank matrices instead of updating all weights.
Transferable Optimization Network for Cross-Domain Image Reconstruction
Authors: Yunmei Chen, Chi Ding, Xiaojing Ye
First: 2026-03-08T22:20:03+00:00 · Latest: 2026-03-08T22:20:03+00:00
Comments: 30 pages, 7 figures
Abstract
We develop a novel transfer learning framework to tackle the challenge of limited training data in image reconstruction problems. The proposed framework consists of two training steps, both of which are formed as bi-level optimizations. In the first step, we train a powerful universal feature-extractor that is capable of learning important knowledge from large, heterogeneous data sets in various domains. In the second step, we train a task-specific domain-adapter for a new target domain or task with only a limited amount of data available for training. Then the composition of the adapter and the universal feature-extractor effectively explores feature which serve as an important component of image regularization for the new domains, and this leads to high-quality reconstruction despite the data limitation issue. We apply this framework to reconstruct under-sampled MR images with limited data by using a collection of diverse data samples from different domains, such as images of other anatomies, measurements of various sampling ratios, and even different image modalities, including natural images. Experimental results demonstrate a promising transfer learning capability of the proposed method.
Summary / 总结
We develop a novel transfer learning framework to tackle the challenge of limited training data in image reconstruction problems.
Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation
Authors: David Beauchemin, Richard Khoury
Venue: ICLR 2026
First: 2026-03-08T22:02:32+00:00 · Latest: 2026-03-08T22:02:32+00:00
Comments: Publish at the Advances in Financial AI: Towards Agentic and Responsible Systems Workshop @ ICLR 2026
Abstract
The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.
Summary / 总结
The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance.
Flow Matching Meets Biology and Life Science: A Survey
Authors: Zihao Li, Zhichen Zeng, Xiao Lin, Feihao Fang, Yanru Qu, Zhe Xu, Zhining Liu, Xuying Ning, Tianxin Wei, Ge Liu, Hanghang Tong, Jingrui He
First: 2025-07-23T17:44:29+00:00 · Latest: 2026-03-08T21:57:08+00:00
Comments: Nature Portfolio Journal Artificial Intelligence, 34 pages
Abstract
Over the past decade, advances in generative modeling, such as generative adversarial networks, masked autoencoders, and diffusion models, have significantly transformed biological research and discovery, enabling breakthroughs in molecule design, protein generation, catalysis discovery, drug discovery, and beyond. At the same time, biological applications have served as valuable testbeds for evaluating the capabilities of generative models. Recently, flow matching has emerged as a powerful and efficient alternative to diffusion-based generative modeling, with growing interest in its application to problems in biology and life sciences. This paper presents the first comprehensive survey of recent developments in flow matching and its applications in biological domains. We begin by systematically reviewing the foundations and variants of flow matching, and then categorize its applications into three major areas: biological sequence modeling, molecule generation and design, and peptide and protein generation. For each, we provide an in-depth review of recent progress. We also summarize commonly used datasets and software tools, and conclude with a discussion of potential future directions. The corresponding curated resources are available at https://github.com/Violet24K/Awesome-Flow-Matching-Meets-Biology.
Summary / 总结
Over the past decade, advances in generative modeling, such as generative adversarial networks, masked autoencoders, and diffusion models, have significantly transformed biological research and discovery, enabling breakthroughs in molecule design, protein generation, catalysis discovery, drug discovery, and beyond.
Test-Time Meta-Adaptation with Self-Synthesis
Authors: Zeyneb N. Kaya, Nick Rui
Venue: ICLR 2026
First: 2026-03-03T21:16:18+00:00 · Latest: 2026-03-08T21:53:33+00:00
Comments: 5 pages, 2 figures, 1 table. Accepted to AI with Recursive Self-Improvement (RSI) Workshop @ ICLR 2026
Abstract
As strong general reasoners, large language models (LLMs) encounter diverse domains and tasks, where the ability to adapt and self-improve at test time is valuable. We introduce MASS, a meta-learning framework that enables LLMs to self-adapt by generating problem-specific synthetic training data and performing targeted self-updates optimized for downstream performance at inference time. We train this behavior end-to-end via bilevel optimization: an inner loop adapts on self-generated examples while an outer loop meta-learns data-attribution signals and rewards post-update task performance. The synthetic data is optimized with scalable meta-gradients, backpropagating the downstream loss through the inner updates to reward useful generations. Experiments on mathematical reasoning show that MASS learns to synthesize per-instance curricula that yield effective, data-efficient test-time adaptation.
Summary / 总结
As strong general reasoners, large language models (LLMs) encounter diverse domains and tasks, where the ability to adapt and self-improve at test time is valuable.
A Survey of Large Language Models
Authors: Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, Ji-Rong Wen
First: 2023-03-31T17:28:46+00:00 · Latest: 2026-03-08T21:39:47+00:00
Comments: ongoing work; 144 pages, 1081 citations
Abstract
Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a significant challenge to develop capable AI algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various NLP tasks. Since researchers have found that model scaling can lead to performance improvement, they further study the scaling effect by increasing the model size to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement but also show some special abilities that are not present in small-scale language models. To discriminate the difference in parameter scale, the research community has coined the term large language models (LLM) for the PLMs of significant size. Recently, the research on LLMs has been largely advanced by both academia and industry, and a remarkable progress is the launch of ChatGPT, which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, which would revolutionize the way how we develop and use AI algorithms. In this survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Besides, we also summarize the available resources for developing LLMs and discuss the remaining issues for future directions.
Summary / 总结
Language is essentially a complex, intricate system of human expressions governed by grammatical rules.
TokMem: One-Token Procedural Memory for Large Language Models
Authors: Zijun Wu, Yongchang Hao, Lili Mou
Venue: ICLR 2026
First: 2025-10-01T02:51:58+00:00 · Latest: 2026-03-08T18:30:22+00:00
Comments: Accepted by ICLR 2026
Abstract
Large language models are typically controlled via prompts, which must be repeatedly re-processed for every new query and are difficult to reuse modularly. We introduce TokMem, a procedural memory framework that compiles each reusable task procedure into a single trainable memory token. Each token serves as both a procedure index and a generation control signal that steers generation, enabling targeted behaviors with constant-size overhead. TokMem keeps the backbone LLM frozen and stores procedural knowledge entirely in these dedicated units, so new procedures can be added continually without interfering with existing ones. We evaluate TokMem on two settings: atomic recall over 1,000 Super-Natural Instructions tasks and compositional recall on multi-step function-calling. Our results show that TokMem consistently outperforms retrieval-augmented prompting while avoiding repeated context overhead. Moreover, it matches or exceeds parameter-efficient fine-tuning with substantially fewer trainable parameters.
Summary / 总结
Large language models are typically controlled via prompts, which must be repeatedly re-processed for every new query and are difficult to reuse modularly.
Reverse Distillation: Consistently Scaling Protein Language Model Representations
Authors: Darius Catrina, Christian Bepler, Samuel Sledzieski, Rohit Singh
Venue: ICLR 2026
First: 2026-03-08T16:24:05+00:00 · Latest: 2026-03-08T16:24:05+00:00
Comments: Proceedings of ICLR 2026
Abstract
Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid-sized models often outperforming the largest in the family. We introduce Reverse Distillation, a principled framework that decomposes large PLM representations into orthogonal subspaces guided by smaller models of the same family. The resulting embeddings have a nested, Matryoshka-style structure: the first k dimensions of a larger model's embedding are exactly the representation from the smaller model. This ensures that larger reverse-distilled models consistently outperform smaller ones. A motivating intuition is that smaller models, constrained by capacity, preferentially encode broadly-shared protein features. Reverse distillation isolates these shared features and orthogonally extracts additional contributions from larger models, preventing interference between the two. On ProteinGym benchmarks, reverse-distilled ESM-2 variants outperform their respective baselines at the same embedding dimensionality, with the reverse-distilled 15 billion parameter model achieving the strongest performance. Our framework is generalizable to any model family where scaling challenges persist. Code and trained models are available at https://github.com/rohitsinghlab/plm_reverse_distillation.
Summary / 总结
Unlike the predictable scaling laws in natural language processing and computer vision, protein language models (PLMs) scale poorly: for many tasks, models within the same family plateau or even decrease in performance, with mid-sized models often outperforming the largest in the family.
Ready2Unlearn: A Learning-Time Approach for Preparing Models with Future Unlearning Readiness
Authors: Hanyu Duan, Yi Yang, Ahmed Abbasi, Kar Yan Tam
First: 2025-05-16T04:33:59+00:00 · Latest: 2026-03-08T14:02:49+00:00
Abstract
Machine unlearning is the process of removing the imprint left by specific data samples during the training of a machine learning model. AI developers, including those building personalized technologies, employ machine unlearning for various purposes such as privacy protection, security, and to address ethical concerns. This paper introduces Ready2Unlearn, a learning-time optimization approach designed to facilitate future unlearning processes. Unlike the majority of existing unlearning efforts that focus on designing unlearning algorithms, which are typically implemented reactively when an unlearning request is made during the model deployment phase, Ready2Unlearn shifts the focus to the training phase, adopting a "forward-looking" perspective. Building upon well-established meta-learning principles, Ready2Unlearn proactively trains machine learning models with unlearning readiness, such that they are well prepared and can handle future unlearning requests in a more efficient and principled manner. Ready2Unlearn is model-agnostic and compatible with any gradient ascent-based machine unlearning algorithms. We evaluate the method on both language and vision tasks under various unlearning settings, including class-wise unlearning and random data unlearning. Experimental results show that by incorporating such preparedness at training time, Ready2Unlearn produces an unlearning-ready model state, which offers several key advantages when future unlearning is requested. We hope this study inspires future research on proactive strategies for equipping machine learning models with built-in unlearning readiness, particularly in modern information systems that rely heavily on user data for recommendation, search, and personalized services, where privacy risks and data deletion demands are increasingly prevalent.
Summary / 总结
Machine unlearning is the process of removing the imprint left by specific data samples during the training of a machine learning model.
ForamDeepSlice: A High-Accuracy Deep Learning Framework for Foraminifera Species Classification from 2D Micro-CT Slices
Authors: Abdelghafour Halimi, Ali Alibrahim, Didier Barradas-Bautista, Ronell Sicat, Abdulkader M. Afifi
First: 2025-11-30T14:30:16+00:00 · Latest: 2026-03-08T13:04:12+00:00
Abstract
This study presents a comprehensive deep learning pipeline for the automated classification of foraminifera species using 2D micro-CT slices derived from 3D scans. We curated a scientifically rigorous dataset of 97 micro-CT scanned specimens spanning 27 species, from which we selected 12 representative species with sufficient specimen counts (at least four 3D models each) for robust classification. To ensure methodological integrity and prevent data leakage, we employed specimen-level data splitting, resulting in 109,617 high-quality 2D slices (44,103 for training, 14,046 for validation, and 51,468 for testing). We evaluated seven state-of-the-art 2D convolutional neural network (CNN) architectures using transfer learning. Our final ensemble model, ForamDeepSlice (FDS), combining ConvNeXt-Large and EfficientNetV2-Small, achieved a test accuracy of 95.64%, with a top-3 accuracy of 99.6% and an area under the ROC curve (AUC) of 0.998 across all species. To facilitate practical deployment, we developed an interactive advanced dashboard that supports real-time slice classification and 3D slice matching using advanced similarity metrics, including SSIM, NCC, and the Dice coefficient. This work establishes new benchmarks for AI-assisted micropaleontological identification and provides a fully reproducible framework for foraminifera classification research, bridging the gap between deep learning and applied geosciences.
Summary / 总结
This study presents a comprehensive deep learning pipeline for the automated classification of foraminifera species using 2D micro-CT slices derived from 3D scans.
MetaSort: An Accelerated Approach for Non-uniform Compression and Few-shot Classification of Neural Spike Waveforms
Authors: Luca M. Meyer, Majid Zamani
First: 2026-03-08T12:17:19+00:00 · Latest: 2026-03-08T12:17:19+00:00
Comments: 4 pages (under review paper)
Abstract
Many previous works in spike sorting study spike classification and compression independently. In this paper, a novel algorithm is proposed called MetaSort to address these two problems. To deal with compression, a novel adaptive level crossing algorithm is proposed to approximate spike shapes with high fidelity. Meanwhile, the latent feature representation is used to handle the classification problem. Besides, to guarantee MetaSort is robust and discriminative, the geometric information of data is exploited simultaneously in the proposed framework by meta-transfer learning. Empirical experiments with in-vivo spike data demonstrate that MetaSort delivers promising performance, highlighting its potential and motivating continued development toward an ultra-low-power, on-chip implementation.
Summary / 总结
Many previous works in spike sorting study spike classification and compression independently.
Adaptive Transfer Clustering: A Unified Framework
Authors: Yuqi Gu, Zhongyuan Lyu, Kaizheng Wang
First: 2024-10-28T17:57:06+00:00 · Latest: 2026-03-08T12:12:57+00:00
Comments: 72 pages
Abstract
We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects. The two datasets may reflect similar but different latent grouping structures of the subjects. We propose an adaptive transfer clustering (ATC) algorithm that automatically leverages the commonality in the presence of unknown discrepancy, by optimizing an estimated bias-variance decomposition. It applies to a broad class of statistical models including Gaussian mixture models, stochastic block models, and latent class models. A theoretical analysis proves the optimality of ATC under the Gaussian mixture model and explicitly quantifies the benefit of transfer. Extensive simulations and real data experiments confirm our method's effectiveness in various scenarios.
Summary / 总结
We propose a general transfer learning framework for clustering given a main dataset and an auxiliary one about the same subjects.
Revisiting the LiRA Membership Inference Attack Under Realistic Assumptions
Authors: Najeeb Jebreel, Mona Khalil, David Sánchez, Josep Domingo-Ferrer
First: 2026-03-08T10:03:29+00:00 · Latest: 2026-03-08T10:03:29+00:00
Comments: Accepted to PoPETs 2026(3)
Abstract
Membership inference attacks (MIAs) have become the standard tool for evaluating privacy leakage in machine learning (ML). Among them, the Likelihood-Ratio Attack (LiRA) is widely regarded as the state of the art when sufficient shadow models are available. However, prior evaluations have often overstated the effectiveness of LiRA by attacking models overconfident on their training samples, calibrating thresholds on target data, assuming balanced membership priors, and/or overlooking attack reproducibility. We re-evaluate LiRA under a realistic protocol that (i) trains models using anti-overfitting (AOF) and transfer learning (TL), when applicable, to reduce overconfidence as in production models; (ii) calibrates decision thresholds using shadow models and data rather than target data; (iii) measures positive predictive value (PPV, or precision) under shadow-based thresholds and skewed membership priors (pi <= 10%); and (iv) quantifies per-sample membership reproducibility across different seeds and training variations. We find that AOF significantly weakens LiRA, while TL further reduces attack effectiveness while improving model accuracy. Under shadow-based thresholds and skewed priors, LiRA's PPV often drops substantially, especially under AOF or AOF+TL. We also find that thresholded vulnerable sets at extremely low FPR show poor reproducibility across runs, while likelihood-ratio rankings are more stable. These results suggest that LiRA, and likely weaker MIAs, are less effective than previously suggested under realistic conditions, and that reliable privacy auditing requires evaluation protocols that reflect practical training practices, feasible attacker assumptions, and reproducibility considerations. Code is available at https://github.com/najeebjebreel/lira_analysis.
Summary / 总结
Membership inference attacks (MIAs) have become the standard tool for evaluating privacy leakage in machine learning (ML).
EFT-CoT: A Multi-Agent Chain-of-Thought Framework for Emotion-Focused Therapy
Authors: Lanqing Du, Yunong Li, YuJie Long, Shihong Chen
First: 2026-01-25T13:53:15+00:00 · Latest: 2026-03-08T09:12:02+00:00
Abstract
The use of large language models (LLMs) for Mental Health Question Answering (MHQA) offers a promising way to alleviate shortages in mental health resources. However, prior work has mainly relied on Cognitive Behavioral Therapy (CBT) and predominantly follows a top-down strategy centered on rational cognitive restructuring, providing limited support for embodied experience and primary emotion processing. To address this gap, we propose EFT-CoT, a multi-agent chain-of-thought framework grounded in Emotion-Focused Therapy (EFT). EFT-CoT operationalizes intervention as a three-stage workflow: Embodied Perception, Cognitive Exploration, and Narrative Intervention. The framework employs eight specialized agents to model key processes including somatic awareness mapping, adaptive evaluation, core belief extraction, and narrative restructuring. Based on this framework, we construct EFT-Instruct, a high-quality instruction-tuning dataset built from process-level augmentation of about 67,000 real help-seeking texts, and further fine-tune a dedicated model, EFT-LLM. Experiments show that EFT-LLM consistently outperforms strong baselines and human responses in empathic depth and structural professionalism. Ablation studies further verify the contribution of key mechanisms, while white-box auditing demonstrates the consistency and traceability of critical intermediate states. Overall, this work provides a reproducible framework-data-model pipeline for embedding EFT mechanisms into LLM-based mental health support.
Summary / 总结
The use of large language models (LLMs) for Mental Health Question Answering (MHQA) offers a promising way to alleviate shortages in mental health resources.
Adaptive Capacity Allocation for Vision Language Action Fine-tuning
Authors: Donghoon Kim, Minji Bae, Unghui Nam, Gyeonghun Kim, Suyun Lee, Kyuhong Shim, Byonghyo Shim
Venue: ICRA 2026
First: 2026-03-08T01:33:01+00:00 · Latest: 2026-03-08T01:33:01+00:00
Comments: ICRA 2026 (Official Code: https://github.com/dhkim-furiosa/LoRA-SP)
Abstract
Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation. Parameter-efficient fine-tuning (PEFT), especially LoRA, is common for VLA policies, yet the exposed capacity knob, the rank, does not transfer uniformly: robotics transfer exhibits a higher and task-varying intrinsic rank than language fine-tuning. Small ranks suffice for LLMs (e.g., $r \in \{4, 8\}$), while spectral analyses indicate VLAs may require much larger ranks (e.g., $r \approx 128$) or near-full rank, a mismatch that worsens in multi-task settings. We present LoRA-SP (Select-Prune), a rank-adaptive fine-tuning method that replaces fixed-rank updates with input- and layer-wise capacity. LoRA-SP uses an SVD-style parameterization with a small router whose nonnegative scores act as singular values over a shared vector bank. The active set is chosen by an energy target on the cumulative squared scores $E(k) \ge η$, providing a direct link to approximation error via our spectral analysis. During training, $η$ concentrates energy on a few directions and teaches the router to rely on fewer vectors while preserving accuracy. This yields compact adapters that reduce cross-task interference and improve generalization. On four real-robot manipulation tasks collected on an unseen AgileX PiPER arm, across two VLA backbones ($π_0$ and SmolVLA), LoRA-SP matches or exceeds full fine-tuning with far fewer trainable parameters, and improves multi-task success by up to 31.6% over standard LoRA while remaining robust to rank choice.
Summary / 总结
Vision language action models (VLAs) are increasingly used for Physical AI, but deploying a pre-trained VLA model to unseen environments, embodiments, or tasks still requires adaptation.
X-MethaneWet: A Cross-scale Global Wetland Methane Emission Benchmark Dataset for Advancing Science Discovery with AI
Authors: Yiming Sun, Shuo Chen, Shengyu Chen, Chonghao Qiu, Licheng Liu, Youmi Oh, Sparkle L. Malone, Gavin McNicol, Qianlai Zhuang, Chris Smith, Yiqun Xie, Xiaowei Jia
First: 2025-05-23T20:24:09+00:00 · Latest: 2026-03-07T18:22:10+00:00
Abstract
Methane (CH$_4$) is the second most powerful greenhouse gas after carbon dioxide and plays a crucial role in climate change due to its high global warming potential. Accurately modeling CH$_4$ fluxes across the globe and at fine temporal scales is essential for understanding its spatial and temporal variability and developing effective mitigation strategies. In this work, we introduce the first-of-its-kind cross-scale global wetland methane benchmark dataset (X-MethaneWet), which synthesizes physics-based model simulation data from TEM-MDM and the real-world observation data from FLUXNET-CH$_4$. This dataset can offer opportunities for improving global wetland CH$_4$ modeling and science discovery with new AI algorithms. To set up AI model baselines for methane flux prediction, we evaluate the performance of various sequential deep learning models on X-MethaneWet. Furthermore, we explore four different transfer learning techniques to leverage simulated data from TEM-MDM to improve the generalization of deep learning models on real-world FLUXNET-CH$_4$ observations. Our extensive experiments demonstrate the effectiveness of these approaches, highlighting their potential for advancing methane emission modeling and identifying new opportunities for developing more accurate and scalable AI-driven climate models.
Summary / 总结
Methane (CH$_4$) is the second most powerful greenhouse gas after carbon dioxide and plays a crucial role in climate change due to its high global warming potential.
MAviS: A Multimodal Conversational Assistant For Avian Species
Authors: Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shabzan Khan, Rao Anwer, Salman Khan, Hisham Cholakkal
Venue: EMNLP 2025
First: 2026-03-07T17:30:37+00:00 · Latest: 2026-03-07T17:30:37+00:00
Comments: EMNLP 2025
Abstract
Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.
Summary / 总结
Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring.