AI4Science 论文速递

Snapshot: 20260530_0507

OOD-GraphLLM: Graph Large Language Model for Out-of-Distribution Generalized Drug Synergy Prediction

Authors: Xin Wang, Linxin Xiao, Yang Yao, Wenwu Zhu

Venue: KDD 2026

First: 2026-05-28T17:12:48+00:00 · Latest: 2026-05-28T17:12:48+00:00

Comments: 12 pages, 9 figures, ACM KDD 2026

Abs · PDF · Code1 · Code2 · Code3 · Project1

Abstract

Drug synergy prediction (DSP) aims to identify efficacious drug combinations under various cellular contexts with different targets. However, the continual emergence of novel compounds results in variations in molecular scaffolds and sizes, causing drug synergy data to exhibit out-of-distribution (O.O.D.) shifts with respect to topological structure. Existing works rely on in-distribution (I.D.) assumption, failing to handle the O.O.D. shifts. To solve this problem, we study out-of-distribution generalized drug synergy prediction through a graph large language model for the first time. Nevertheless, O.O.D. generalized DSP is highly non-trivial, posing several challenges: i) how to discover structurally relevant and irrelevant molecular representations with respect to cell targets; ii) how to find the optimal graph neural architectures that accurately calculate molecular representations; and iii) how to jointly leverage molecular structural and semantic information in LLMs. To address these challenges, we propose OOD-GraphLLM, a novel graphLLM framework which is able to accurately predict drug synergy under O.O.D. settings via jointly optimizing molecular graph representation and biomedical semantic language representations in a unified manner. Furthermore, we finetune DrugSyn-LLM, a biomedical LLM, and employ a retrieval-augmented biomedical instruction tuning strategy to align molecular topological information and molecular semantic information with language-based reasoning for O.O.D. generalized DSP. Both the source code (https://github.com/EkkoXiao/Bio-GraphLLM) and released model (https://mn.cs.tsinghua.edu.cn/bio-graphllm/) are publicly available, where users are allowed to download model resources and interactively use the system through a web interface.

Summary / 总结

Drug synergy prediction (DSP) aims to identify efficacious drug combinations under various cellular contexts with different targets.

How's it going? Reinforcement learning in language models recruits a functional welfare axis

Authors: Andy Q Han, David J. Chalmers, Pavel Izmailov

First: 2026-05-28T17:03:18+00:00 · Latest: 2026-05-28T17:03:18+00:00

Comments: 81 pages, 43 figures, 32 tables

Abs · PDF · Code1 · Code2

Abstract

How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.

Summary / 总结

How does reinforcement learning shape a language model's internal representations?

HD-Prot: A Protein Language Model for Joint Sequence-Structure Modeling with Continuous Structure Tokens

Authors: Yi Zhou, Haohao Qu, Yunqing Liu, Shanru Lin, Le Song, Wenqi Fan

Venue: KDD 2026 long

First: 2025-12-17T06:46:27+00:00 · Latest: 2026-05-28T16:10:55+00:00

Comments: This is the long version of the corresponding paper to appear at KDD 2026

Abs · PDF · Code1 · Code2

Abstract

Proteins inherently possess a consistent sequence-structure duality. The abundance of protein sequence data, which can be readily represented as discrete tokens, has driven fruitful developments in protein language models (pLMs). A key remaining challenge, however, is how to effectively integrate continuous structural knowledge into pLMs. Current methods often discretize protein structures to accommodate the language modeling framework, which inevitably results in the loss of fine-grained information and limits the performance potential of multimodal pLMs. In this paper, we argue that such concerns can be circumvented: a sequence-based pLM can be extended to incorporate the structure modality through continuous tokens, i.e., high-fidelity protein structure latents that avoid vector quantization. Specifically, we propose a hybrid diffusion protein language model, HD-Prot, which embeds a continuous-valued diffusion head atop a discrete pLM, enabling seamless operation with both discrete and continuous tokens for joint sequence-structure modeling. It captures inter-token dependencies across modalities through a unified absorbing diffusion process, and estimates per-token distributions via categorical prediction for sequences and continuous diffusion for structures. Extensive results demonstrate that HD-Prot achieves competitive performance in unconditional sequence-structure co-generation, motif-scaffolding, protein structure prediction, and inverse folding tasks. Furthermore, our method can perform on par with state-of-the-art multimodal pLMs, despite being developed under limited computational resources (i.e., less than one-tenth the budget for modality extension fine-tuning). It highlights the viability of simultaneously estimating categorical and continuous distributions within a unified language model architecture, offering a promising alternative direction for multimodal pLMs.

Summary / 总结

Proteins inherently possess a consistent sequence-structure duality.

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Authors: Tong Ye, Hang Yu, Tengfei Ma, Xuhong Zhang, Jianguo Li, Peng Di, Peiyu Liu, Jianwei Yin, Wenhai Wang

Venue: KDD 2026

First: 2026-05-28T14:57:02+00:00 · Latest: 2026-05-28T14:57:02+00:00

Comments: Accepted by KDD 2026

Abs · PDF · Code1 · Code2

Abstract

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63\% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.

Summary / 总结

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data.

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

Authors: Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

Venue: ICML 2026

First: 2026-05-28T14:57:01+00:00 · Latest: 2026-05-28T14:57:01+00:00

Comments: ICML 2026, Project page: https://jaayeon.github.io/AGSM

Abs · PDF · Code1 · Code2 · Project1

Abstract

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

Summary / 总结

Diffusion models generate highly realistic images but often struggle with precise text-image alignment.

Towards Understanding the Shape of Representations in Protein Language Models

Authors: Kosio Beshkov, Anders Malthe-Sørenssen

Venue: ICLR 2026 poster

First: 2025-09-29T15:06:24+00:00 · Latest: 2026-05-28T13:38:35+00:00

Comments: Accepted as a poster at ICLR 2026. OpenReview: https://openreview.net/forum?id=Dnn8SSBJaY

Abs · PDF · Code1 · Code2

Abstract

While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square-root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other. We analyze different types of proteins from the SCOP dataset and show that the Karcher mean and effective dimension of the SRV shape space follow a non-linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between residues, but start to degrade for larger context lengths. The most structurally faithful encoding tends to occur close to, but before the last layer of the models, indicating that training a folding model ontop of these layers might lead to improved folding performance.

Summary / 总结

While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations is yet to be fully understood.

BuilDyn: Excitation-Driven Data Generation for Building Thermal Dynamics Modeling and Control

Authors: Felix Koch, Thomas Krug, Fabian Raisch, Benjamin Schäfer, Benjamin Tischler

First: 2026-05-28T12:30:50+00:00 · Latest: 2026-05-28T12:30:50+00:00

Abs · PDF · Code1 · Code2

Abstract

Machine learning (ML) is increasingly used for data-driven modeling of buildings to enable downstream tasks such as fault detection and diagnosis, and energy-efficient control. While recent work improves generalization across building characteristics, weather, and occupancy, generalization also depends on sufficient exploration of the control-driven system state space. Existing real-world datasets and simulation environments predominantly reflect stationary operation under fixed control policies, resulting in limited excitation and reduced robustness to unseen operating conditions. This paper introduces BuilDyn, a package based on BuilDa that enables customizable excitation strategies for control-oriented data generation. BuilDyn further supports sampling from representative building distributions and provides a Python interface for easy integration into machine learning pipelines. We demonstrate the benefits of BuilDyn by comparing the performance of data-driven ML models trained on non-excited and excited data for one building. With BuilDyn, we hope to advance scalable control-oriented modeling and support future directions such as transfer learning and building-specific foundation models.

Summary / 总结

Machine learning (ML) is increasingly used for data-driven modeling of buildings to enable downstream tasks such as fault detection and diagnosis, and energy-efficient control.

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

Authors: Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

Venue: ICML 2026

First: 2026-02-20T04:06:07+00:00 · Latest: 2026-05-28T12:11:44+00:00

Comments: Accepted to ICML 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints, and datasets are available at https://github.com/liuzhan22/JAEGER.

Summary / 总结

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio.

The Impact of Semantic Pairs on Self-Supervised Representation Learning

Authors: Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

First: 2025-10-09T18:31:55+00:00 · Latest: 2026-05-28T11:57:44+00:00

Comments: 19 pages, 7 figures, 5 tables

Abs · PDF · Code1 · Code2

Abstract

Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs. While this encourages invariance to handcrafted transformations, same-image positives can preserve nuisance correlations such as background, texture, illumination, and object-specific details. Semantic positive pairs, i.e., different same-class instances, may reduce these correlations by presenting objects across diverse contexts. However, previous studies often combine semantic pairs with augmented positives or false neighbors (i.e., incorrectly mapped semantic pairs), making it difficult to isolate the effect of semantic pairing. We present a controlled empirical study of semantic positive pairs for self-supervised representation learning. From ImageNet-1K, we construct two matched subsets: an augmented-pair baseline and a manually curated semantic-pair dataset with the same class composition and training-pair count. We use these datasets to compare representative contrastive and non-contrastive SSL methods under matched training conditions. Across transfer learning and object detection evaluations, semantic-pair pretraining consistently improves generalisation over augmented-pair pretraining. Additional ablations show that semantic pairs induce invariances beyond the standard transformation pipeline. Among the evaluated methods, contrastive learning benefits most strongly from semantic pairs, with SimCLR showing the largest relative improvement. These results clarify the role of semantic positive pairs in SSL and provide guidance for selecting and designing frameworks that can exploit semantic pair information effectively

Summary / 总结

Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs.

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

Authors: Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera

First: 2026-05-28T11:44:32+00:00 · Latest: 2026-05-28T11:44:32+00:00

Abs · PDF · Code1 · Code2

Abstract

Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches such as few-shot prompting, instruction tuning, and synthetic data generation, continue to treat labeled or pseudo-labeled data as the primary learning signal. In contrast, human practitioners acquire expertise through repeated, self-directed interaction with the open web, progressively refining both domain knowledge and search strategies. We propose MEMENTO, a framework that treats the web as a learning signal rather than a stateless retrieval interface. MEMENTO operates at two levels: within each session, it conducts iterative web exploration via an Adaptive Exploration Tree (AET) that decomposes tasks into evolving questions and reflects on intermediate findings; across sessions, it accumulates experience through dual-channel memory, separating declarative knowledge (facts) from procedural knowledge (search strategies). This design enables agents to learn reusable research strategies and domain expertise from trajectories of web interaction without additional model training. We evaluate MEMENTO on two low-data professional domains: sales automation and legal research. Our empirical results show consistent improvements in performance over ReAct based baselines (+25.6% on sales automation and 36.5% on legal research), demonstrating that the web can serve as a scalable learning source for acquiring task-specific expertise in data-scarce settings.

Summary / 总结

Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes.

PRIM: Meta-Learned Bayesian Root Cause Analysis

Authors: Christopher Lohse, Anish Dhir, Amadou Ba, Bradley Eck, Marco Ruffini, Jonas Wahl

First: 2026-05-09T08:14:09+00:00 · Latest: 2026-05-28T10:44:26+00:00

Abs · PDF · Code1 · Code2

Abstract

Root cause analysis (RCA) in complex systems is challenging due to error propagation across multiple variables, the need for structural causal knowledge, and the computational cost of inference at test time. We introduce PRIM (Prior-fitted Root cause Identification with Meta-learning), a causal meta-learning approach that frames RCA as a Bayesian inference task over a synthetic prior of causal models. By marginalising out structural uncertainty, PRIM implicitly identifies changes in the data-generating mechanism between baseline and anomalous periods. In doing so, PRIM infers distributional differences without explicit statistical testing, and implicitly learns causal structure without model fitting at test time. Following the simulation-based meta-learning paradigm of prior-fitted networks, PRIM uses a Model-Averaged Causal Estimation (MACE) transformer neural process that jointly attends over observational and anomalous samples and the causal structure of nodes, enabling zero-shot inference in 17,ms for systems with up to 100 variables. Across synthetic benchmarks and two realistic benchmark datasets, PetShop and CausRCA, PRIM is competitive with methods that are aware of the system's causal graphical structure a priori while outperforming graph-unaware methods on several tasks. Lightweight fine-tuning to specific domains and data dynamics improves performance further.

Summary / 总结

Root cause analysis (RCA) in complex systems is challenging due to error propagation across multiple variables, the need for structural causal knowledge, and the computational cost of inference at test time.

Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

Authors: Shadmehr Zaregarizi, Khashayar Yavari

First: 2026-05-28T10:28:05+00:00 · Latest: 2026-05-28T10:28:05+00:00

Comments: 5 pages, 3 figures, 2 tables. Accepted at BALANCES'26 (6th ACM International Workshop on Big Data and Machine Learning for Smart Buildings and Cities), Banff, Alberta, Canada, June 22, 2026. This is the author's accepted manuscript; final published version DOI will be activated after June 22, 2026

Abs · PDF · Code1 · Code2

Abstract

Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain data and honest uncertainty estimates. We present an uncertainty-aware transfer learning (TL) framework for cross-building energy forecasting based on the Temporal Fusion Transformer (TFT), evaluated on a newly released high-resolution real sub-meter dataset: an educational building at Aalborg University, Denmark (source) and the multi-typology NEST building at EMPA, Switzerland (target). We introduce the Transfer Robustness Index (TRI), an architecture-agnostic metric for quantifying generalization quality across domain gaps. A four-strategy layer-freezing ablation shows that Probe-Only fine-tuning, updating only 455 output-layer parameters out of 806K, achieves the best transfer quality (TRI = 3,097), outperforming full fine-tuning and suggesting that TFT encoders learn transferable temporal representations. Monte Carlo Dropout yields a prediction interval coverage probability of 93.2%, close to the nominal 95% target. A data-scarcity analysis further shows monotonic improvement with increasing target-domain data, providing practical guidance for district energy deployment.

Summary / 总结

Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain data and honest uncertainty estimates.

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

Authors: Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang

First: 2026-05-28T10:13:50+00:00 · Latest: 2026-05-28T10:13:50+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive generative paradigm. Given the prohibitive computational cost of full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach. However, existing PEFT methods (e.g., LoRA), originally tailored for autoregressive models, rely on static parameters that are agnostic to the noise level. Consequently, they ignore the intrinsic dynamics of the diffusion process, where input distributions and generation difficulty shift significantly along the denoising trajectory, rendering them suboptimal for dLLMs. To address this, we propose Noise-aware Low-Rank Adaptation (NaRA), which introduces a low-rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. This design enables the update matrices to vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We provide a theoretical justification for the proposed NaRA framework and empirically demonstrate consistent improvements over noise-agnostic baselines across commonsense reasoning, mathematical reasoning, and code generation benchmarks. Our code is available at https://github.com/generaldi/NaRA.

Summary / 总结

Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive generative paradigm.

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

Authors: Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng

First: 2026-05-28T09:14:37+00:00 · Latest: 2026-05-28T09:14:37+00:00

Abs · PDF · Code1 · Code2

Abstract

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.

Summary / 总结

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma.

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

Authors: Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He

First: 2026-05-28T08:33:43+00:00 · Latest: 2026-05-28T08:33:43+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. The code is publicly available at https://github.com/ReedOnePeck/Mind-Omni.

Summary / 总结

Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs).

On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference

Authors: Daniel Dold, Emanuel Sommer, Julius Kobialka, Oliver Dürr, David Rügamer

First: 2026-05-28T08:24:45+00:00 · Latest: 2026-05-28T08:24:45+00:00

Abs · PDF · Code1 · Code2

Abstract

While parameter-efficient fine-tuning methods like low-rank adaptation (LoRA) are standard for large language models, principled estimation of epistemic uncertainty remains challenging. Recent results in the LoRA regime suggest that discrete multi-mode approaches such as deep ensembles offer little benefit over single-mode methods. This contradicts broader observations in deep learning, where ensembling independent optima typically improves generalization, and linking these modes through continuous low-loss valleys further enhances Bayesian model averaging (BMA). Whether such structure exists in the LoRA space and whether it yields functional diversity missed by local or discrete methods has not been studied. We introduce LoRA-Curve, a segmented Bézier curve parameterization in the LoRA space, with two variants: a free configuration that jointly optimizes all control points, and an anchored configuration that connects independently fine-tuned LoRA optima. We prove pathwise continuity and Lipschitz regularity of the loss along the curve and empirically show, across reasoning and classification benchmarks with Qwen2.5 7B, that linear interpolation encounters loss barriers, while our anchored multi-segment curves connect independent optima through continuous low-loss valleys. Combined with flat-minima perturbations and a Jensen-Shannon divergence regularizer, LoRA-Curve yields measurably higher mutual information of the predictive distribution without sacrificing performance, and links continuous parameter-space traversal to functional diversity.

Summary / 总结

While parameter-efficient fine-tuning methods like low-rank adaptation (LoRA) are standard for large language models, principled estimation of epistemic uncertainty remains challenging.

On-Policy Replay for Continual Supervised Fine-Tuning

Authors: Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen, Jiaqi Huang, Dongyang Xu, Yizhi Wang

First: 2026-05-28T07:19:47+00:00 · Latest: 2026-05-28T07:19:47+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-policy signals -- training on the model's own outputs -- reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher.We instead route the on-policy signal through the training data source. Our method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7--8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget -- a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42--46% reductions observed across all three backbones. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low-score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on-policy distribution, not the response quality alone.Our code is available at https://github.com/Yancey2024/OnPolicyReplay.

Summary / 总结

Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities.

Cross-Chirality Generalization by Axial Vectors for Hetero-Chiral Protein-Peptide Interaction Design

Authors: Ziyi Yang, Zitong Tian, Yinjun Jia, Tianyi Zhang, Jiqing Zheng, Hao Wang, Yubu Su, Juncai He, Lei Liu, Yanyan Lan

Venue: ICML 2026

First: 2026-02-13T02:46:29+00:00 · Latest: 2026-05-28T07:16:41+00:00

Comments: This version (v2) includes minor edits. The paper has been accepted to ICML 2026. Codes are available at https://github.com/YZY010418/PepMirror

Abs · PDF · Code1 · Code2 · Code3

Abstract

D-peptide binders targeting L-proteins have promising therapeutic potential. Despite rapid advances in machine learning-based target-conditioned peptide design, generating D-peptide binders remains largely unexplored. In this work, we show that by injecting axial features to $E(3)$-equivariant (polar) vector features, it is feasible to achieve cross-chirality generalization from homo-chiral (L--L) training data to hetero-chiral (D--L) design tasks. By implementing this method within a latent diffusion model, we achieved D-peptide binder design that not only outperforms existing tools in \textit{in silico} benchmarks, but also demonstrates efficacy in wet-lab validation. To our knowledge, our approach represents the first wet-lab validated generative AI for the \textit{de novo} design of D-peptide binders, offering new perspectives on handling chirality in protein design. Codes are available at https://github.com/YZY010418/PepMirror .

Summary / 总结

D-peptide binders targeting L-proteins have promising therapeutic potential.

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

Authors: Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu

Venue: ACL 2026

First: 2025-07-13T10:52:59+00:00 · Latest: 2026-05-28T07:00:50+00:00

Comments: Findings of ACL 2026

Abs · PDF · Code1 · Code2 · Code3

Abstract

Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

Summary / 总结

Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation.

Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

Authors: Yuheng Lei, Sitong Mao, Shunbo Zhou, Hongyuan Zhang, Xuelong Li, Ping Luo

First: 2025-06-06T11:13:04+00:00 · Latest: 2026-05-28T06:41:48+00:00

Comments: Accepted to Transactions on Machine Learning Research (TMLR) at https://openreview.net/forum?id=MHVBrjS8cG . Code is available at https://github.com/HarryLui98/DMPEL

Abs · PDF · Code1 · Code2 · Code3

Abstract

A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.

Summary / 总结

A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting.

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

Authors: Shunta Asano, Jeonghun Baek, Toshihiko Yamasaki

First: 2026-05-28T06:03:52+00:00 · Latest: 2026-05-28T06:03:52+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However, existing studies primarily focus on bilingual transfer between English and a target language, leaving multilingual settings involving three or more languages largely unexplored. In this work, we investigate multilingual code-switching instruction tuning across four languages: English, Japanese, Korean, and Chinese. We evaluate multilingual understanding on Belebele. Our experiments show that simple sentence-level multilingual CSD consistently improves average multilingual performance across all four languages, indicating that multilingual code-switching can be effective beyond bilingual transfer settings.

Summary / 总结

Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs).

Soro: A Lightweight Foundation Model and Chatbot for Tajik

Authors: Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shoymardonov, Shuhratjon Khalilbekov, Bonu Boboeva

First: 2026-04-09T15:04:32+00:00 · Latest: 2026-05-28T05:13:49+00:00

Abs · PDF · Code1 · Code2

Abstract

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

Summary / 总结

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan.

FoRA: Fisher-orthogonal Rank Adaptation for Parameter-Efficient Fine-Tuning

Authors: Juneyoung Park, Seongbae Lee, Han-Sang Lee, Kyuho Lee, Minjae Kim, Seungheon Hyeon, Kiduk Kwon, Seongwan Kim, Jaeho Lee

Venue: EMNLP 2026

First: 2026-05-28T03:47:00+00:00 · Latest: 2026-05-28T03:47:00+00:00

Comments: EMNLP 2026

Abs · PDF · Code1 · Code2

Abstract

Parameter-efficient fine-tuning(PEFT) has largely focused on LoRA and its accuracy-oriented variants, leaving the original goal of reducing trainable parameters has receivedcomparatively little attention. We introduce FoRA, which revisits this goal by reducing the number of adapted layers rather than adapter rank. FoRA selects task-informative layers via a single-pass diagonal Fisher score (under 1% of training cost) and trains the LoRA down-projection at selected layers on the Stiefel manifold, preserving column orthonormality and effective rank. FoRA consistently outperforms LoRA and DoRA at half their parameter budget, and falls within 0.7-0.8 accuracy points of AdaLoRA at one-quarter its parameter count, across five LLaMA-family backbones. Cross-architecture experiments on twelve backbones from the LLaMA, Qwen3, and Gemma families confirm consistent gains from 270M to 32B parameters. The two components combine super-additively: Fisher selection alone matches rank reduction at the same budget, while the Stiefel constraint provides the decisive additional gain.

Summary / 总结

Parameter-efficient fine-tuning(PEFT) has largely focused on LoRA and its accuracy-oriented variants, leaving the original goal of reducing trainable parameters has receivedcomparatively little attention.

An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning

Authors: Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

First: 2025-11-13T23:52:57+00:00 · Latest: 2026-05-28T02:27:32+00:00

Comments: Accepted to scientific reports. The title was revised during the peer review process

Abs · PDF · Code1 · Code2

Abstract

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks. In such scenarios, fine-tuning the pre-trained CNN is difficult due to data scarcity, necessitating the use of fixed weights. However, when the weights are kept fixed, many filters that do not contribute to the target task remain in the model, leading to unnecessary redundancy and reduced efficiency. Therefore, effective methods are needed to reduce model size by pruning filters that are unnecessary for inference. To address this, approaches utilizing Layer-wise Relevance Propagation (LRP) have been proposed. LRP quantifies the contribution of each filter to the inference result, enabling the pruning of filters with low relevance. However, existing LRP-based pruning methods have been observed to cause cascading accuracy degradation. In this study, we introduce an accuracy-aware pruning control mechanism for existing LRP-based filter pruning methods, which suppresses cascading accuracy degradation by dynamically adjusting the pruning rate and the pruning order using the harmonic mean of class accuracy, and compresses the pre-trained model while preserving task-specific performance in a small-data environment. We demonstrate that this control mechanism effectively mitigates cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods, improving the class-averaged area under the accuracy-pruning-rate curve (AUC) of VGG16 by approximately 15\% over conventional LRP-based approaches.

Summary / 总结

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks.

Advancing multi-site emission control: A physics-informed transfer learning framework with mixture of experts for carbon-pollutant synergy

Authors: Yuxuan Ying, Hanqing Yang, Kaige Wang, Yu Hu, Zhiming Zheng, Yunliang Jiang, Xiaoqing Lin, Xiaodong Li, Jun Chen

First: 2026-04-29T11:54:18+00:00 · Latest: 2026-05-28T01:31:25+00:00

Comments: Supplementary materials will be released after the final version is finalized

Abs · PDF · Code1 · Code2

Abstract

Municipal solid waste incineration (MSWI) converts urban waste to energy but simultaneously emits carbon dioxide, carbon monoxide and multiple regulated air pollutants whose formation is tightly coupled within a single combustion system. Controlling these emissions across a network of diverse facilities poses a fundamentally different challenge from optimising a single plant: data-driven models trained at one site capture local statistical patterns that rarely survive transfer to another, because they lack the physical constraints and regime-level structure needed to generalise. Here we show that shared emission-control relationships can be identified across heterogeneous MSWI plants when physical conservation laws, operating-regime heterogeneity and carbon-pollutant coupling are treated jointly. We develop a carbon-pollutant mixture-of-experts (CPMoE) model that routes process observations through regime-specific expert networks under conservation-based regularisation, and combine it with physics-informed transfer learning to adapt a reference model to new facilities. Across 13 plants, CPMoE predicts six major pollutants and a composite system-level risk index with source-domain R2 of 0.668-0.904 and 0.666-0.970, respectively; after transfer to 12 target plants these values remain 0.661-0.842 and 0.610-0.841. Expert-utilisation patterns show that adaptation proceeds through structured regime re-weighting rather than re-learning from scratch. Embedding the transferred model in an offline digital twin and screening candidate operating adjustments against historical process records yields consistent risk-index reductions of 3.6-6.3% with simultaneous pollutant co-reductions in 94-100% of evaluated samples. These findings suggest a practical route toward transferable, system-level decision support for carbon-pollutant co-control in heterogeneous waste-to-energy networks.

Summary / 总结

Municipal solid waste incineration (MSWI) converts urban waste to energy but simultaneously emits carbon dioxide, carbon monoxide and multiple regulated air pollutants whose formation is tightly coupled within a single combustion system.

PROTOCOL: Late Interaction Retrieval for Protein Homolog Search

Authors: Gabrielle Cohn, Rohan Gumaste, Minh Hoang, Vihan Lakshman

First: 2026-05-27T22:50:48+00:00 · Latest: 2026-05-27T22:50:48+00:00

Abs · PDF · Code1 · Code2

Abstract

Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose sensitivity. Protein language models provide context-aware representations that could improve alignment sensitivity in this regime. However, prior protein embedding-based retrieval pipelines often pool these representations into a single vector, potentially obscuring local motifs, domains, or conserved residues that reveal remote homology. We introduce ProtoCol, a model which represents proteins as sets of residue embeddings and uses ColBERT-style late interaction to test whether residue-level comparison improves homolog retrieval. ProtoCol encodes proteins independently, keeps candidate representations pre-computable, and scores candidates with MaxSim over residue embeddings. On SCOPe superfamily and Pfam clan benchmarks, ProtoCol outperforms sequence-composition, alignment-based, pooled PLM, and trained single-vector baselines, supporting late interaction as an effective retrieval layer for remote homology search.

Summary / 总结

Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose sensitivity.

HumorGen: Cognitive Synergy for Humor Generation in Large Language Models via Persona-Based Distillation

Authors: Edward Ajayi, Prasenjit Mitra

First: 2026-03-19T13:12:53+00:00 · Latest: 2026-05-27T20:51:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective (next-token prediction) inherently conflicts with the surprise and incongruity required for comedy. To bridge this gap, we introduce the Cognitive Synergy Framework, a methodology for generating highquality humor data inspired by psychological theories of humor. Utilizing a Mixtureof-Thought (MoT) approach, we deploy six cognitive personas (e.g., The Absurdist, The Cynic) to synthesize diverse comedic perspectives for a given prompt. This framework produces a theory-grounded dataset, which we use to fine-tune a 7B-parameter student model. We further evaluate two alignment strategies, Direct Preference Optimization (DPO) and an offline group-relative variant O-GRPO, finding that neither improves over SFT. However, our 7B HumorGen model variants significantly outperform larger instruction-tuned baselines and achieve top-tier open-weight performance while remaining competitive with frontier proprietary systems. These results suggest that cognitively driven data curation is more critical than alignment algorithms or model scale for humor generation.

Summary / 总结

Humor generation poses a significant challenge for Large Language Models (LLMs), because their standard training objective (next-token prediction) inherently conflicts with the surprise and incongruity required for comedy.

Calibrating Generative Models to Distributional Constraints

Authors: Henry D. Smith, Nathaniel L. Diamant, Brian L. Trippe

Venue: ICML

First: 2025-10-11T04:49:28+00:00 · Latest: 2026-05-27T20:43:16+00:00

Comments: To appear at the International Conference on Machine Learning (ICML), 2026. Codebase accompanying the paper is available at: https://github.com/smithhenryd/cgm

Abs · PDF · Code1 · Code2 · Code3

Abstract

Generative models frequently suffer miscalibration, wherein statistics of the sampling distribution, such as the fraction of generations in a given class, deviate from desired values. We frame calibration as a constrained optimization problem and seek the closest model in Kullback-Leibler divergence satisfying a calibration constraint. To address the intractability of imposing these constraints exactly, we introduce two surrogate objectives for fine-tuning: (1) the relax loss, which replaces the constraint with a miscalibration penalty, and (2) the reward loss, which converts calibration into a reward fine-tuning problem. We demonstrate that these approaches substantially reduce calibration error across hundreds of simultaneous constraints and models with up to nine billion parameters, spanning applications in protein design, image generation, and language modeling.

Summary / 总结

Generative models frequently suffer miscalibration, wherein statistics of the sampling distribution, such as the fraction of generations in a given class, deviate from desired values.

Label-Free Reinforcement Learning via Cross-Model Entropy

Authors: Matt Gorbett, Hossein Shirazi

First: 2026-05-27T19:04:35+00:00 · Latest: 2026-05-27T19:04:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.

Summary / 总结

Post-training large language models with reinforcement learning is bottlenecked by the reward signal.

FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

Authors: Yonatan Dukler, Guihong Li, Deval Shah, Jiang Liu, Vikram Appia, Emad Barsoum

First: 2025-11-14T17:25:14+00:00 · Latest: 2026-05-27T18:20:06+00:00

Comments: MLSys'26

Abs · PDF · Code1 · Code2

Abstract

Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and while modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy that is comparable with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction tuned release averaged over a wide range of downstream evaluations. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks. For inference, we demonstrate 32.6% speedup in Time To First Token when serving a converted DeepSeek-V3 architecture with expert parallelism in SGLang and achieve 97.3% communication-computation overlap during the prefill stage. During training, our approach enables 88.9% communication overlap of the all-to-all communication collectives when pre-training DeepSeek-V3 MoE layers with expert parallelism.

Summary / 总结

Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings.

History

20260528_0458 20260527_0450 20260526_0419 20260525_0400 20260524_0402 20260523_0428 20260522_0446 20260521_0457 20260520_0425 20260519_0421 20260518_0401 20260517_0357 20260516_0410 20260515_0424 20260514_0427 20260513_0428 20260512_0428 20260511_0359 20260510_0352 20260509_0406 20260508_0408 20260507_0417 20260506_0402 20260505_0410 20260504_0347 20260503_0348 20260502_0401 20260501_0405 20260430_0407 20260429_0410 20260428_0403 20260427_0340 20260426_0338 20260425_0344 20260424_0403 20260423_0402 20260422_0359 20260421_0355 20260420_0336 20260419_0335 20260418_0352 20260417_0357 20260416_0358 20260415_0400 20260414_0400 20260413_0333 20260412_0329 20260411_0337 20260410_0359 20260409_0354 20260408_0353 20260407_0346 20260406_0328 20260405_0325 20260404_0333 20260403_0343 20260401_0350 20260331_0350 20260330_0328 20260328_0336 20260327_0351 20260326_0341 20260325_0349 20260324_0342 20260323_0319 20260322_0318 20260321_0332 20260320_0341 20260319_0343 20260318_0350 20260317_0353 20260316_0322 20260315_0321 20260314_0326 20260313_0341 20260312_0337 20260311_0333 20260310_0335 20260309_0318 20260308_0315 20260307_0329 20260306_0349 20260305_0332 20260304_0334 20260303_0332 20260302_0317 20260228_2322 20260228_2259 20260228_0348 20260227_0354 20260226_0402 20260225_0404 20260224_0406 20260223_0338 20260222_0339 20260221_0345 20260220_0348 20260219_0358 20260218_0358 20260217_0343 20260216_0339 20260215_0338 20260213_0401 20260212_0404 20260210_0409 20260208_0339 20260207_0349 20260206_0347 20260205_0346 20260204_0354 20260202_0337 20260201_0333 20260131_0345 20260130_0341 20260129_0344 20260128_0341 20260127_0338 20260126_0330 20260125_0329 20260124_0337 20260123_0337 20260122_0343 20260121_0424 20260119_0329 20260118_0327 20260117_0332 20260116_0339 20260115_0334 20260114_0333 20260113_0334 20260112_0331 20260111_0329 20260110_0333 20260109_0334 20260108_0335 20260107_0330 20260106_0336 20260105_0328 20260104_0328 20260103_0325 20260102_0339 20260101_0329 20251231_0333 20251230_0332 20251229_0329 20251228_0332 20251227_0329 20251226_0330 20251225_0329 20251224_0331 20251223_0332 20251222_0328 20251221_0329 20251220_0330 20251219_0330 20251218_0345 20251217_0332 20251216_0333 20251215_0333 20251214_0327 20251212_0333 20251211_0331 20251210_0332 20251209_0331 20251208_0328 20251207_0327 20251206_0330 20251205_0331 20251204_0331 20251203_0333 20251202_0335 20251201_0328 20251130_0327 20251129_0328 20251128_0327 20251127_0327 20251126_0329 20251125_0327 20251124_0327 20251123_0326 20251122_0328 20251121_0328 20251120_0329 20251119_0328 20251118_0328 20251117_0326 20251116_0325 20251115_0327 20251114_0328 20251113_0330 20251112_0329 20251111_0328 20251110_0325 20251109_0326 20251108_0328 20251107_0328 20251106_0329 20251105_0326 20251104_0327 20251103_0324 20251102_0326 20251101_0324 20251031_0328 20251030_0330 20251029_0329 20251028_0329 20251027_0322 20251026_0327 20251025_0331 20251024_0329 20251023_0329 20251022_0330 20251021_0331 20251020_0328 20251019_0321 20251018_0327 20251017_0320 20251016_0328 20251015_0328 20251014_0323 20251011_0328 20251010_0330 20251009_0321 20251008_0343 20251007_0353 20251006_0325 20251005_0350 20251004_0352 20251003_0352 20251002_0356 20251001_0321 20250925_0335 20250924_0350 20250923_0348 20250922_0346 20250921_0345 20250920_0342 20250919_0346 20250918_0342 20250917_0336 20250916_0333 20250915_0333 20250914_0328 20250913_0322 20250912_0335 20250911_0337 20250910_0338 20250909_0341 20250908_0342 20250907_0333 20250906_0350 20250905_0319 20250904_0323 20250903_0355 20250902_0325 20250901_0355 20250831_0355 20250830_0356 20250829_0355 20250828_0333 20250827_1654 20250827_1602 20250827_1557 20250827_0320 20250826_0320 20250825_1752 20250825_1709 20250825_1652 20250825_1647 20250825_1645 20250825_1631 20250825_1606 20250825_1559 20250825_1558 20250825_1556 20250825_1531 20250825_1525 20250825_1516 20250825_1450 20250825_1444 20250825_1438 20250825_1414 20250825_1413 20250825_1410 20250825_1408 20250825_1405 20250825_1401 20250825_1355 20250825_1347 20250825_1345 20250825_1344 20250825_1343 20250825_1340 20250825_1339 20250825_1333 20250825_1323 20250825_1317 20250825_1243 20250824_0342 20250823_0343 20250823_0142 20250822_2331 20250822_2308 20250822_2258 20250822_2241 20250822_2228 20250822_2206 20250822_2147 20250822_2111 20250822_1259 20250822_1233 20250822_1229 20250822_1223 20250822_1210 20250822_1201 20250822_1111 20250822_1058 20250822_1052 20250822_1045 20250822_0657 20250822_0553