AI4Science 论文速递

Snapshot: 20260503_0348

Agentic Education: Using Claude Code to Teach Claude Code

Authors: Zain Naboulsi

First: 2026-04-19T14:20:09+00:00 · Latest: 2026-04-30T16:55:22+00:00

Comments: 27 pages, 5 figures, 7 tables. v2: added discussion of the GenAI adoption gap (MIT NANDA 2025) and a future-work direction on affect-aware adaptation; no changes to the system, evaluation, or core contributions. Code: https://github.com/zainnab-sparq/cc-self-train

Abs · PDF · Code1 · Code2 · Code3

Abstract

AI coding assistants have proliferated rapidly, yet structured pedagogical frameworks for learning these tools remain scarce. Developers face a gap between tool documentation and practical mastery, relying on fragmented resources such as blog posts, video tutorials, and trial-and-error. We present cc-self-train, a modular interactive curriculum for learning Claude Code, an agentic AI coding tool, through hands-on project construction. The system introduces five contributions: (1) a persona progression model that adapts instructor tone across four stages (Guide, Collaborator, Peer, Launcher), operationalizing Gradual Release of Responsibility for AI-mediated instruction; (2) an adaptive learning system that observes engagement quality through hook-based heuristics and adjusts scaffolding at two timescales, using streak detection for mid-module intervention and aggregate metrics for module-boundary persona changes; (3) a cross-domain unified curriculum in which five distinct project domains share identical feature sequencing, enabling transfer learning; (4) a step-pacing mechanism with explicit pause primitives to manage information overload in an AI-as-instructor context; and (5) an auto-updating curriculum design in which the onboarding agent detects upstream tool changes and updates teaching materials before instruction begins. A parametrized test suite enforces structural consistency as a proxy for pedagogical invariants across all 50 modules. A pilot evaluation with 27 participants shows statistically significant reported self-efficacy gains across all 10 assessed skill areas (p < 0.001), with the largest effects on advanced features such as hooks and custom skills. We discuss implications for the design of auto-updating educational systems.

Summary / 总结

AI coding assistants have proliferated rapidly, yet structured pedagogical frameworks for learning these tools remain scarce.

In-context Learning vs. Instruction Tuning: The Case of Small and Multilingual Language Models

Authors: David Ponce, Thierry Etchegoyhen

First: 2025-03-03T14:47:23+00:00 · Latest: 2026-04-30T14:25:43+00:00

Abs · PDF · Code1 · Code2

Abstract

Instruction following is a critical ability for Large Language Models to perform downstream tasks. The standard approach to instruction tuning has relied on a specific phase of supervised fine-tuning over curated instruction datasets, optionally complemented with an alignment step over human preferences. Recent work has shown the potential of in-context learning (ICL) alternatives to guide base models towards instruction following. This type of approach is particularly relevant to circumvent the notable efforts and resources needed for supervised instruction tuning. In this work, we evaluate the viability of ICL for instruction following in scenarios where it is particularly relevant, i.e., languages other than English and across model sizes. Our results show that these scenarios result in downgraded ICL instruction following performance. We further show that applying Direct Preference Optimisation over base models can partially improve baseline results, although alternatives to current ICL instruction following will be needed to bridge the gap with larger English-centric language models.

Summary / 总结

Instruction following is a critical ability for Large Language Models to perform downstream tasks.

Post-Optimization Adaptive Rank Allocation for LoRA

Authors: Vishnuprasadh Kumaravelu, Sunil Gupta, P. K. Srijith

First: 2026-04-30T12:40:11+00:00 · Latest: 2026-04-30T12:40:11+00:00

Abs · PDF · Code1 · Code2

Abstract

Exponential growth in the scale of modern foundation models has led to the widespread adoption of Low-Rank Adaptation (LoRA) as a parameter-efficient fine-tuning technique. However, standard LoRA implementations disregard the varying intrinsic dimensionality of model layers and enforce a uniform rank, leading to parameter redundancy. We propose Post-Optimization Adaptive Rank Allocation (PARA), a data-free compression method for LoRA that integrates seamlessly into existing fine-tuning pipelines. PARA leverages Singular Value Decomposition to prune LoRA ranks using a global threshold over singular values across all layers. This results in non-uniform rank allocation based on layer-wise spectral importance. As a post-hoc method, PARA circumvents the training modifications and resulting instabilities that dynamic architectures typically incur. We empirically demonstrate that PARA reduces parameter count by 75-90\% while preserving the predictive performance of the original, uncompressed LoRA across multiple vision and language benchmarks. Code will be published upon acceptance.

Summary / 总结

Exponential growth in the scale of modern foundation models has led to the widespread adoption of Low-Rank Adaptation (LoRA) as a parameter-efficient fine-tuning technique.

Improving Graph Few-shot Learning with Hyperbolic Space and Denoising Diffusion

Authors: Yonghao Liu, Jialu Sun, Wei Pang, Fausto Giunchiglia, Ximing Li, Xiaoyue Feng, Renchu Guan

First: 2026-04-30T06:02:16+00:00 · Latest: 2026-04-30T06:02:16+00:00

Abs · PDF · Code1 · Code2

Abstract

Graph few-shot learning, which focuses on effectively learning from only a small number of labeled nodes to quickly adapt to new tasks, has garnered significant research attention. Despite recent advances in graph few-shot learning that have demonstrated promising performance, existing methods still suffer from several key limitations. First, during the meta-training phase, these methods typically perform node representation learning in Euclidean space, which often fails to capture the inherently hierarchical structure existing in real-world graph data. Second, during the meta-testing phase, they usually fit an empirical target distribution derived from only a few support samples, even when this distribution significantly deviates from the true underlying distribution. To address these issues, we propose IMPRESS, a novel framework that IMproves graPh few-shot learning with hypeRbolic spacE and denoiSing diffuSion. Specifically, our model learns node representations in a hyperbolic space and enriches the support distribution through denoising diffusion mechanisms. Theoretically, IMPRESS achieves a tighter generalization bound. Empirically, IMPRESS consistently outperforms competitive baselines across multiple benchmark datasets.

Summary / 总结

Graph few-shot learning, which focuses on effectively learning from only a small number of labeled nodes to quickly adapt to new tasks, has garnered significant research attention.

DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights

Authors: Saumya Gupta, Scott Biggs, Moritz Laber, Zohair Shafi, Robin Walters, Ayan Paul

Venue: ICLR 2026

First: 2026-01-08T15:56:28+00:00 · Latest: 2026-04-30T05:26:28+00:00

Comments: 25 pages, 20 tables, 2 figures

Abs · PDF · Code1 · Code2

Abstract

Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.

Summary / 总结

Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries.

BoostLoRA: Growing Effective Rank by Boosting Adapters

Authors: Raviteja Anantha, Nick Levato, Layne C. Price

First: 2026-04-30T01:43:00+00:00 · Latest: 2026-04-30T01:43:00+00:00

Comments: Preprint. Under review

Abs · PDF · Code1 · Code2

Abstract

Parameter-efficient fine-tuning (PEFT) methods face a tradeoff between adapter size and expressivity: ultra-low-parameter adapters are confined to fixed low-rank subspaces, capping performance even with extended training. We propose BoostLoRA, a gradient-boosting framework that overcomes this limit by iteratively training and merging minimal adapters on the examples the current model gets wrong. A ROTATE SVD basis strategy assigns each round to an orthogonal subspace, so cumulative effective rank grows linearly with the number of rounds while each adapter remains ultra-low-rank. After merging, adapters are discarded, leaving zero inference overhead. On Qwen2.5-3B, BoostLoRA reaches 89.1% on GSM8K and 68.8% on MATH-500, surpassing both the best single-shot ultra-low parameter adapter (TinyLoRA) and full fine-tuning; on code generation it reaches 57.2% on MBPP and 80.4% on HumanEval while full fine-tuning drops below the zero-shot baseline. We also demonstrate cross-architecture transfer on protein binding classification with ESM2-650M and cross-entropy training. BoostLoRA is, to our knowledge, the first PEFT method whose effective rank grows with training, separating per-round parameter cost from total representational capacity.

Summary / 总结

Parameter-efficient fine-tuning (PEFT) methods face a tradeoff between adapter size and expressivity: ultra-low-parameter adapters are confined to fixed low-rank subspaces, capping performance even with extended training.

Learning Rate Engineering: From Coarse Single Parameter to Layered Evolution

Authors: Ming-Hong Yao, Di Wang, Jian Cui, Jin-Yan Chen, Zi-Hao Cui, Fa Wang, Chen Wei, Qiu-Ye Yu

First: 2026-04-30T01:13:48+00:00 · Latest: 2026-04-30T01:13:48+00:00

Comments: 16 pages, 5 figures, 3 tables

Abs · PDF · Code1 · Code2

Abstract

Learning rate scheduling has evolved from the single global fixed rate of early SGD to sophisticated layer-wise adaptive strategies. We systematize this evolution into five generations: (Gen1) global fixed learning rates, (Gen2) global scheduling, (Gen3) parameter-level adaptation, (Gen4) layer-level differentiation, and (Gen5) joint layer-time scheduling. We trace the fundamental motivation behind each transition, showing how the shift from one-size-fits-all to tailoring by layer and time addresses the impossible trinity of transfer learning: lower layers require small updates to preserve general knowledge while higher layers need large updates to adapt to new tasks. Building on this taxonomy, we propose Discriminative Adaptive Layer Scaling (DALS), a unified framework that integrates phase-adaptive cosine scheduling, depth-aware Grokfast gradient filtering, and LARS-style trust ratios into a single coherent optimizer. We benchmark 18 strategies including three DALS variants across all five generations on five datasets: synthetic, CIFAR-10 (from scratch), RTE, TREC-6, and IMDb (fine-tuning). On synthetic, DALS achieves the best accuracy at 98.0%, while DALS-Fast reaches 90% in just 3 epochs. The cross-dataset analysis reveals striking regime-dependent patterns -- no single strategy wins across all regimes. Critically, STLR+Discriminative, the ULMFiT champion, catastrophically fails on from-scratch tasks (43.6% on TREC-6 from scratch vs. 96.8% with RAdam), confirming that directional decay biases are harmful without pretrained features. DALS avoids either extreme, achieving the best synthetic result while maintaining competitive fine-tuning performance.

Summary / 总结

Learning rate scheduling has evolved from the single global fixed rate of early SGD to sophisticated layer-wise adaptive strategies.

Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

Authors: Jon-Paul Cacioli

First: 2026-04-29T22:48:24+00:00 · Latest: 2026-04-29T22:48:24+00:00

Comments: 12 pages, 3 figures, 3 tables. Pre-registered on OSF (osf.io/7p64)

Abs · PDF · Code1 · Code2

Abstract

When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts? We map the boundary between these regimes using a six-condition adversarial instruction-specificity gradient administered to two instruction-tuned LLMs (Llama-3-8B and Llama-3.1-8B) on 2,000 MMLU-Pro items. Distributional screening (response-position entropy) and an independent content-engagement criterion (difficulty-accuracy correlation) jointly characterise each condition. The gradient reveals three regimes rather than a monotonic transition. Vague adversarial instructions produce moderate accuracy reduction with preserved content engagement. Standard sandbagging and capability-imitation instructions produce positional entropy collapse with partial content engagement. A two-step answer-aware avoidance instruction produces extreme positional collapse, with near-total concentration on a single response position (99.9% and 87.4%) and no measurable content sensitivity. This was the only multi-step instruction tested, and it produced the most extreme shortcut. The attractor position matches each model's content-absent null-prompt default. The effect replicates across both models and four academic domains. Distributional collapse and content engagement can co-occur (50% concordance between screening criteria), indicating that entropy-based screening and difficulty-based content assessment capture partially independent dimensions of response validity. Results suggest that instruction complexity can determine whether adversarial compliance uses content-aware or content-blind mechanisms in small instruction-tuned LLMs under greedy decoding.

Summary / 总结

When instructed to underperform on multiple-choice evaluations, do language models engage with question content or fall back on positional shortcuts?

Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents

Authors: Anh Ta, Junjie Zhu, Shahin Shayandeh

First: 2026-04-29T22:09:47+00:00 · Latest: 2026-04-29T22:09:47+00:00

Abs · PDF · Code1 · Code2

Abstract

Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at inference time: a specialized reviewer agent evaluates provisional tool calls prior to execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce Helpfulness-Harmfulness metrics: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. We evaluate our approach on BFCL (single-turn) and Tau2-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5-2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.

Summary / 总结

Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently post-hoc.

Zero-Shot to Full-Resource: Cross-lingual Transfer Strategies for Aspect-Based Sentiment Analysis

Authors: Jakob Fehle, Nils Constantin Hellwig, Udo Kruschwitz, Christian Wolff

First: 2026-04-29T12:45:25+00:00 · Latest: 2026-04-29T12:45:25+00:00

Abs · PDF · Code1 · Code2

Abstract

Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models. This work presents a multilingual evaluation of state-of-the-art ABSA approaches across seven languages (English, German, French, Dutch, Russian, Spanish, and Czech) and four subtasks (ACD, ACSA, TASD, ASQP). We systematically compare different transformer architectures under zero-resource, data-only, and full-resource settings, using cross-lingual transfer, code-switching and machine translation. Fine-tuned Large Language Models (LLMs) achieve the highest overall scores, particularly in complex generative tasks, while few-shot counterparts approach this performance in simpler setups, where smaller encoder models also remain competitive. Cross-lingual training on multiple non-target languages yields the strongest transfer for fine-tuned LLMs, while smaller encoder or seq-to-seq models benefit most from code-switching, highlighting architecture-specific strategies for multilingual ABSA. We further contribute two new German datasets, an adapted GERestaurant and the first German ASQP dataset (GERest), to encourage multilingual ABSA research beyond English.

Summary / 总结

Aspect-based Sentiment Analysis (ABSA) extracts fine-grained opinions toward specific aspects within text but remains largely English-focused despite major advances in transformer-based and instruction-tuned models.

Advancing multi-site emission control: A physics-informed transfer learning framework with mixture of experts for carbon-pollutant synergy

Authors: Yuxuan Ying, Hanqing Yang, Kaige Wang, Yu Hu, Zhiming Zheng, Yunliang Jiang, Xiaoqing Lin, Xiaodong Li, Jun Chen

First: 2026-04-29T11:54:18+00:00 · Latest: 2026-04-29T11:54:18+00:00

Comments: Supplementary materials will be released after the final version is finalized

Abs · PDF · Code1 · Code2

Abstract

Municipal solid waste incineration is increasingly central to urban waste management, yet its sustainability benefit depends on controlling carbon emissions and multiple air pollutants under highly heterogeneous operating conditions. Current data-driven models are often accurate within individual plants but are difficult to transfer across facilities, limiting their value for scalable emission-control strategies. Here we show that multi-site emission behaviour can be represented through transferable system-level structures when physical constraints, operating-regime heterogeneity and carbon--pollutant coupling are jointly considered. We develop a physics-informed transfer learning framework built on a carbon--pollutant mixture-of-experts model, which combines regime-dependent expert routing with conservation-based regularization and a carbon--pollutant synergistic index for integrated risk evaluation. Across 13 municipal solid waste incineration plants, the model captured both pollutant-specific emissions and system-level risk, achieving source-domain average pollutant $R^2$ values of 0.668--0.904 and CPSI $R^2$ values of 0.666--0.970. After transfer from a reference facility to 12 target plants, average pollutant $R^2$ remained between 0.661 and 0.842, while CPSI retained comparable transferability ($R^2$ = 0.610--0.841). Expert-utilization patterns further indicate that adaptation occurs through structured re-weighting of operating regimes rather than complete model re-learning. By extending the learned representation into an interpretable digital twin, this framework provides a route from emission prediction to regime-aware operational navigation, supporting scalable carbon--pollutant synergistic control across heterogeneous waste-to-energy systems.

Summary / 总结

Municipal solid waste incineration is increasingly central to urban waste management, yet its sustainability benefit depends on controlling carbon emissions and multiple air pollutants under highly heterogeneous operating conditions.

ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers

Authors: Fotios Lygerakis, Ozan Özdenizci, Elmar Rückert

First: 2025-05-26T14:19:29+00:00 · Latest: 2026-04-29T11:05:46+00:00

Abs · PDF · Code1 · Code2 · Project1

Abstract

Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of \emph{ViTaPEs} in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes

Summary / 总结

Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force.

TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation

Authors: Lin Sun, Guangxiang Zhao, Xiaoqi Jian, Yuhan Wu, Weihong Lin, Yongfu Zhu, Qilong Shi, Change Jia, Aomufei Yuan, Yuxuan Tian, Linglin Zhang, Jinzhu Wu, Junfeng Ran, Sai-er Hu, Zihan Jiang, Junting Zhou, Wenrui Liu, Xusen Xiao, Bin Cui, Tong Yang, Xiangzheng Zhang

First: 2025-03-06T16:25:53+00:00 · Latest: 2026-04-29T08:59:56+00:00

Comments: Preprint

Abs · PDF · Code1 · Code2

Abstract

The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is \textit{selectively distilled} into specialized student models via domain-specific supervised fine-tuning (SFT); And (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points) and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.

Summary / 总结

The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention.

Meta-Learning and Targeted Differential Privacy to Improve the Accuracy-Privacy Trade-off in Recommendations

Authors: Peter Müllner, Dominik Kowald, Markus Schedl, Elisabeth Lex

First: 2026-04-29T08:00:20+00:00 · Latest: 2026-04-29T08:00:20+00:00

Comments: Accepted at LBR@UMAP'26

Abs · PDF · Code1 · Code2

Abstract

Balancing differential privacy (DP) with recommendation accuracy is a key challenge in privacy-preserving recommender systems, since DP-noise degrades accuracy. We address this trade-off at both the data and model levels. At the data level, we apply DP only to the most stereotypical user data likely to reveal sensitive attributes, such as gender or age, to reduce unnecessary perturbation; we refer to this as targeted DP. At the model level, we use meta-learning to improve robustness to remaining DP-noise. This achieves a better trade-off between accuracy and privacy than standard approaches: Meta-learning improves accuracy and targeted DP leads to lower empirical privacy risk compared to uniformly applied DP and full DP baselines. Overall, our findings show that selectively applying DP at the data level together with meta-learning at the model level can effectively balance recommendation accuracy and user privacy.

Summary / 总结

Balancing differential privacy (DP) with recommendation accuracy is a key challenge in privacy-preserving recommender systems, since DP-noise degrades accuracy.

Adaptive and Fine-grained Module-wise Expert Pruning for Efficient LoRA-MoE Fine-Tuning

Authors: Weihang Li, Jianchun Liu, Hongli Xu

First: 2026-04-29T06:45:44+00:00 · Latest: 2026-04-29T06:45:44+00:00

Abs · PDF · Code1 · Code2

Abstract

LoRA-MoE has emerged as an effective paradigm for parameter-efficient fine-tuning, combining the low training cost of LoRA with the increased adaptation capacity of Mixture-of-Experts (MoE). However, existing LoRA-MoE frameworks typically adopt a fixed and uniform expert configuration across heterogeneous Transformer modules (\eg, attention query/key projections and MLP gating networks), ignoring their distinct functional roles and capacity requirements. This design leads to localized over-provisioning, redundant trainable parameters, and unnecessary optimizer-state overhead. Moreover, prior methods enforce load balancing among experts throughout training. Although beneficial in the early stage, this constraint becomes restrictive once routing patterns stabilize, limiting expert specialization on downstream tasks. In this paper, we propose DMEP, a novel LoRA-MoE fine-tuning framework based on Dynamic Module-wise Expert Pruning. DMEP tracks expert utilization during training and physically removes low-utility experts on a per-module basis, yielding a more compact expert structure tailored to different modules. The pruned model then continues training without the load-balancing constraint, freeing the remaining experts to focus entirely on the downstream task and develop specialized expertise. By jointly adapting module-wise expert capacity and eliminating unnecessary balancing, DMEP improves both parameter efficiency and training efficiency. Extensive experiments on multiple reasoning benchmarks show that DMEP reduces trainable parameters by 35\%--43\% and improves training throughput by about 10\%, while maintaining or surpassing the downstream reasoning accuracy of uniform LoRA-MoE baselines.

Summary / 总结

LoRA-MoE has emerged as an effective paradigm for parameter-efficient fine-tuning, combining the low training cost of LoRA with the increased adaptation capacity of Mixture-of-Experts (MoE).

ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images

Authors: Xinyue Li, Zhiming Xu, Min Tang, Zhaolin Cai, Sijing Wu, Xiongkuo Min, Yitong Chen, Guangtao Zhai

First: 2026-02-03T14:04:51+00:00 · Latest: 2026-04-29T03:49:13+00:00

Abs · PDF · Code1 · Code2

Abstract

Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations. To address this, we present ELIQ, a Label-free Framework for Quality Assessment of Evolving AI-generated Images. Specifically, ELIQ focuses on visual quality and prompt-image alignment, automatically constructs positive and aspect-specific negative pairs to cover both conventional distortions and AIGC-specific distortion modes, enabling transferable supervision without human annotations. Building on these pairs, ELIQ adapts a pre-trained multimodal model into a quality-aware critic via instruction tuning and predicts two-dimensional quality using lightweight gated fusion and a Quality Query Transformer. Experiments across multiple benchmarks demonstrate that ELIQ consistently outperforms existing label-free methods, generalizes from AI-generated content (AIGC) to user-generated content (UGC) scenarios without modification, and paves the way for scalable and label-free quality assessment under continuously evolving generative models. The code will be released upon publication.

Summary / 总结

Generative text-to-image models are advancing at an unprecedented pace, continuously shifting the perceptual quality ceiling and rendering previously collected labels unreliable for newer generations.

Compositional Meta-Learning for Mitigating Task Heterogeneity in Physics-Informed Neural Networks

Authors: Beomchul Park, Minsu Koh, Heejo Kong, Seong-Whan Lee

First: 2026-04-29T03:09:57+00:00 · Latest: 2026-04-29T03:09:57+00:00

Comments: Accepted by Pattern Recognition

Abs · PDF · Code1 · Code2

Abstract

Physics-informed neural networks (PINNs) approximate solutions of partial differential equations (PDEs) by embedding physical laws into the loss function. In parameterized PDE families, variations in coefficients or boundary/initial conditions define distinct tasks. This makes training individual PINNs for each task computationally prohibitive, while cross-task transfer can be sensitive to task heterogeneity. While meta-learning can reduce retraining cost, existing methods often rely on a single global initialization and may suffer from negative transfer, particularly under feature-scarce coordinate inputs and limited training-task availability. We propose the Learning-Affinity Adaptive Modular Physics-Informed Neural Network (LAM-PINN), a compositional framework that leverages task-specific learning dynamics. LAM-PINN combines PDE parameters with learning-affinity metrics from brief transfer sessions to construct a task representation and cluster tasks even with coordinate-only inputs. It decomposes the model into cluster-specialized subnetworks and a shared meta network, and learns routing weights to selectively reuse modules instead of relying on a single global initialization. Across three PDE benchmarks, LAM-PINN achieves an average 19.7-fold reduction in mean squared error (MSE) on unseen tasks using only 10% of the training iterations required by conventional PINNs. These results indicate its effectiveness for generalization to unseen configurations within bounded design spaces of parameterized PDE families in resource-constrained engineering settings.

Summary / 总结

Physics-informed neural networks (PINNs) approximate solutions of partial differential equations (PDEs) by embedding physical laws into the loss function.

DepthPilot: From Controllability to Interpretability in Colonoscopy Video Generation

Authors: Junhu Fu, Ke Chen, Weidong Guo, Shuyu Liang, Jie Xu, Chen Ma, Kehao Wang, Shengli Lin, Zeju Li, Yuanyuan Wang, Yi Guo, Shuo Li

First: 2026-04-29T02:24:28+00:00 · Latest: 2026-04-29T02:24:28+00:00

Abs · PDF · Code1 · Code2

Abstract

Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generated contents with physical priors and faithful clinical manifestations. To push the boundaries from mere controllability to interpretability, we propose DepthPilot, the first interpretable framework for colonoscopy video generation. This work takes a step toward trustworthy generation through two synergistic paradigms. To achieve explicit geometric grounding, DepthPilot devises a prior distribution alignment strategy, injecting depth constraints into the diffusion backbone via parameter-efficient fine-tuning to ensure anatomical fidelity. To enhance intrinsic nonlinear modeling under these geometric constraints, DepthPilot employs an adaptive spline denoising module, replacing fixed linear weights with learnable spline functions to capture complex spatio-temporal dynamics. Extensive evaluations across three public datasets and in-house clinical data confirm DepthPilot's robust ability to produce physically consistent videos. It achieves FID scores below 15 across all benchmarks and ranks first in clinician assessments, bridging the gap between "visually realistic" and "clinically interpretable". Moreover, DepthPilot-generated videos are expected to enable reliable 3D reconstruction, facilitating surgical navigation and blind region identification, and serve as a foundation toward the colorectal world model.

Summary / 总结

Controllable medical video generation has achieved remarkable progress, but it still lacks interpretability, which requires the alignment of generated contents with physical priors and faithful clinical manifestations.

Training-Free Adaptation of New-Generation LLMs using Legacy Clinical Models

Authors: Sasha Ronaghi, Chloe Stanwyck, Asad Aali, Amir Ronaghi, Miguel Fuentes, Tina Hernandez-Boussard, Emily Alsentzer

First: 2026-01-06T21:23:47+00:00 · Latest: 2026-04-28T22:36:58+00:00

Abs · PDF · Code1 · Code2

Abstract

Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation. We propose Cross-Architecture Proxy Tuning (CAPT), a model-ensembling approach that enables training-free adaptation of state-of-the-art general-domain models using existing clinical models. CAPT supports models with disjoint vocabularies, leveraging contrastive decoding to selectively inject clinically relevant signals while preserving the general-domain model's reasoning and fluency. On six clinical classification and text-generation tasks, CAPT with a new-generation general-domain model and an older-generation clinical model consistently outperforms both models individually and state-of-the-art ensembling approaches (average +17.6\% over UniTE, +41.4\% over proxy tuning across tasks). Through token-level analysis and physician case studies, we demonstrate that CAPT amplifies clinically actionable language, reduces context errors, and increases clinical specificity. This technique especially benefits healthcare institutions with constrained computational capacity that cannot support iterative clinical training and want to adopt emerging general-domain model advances.

Summary / 总结

Adapting language models to the clinical domain through continued pretraining and instruction tuning requires costly retraining for each new model generation.

Graph Property Inference in Small Language Models: Effects of Representation and Reasoning Strategy

Authors: Michal Podstawski

First: 2026-02-24T13:46:35+00:00 · Latest: 2026-04-28T19:02:35+00:00

Abs · PDF · Code1 · Code2

Abstract

Recent progress in language modeling has expanded the range of tasks that can be approached through natural language interfaces, including problems that require structured reasoning. However, it remains unclear how effectively limited-capacity language models can infer formal properties of relational structures when those structures are presented in textual form. We conduct a systematic study of graph-theoretic property inference in small instruction-tuned language models, isolating the roles of input representation and reasoning strategy. Across a diverse set of local and global graph metrics evaluated on three models, we find that small language models fail to achieve reliable graph property estimation: normalized errors consistently exceed the intrinsic dispersion of target properties, and rank correlations remain weak across all configurations. However, the failure is structured rather than uniform. Adjacency-list encodings consistently reduce error and improve ordinal consistency relative to edge-lists, and multi-branch reasoning yields measurable aggregate gains across configurations. These results show that without task-specific fine-tuning or architectural adaptation, graph property inference in pretrained small language models remains fundamentally unreliable, but that representational organization and inference design produce consistent differences. The findings characterize the conditions under which structured inference degrades and identify which design choices yield improvements even under constrained model capacity.

Summary / 总结

Recent progress in language modeling has expanded the range of tasks that can be approached through natural language interfaces, including problems that require structured reasoning.

Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing

Authors: Keita Broadwater

First: 2026-02-12T10:09:13+00:00 · Latest: 2026-04-28T16:38:37+00:00

Comments: 23 pages, 9 figures; editorial and LaTeX revisions for clarity; improved presentation of methodology and results; updated figures, tables, and float placement; clarified temperature sensitivity and deployment-risk analysis; expanded reporting from the same experiments; results unchanged in substance

Abs · PDF · Code1 · Code2

Abstract

Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories. However, real-world deployment often exposes a different class of risk: operational failures that arise under repeated inference on identical or near-identical prompts rather than from broad task-level underperformance. In high-stakes settings, response consistency and safety under sustained use are therefore critical. We introduce Accelerated Prompt Stress Testing (APST), a depth-oriented evaluation framework inspired by highly accelerated stress testing in reliability engineering. APST repeatedly samples identical prompts under controlled operational conditions (such as decoding temperature) to surface latent failure modes including hallucinations, refusal inconsistency, and unsafe completions. Rather than treating failures as isolated events, APST models them as stochastic outcomes of repeated inference and uses Bernoulli and binomial formulations to estimate per-inference failure probabilities. Applying APST to multiple instruction-tuned LLMs evaluated on AIR-BENCH 2024--derived safety and security prompts, we find that models with comparable shallow-evaluation scores can exhibit substantially different empirical failure rates under repeated sampling. These results show that single-sample or low-depth evaluation can obscure meaningful differences in deployment-relevant reliability. APST complements existing benchmark methodologies by providing a practical framework for estimating failure frequency under sustained use and comparing safety reliability across models and decoding configurations.

Summary / 总结

Traditional benchmarks for large language models (LLMs), such as HELM and AIR-BENCH, primarily assess safety through breadth-oriented evaluation across diverse tasks and risk categories.

Adaptive Meta-Learning Stochastic Gradient Hamiltonian Monte Carlo Simulation for Bayesian Updating of Structural Dynamic Models

Authors: Xianghao Meng, James L. Beck, Yong Huang, Hui Li

Venue: Comput Meth Appl Mech Eng; 437: 117753 (2025)

First: 2026-04-28T14:34:48+00:00 · Latest: 2026-04-28T14:34:48+00:00

Abs · PDF · Code1 · Code2

Abstract

In the last few decades, Markov chain Monte Carlo (MCMC) methods have been widely applied to Bayesian updating of structural dynamic models in the field of structural health monitoring. Recently, several MCMC algorithms have been developed that incorporate neural networks to enhance their performance for specific Bayesian model updating problems. However, a common challenge with these approaches lies in the fact that the embedded neural networks often necessitate retraining when faced with new tasks, a process that is time-consuming and significantly undermines the competitiveness of these methods. This paper introduces a newly developed adaptive meta-learning stochastic gradient Hamiltonian Monte Carlo (AM-SGHMC) algorithm. The idea behind AM-SGHMC is to optimize the sampling strategy by training adaptive neural networks, and due to the adaptive design of the network inputs and outputs, the trained sampler can be directly applied to various Bayesian updating problems of the same type of structure without further training, thereby achieving meta-learning. Additionally, practical issues for the feasibility of the AM-SGHMC algorithm for structural dynamic model updating are addressed, and two examples involving Bayesian updating of multi-story building models with different model fidelity are used to demonstrate the effectiveness and generalization ability of the proposed method.

Summary / 总结

In the last few decades, Markov chain Monte Carlo (MCMC) methods have been widely applied to Bayesian updating of structural dynamic models in the field of structural health monitoring.

Health System Scale Semantic Search Across Unstructured Clinical Notes

Authors: Faith Wavinya Mutinda, Spandana Makeneni, Anna Lin, Shivaji Dutta, Irit R. Rasooly, Patrick Dibussolo, Shivani Kamath Belman, Hessam Shahriari, Kevin Murphy, Alex B. Ruan, Barbara H. Chaiyachati, Sanjay Chainani, Robert W. Grundmeier, Scott M. Haag, Jeffrey M. Miller, Heather M. Griffis, Ian M. Campbell

First: 2026-04-28T13:09:48+00:00 · Latest: 2026-04-28T13:09:48+00:00

Comments: for associated code, see https://github.com/Ian-Campbell-Lab/clinical-semantic-search

Abs · PDF · Code1 · Code2 · Code3

Abstract

Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information. However, deploying semantic search across entire health systems, comprising hundreds of millions of clinical notes, presents formidable engineering, cost, and governance challenges that have prevented adoption. Methods: We deployed a semantic search system at a large children's hospital indexing 166 million clinical notes (484 million vectors) from 1.68 million patients. The system uses instruction-tuned qwen3-embedding-0.6B embeddings, stores vectors in a managed database with storage-optimized indexing, maintains full-text metadata in a low-latency key-value store, and operates within a HIPAA-compliant governance framework. We evaluated the system through three experiments: optimization of embedding model and chunking strategy using a physician-authored benchmark dataset, characterization of full-scale performance (cost, latency, retrieval quality), and clinical utility assessment via comparison of chart abstraction efficiency across three tasks. Results: The system delivers sub-second query latency (median 237 ms single-user, 451 ms 20-user concurrency) with monthly costs of approximately USD 4,000. Qwen3 embeddings with 300-token chunk size achieved 94.6% accuracy on a clinical question-answering benchmark. In clinical utility evaluation across three abstraction tasks, semantic search reduced time-to-completion by 24 to 89% compared to clinician-performed chart review while maintaining comparable inter-rater agreement. Conclusion: Health-system-scale semantic search is both technically and operationally feasible. The system provides infrastructure supporting interactive search, cohort generation, and downstream LLM-powered clinical applications without requiring specialized informatics expertise.

Summary / 总结

Introduction: Semantic search, which retrieves documents based on conceptual similarity rather than keyword matching, offers substantial advantages for retrieval of clinical information.

FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

Authors: Changyu Li, Shuanghong Huang, Jiashen Liu, Ming Lei, Jidu Xing, Kaishun Wu, Lu Wang, Fei Luo

First: 2026-04-28T09:29:41+00:00 · Latest: 2026-04-28T09:29:41+00:00

Comments: 19 pages, 15 figures

Abs · PDF · Code1 · Code2

Abstract

Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.

Summary / 总结

Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation.

On Finding Small Hyper-Gradients in Bilevel Optimization: Hardness Results and Improved Analysis

Authors: Lesi Chen, Jing Xu, Jingzhao Zhang

First: 2023-01-02T15:09:12+00:00 · Latest: 2026-04-28T06:26:34+00:00

Comments: Published in COLT 2024. This arXiv version refines Assumption 4.1 (d); adds discussions on related works in Appendix A; and corrects the kappa dependency in the upper bounds

Abs · PDF · Code1 · Code2

Abstract

Bilevel optimization reveals the inner structure of otherwise oblique optimization problems, such as hyperparameter tuning, neural architecture search, and meta-learning. A common goal in bilevel optimization is to minimize a hyper-objective that implicitly depends on the solution set of the lower-level function. Although this hyper-objective approach is widely used, its theoretical properties have not been thoroughly investigated in cases where the lower-level functions lack strong convexity. In this work, we first provide hardness results to show that the goal of finding stationary points of the hyper-objective for nonconvex-convex bilevel optimization can be intractable for zero-respecting algorithms. Then we study a class of tractable nonconvex-nonconvex bilevel problems when the lower-level function satisfies the Polyak-Łojasiewicz (PL) condition. We show a simple first-order algorithm can achieve better complexity bounds of $\tilde{\mathcal{O}}(ε^{-2})$, $\tilde{\mathcal{O}}(ε^{-4})$ and $\tilde{\mathcal{O}}(ε^{-6})$ in the deterministic, partially stochastic, and fully stochastic setting respectively. The complexities in the first two cases are optimal up to logarithmic factors.

Summary / 总结

Bilevel optimization reveals the inner structure of otherwise oblique optimization problems, such as hyperparameter tuning, neural architecture search, and meta-learning.

Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance

Authors: Jon-Paul Cacioli

First: 2026-04-28T05:57:23+00:00 · Latest: 2026-04-28T05:57:23+00:00

Comments: 10 pages, 2 figures, 2 tables. Pre-registered: https://osf.io/6zftv/

Abs · PDF · Code1 · Code2 · Project1

Abstract

Detecting sandbagging--the deliberate underperformance on capability evaluations--is an open problem in AI safety. We tested whether symptom validity testing (SVT) logic from clinical malingering detection could identify sandbagging through below-chance performance (BCB) on forced-choice items. In a pre-registered pilot at the 7-9 billion parameter instruction-tuned scale (3 models, 4 MMLU-Pro domains, 4 conditions, 500 items per cell, 24,000 total trials), the plausibility gate failed. Zero of 12 model-domain cells showed significant below-chance performance under sandbagging instruction. Exploratory analyses revealed three qualitatively distinct failure modes. Qwen-2.5-7B and Phi-3.5-mini largely ignored the sandbagging instruction, with 62-88% response identity with the honest baseline. Llama-3-8B complied substantially but implemented underperformance as a positional heuristic, collapsing its response distribution onto middle-alphabet options (E at 31.8%, F at 26.1%) regardless of where the correct answer fell. This produced accuracy boosts of up to 33 percentage points when the correct answer coincidentally occupied the model's preferred position. An explicit anti-task instruction ("pick the least likely answer") drove two of three models below chance, with accuracy as low as 0.024. The capability for answer-aware avoidance therefore exists but is not activated by "deliberately underperform." BCB did not fail as a logical marker of answer-aware avoidance. It was not observed in this regime because the model showing the largest behavioural shift exhibited behaviour consistent with a position-dominant response policy rather than content-aware answer avoidance. We propose that positional-distribution shift may be a more effective behavioural signature than below-chance accuracy for detecting prompted underperformance at this model scale.

Summary / 总结

Detecting sandbagging--the deliberate underperformance on capability evaluations--is an open problem in AI safety.

Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal

Authors: Nirmalendu Prakash, Yeo Wei Jie, Amir Abdullah, Ranjan Satapathy, Erik Cambria, Roy Ka Wei Lee

First: 2025-09-07T02:29:07+00:00 · Latest: 2026-04-28T03:29:39+00:00

Abs · PDF · Code1 · Code2

Abstract

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood. We study two public instruction-tuned models, Gemma-2-2B-IT and LLaMA-3.1-8B-IT, using sparse autoencoders (SAEs) trained on residual-stream activations. Given a harmful prompt, we search the SAE latent space for feature sets whose ablation flips the model from refusal to compliance, demonstrating causal influence and creating a jailbreak. Our search proceeds in three stages: (1) Refusal Direction: find a refusal-mediating direction and collect SAE features near that direction; (2) Greedy Filtering: prune to a minimal set; and (3) Interaction Discovery: fit a factorization machine (FM) that captures nonlinear interactions among the remaining active features and the minimal set. This pipeline yields a broad set of jailbreak-critical features, offering insight into the mechanistic basis of refusal. Moreover, we find evidence of redundant features that remain dormant unless earlier features are suppressed. Our findings highlight the potential for fine-grained auditing and targeted intervention in safety behaviours by manipulating the interpretable latent space.

Summary / 总结

Refusal on harmful prompts is a key safety behaviour in instruction-tuned large language models (LLMs), yet the internal causes of this behaviour remain poorly understood.

Prior-Aligned Data Cleaning for Tabular Foundation Models

Authors: Laure Berti-Equille

First: 2026-04-28T02:56:17+00:00 · Latest: 2026-04-28T02:56:17+00:00

Comments: 15 pages, 8 figures

Abs · PDF · Code1 · Code2

Abstract

Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating processes -- making them highly attractive for practitioners who cannot afford large annotated corpora. However, their in-context learning mechanism assumes approximately clean inputs: missing values, outliers, and duplicates in the real-world data create a prior mismatch that degrades both accuracy and confidence calibration simultaneously. Correcting this mismatch requires sequential decisions over cleaning operators whose interactions no static preprocessing rule can anticipate -a natural fit for reinforcement learning~(RL). We introduce L2C2, the first deep RL framework framing tabular data cleaning as prior alignment: a learned policy sequences operators to minimize the distributional gap between dirty input and the TFM's synthetic prior. Six experiments on ten OpenML benchmark datasets establish: 1) three of seven reward designs collapse to degenerate trivial cleaning strategies -- principled reward engineering is scientifically non-trivial; 2) the novel TFMAwareReward reward we propose selects structurally distinct pipelines on 4/10 datasets and achieves higher TabPFN accuracy on those diverging cases (mean 0.851 vs. 0.843; Wilcoxon p=0.063, n=4) while never underperforming; 3) parameterized cleaning actions improve best-found pipeline reward on 9/10 datasets (Wilcoxon p=0.004); and 4) a policy pre-trained on one single source dataset exceeds scratch training at the 2,000-step fine-tuning checkpoint on all three held-out datasets (up to +28.8% after full fine-tuning) demonstrating cross-dataset transfer of prior-alignment knowledge. These findings establish that prior alignment is a principled data preparation strategy for TFM deployment on real-world tabular data.

Summary / 总结

Tabular Foundation Models (TFMs) achieve state-of-the-art zero-shot accuracy on small tabular datasets by meta-learning over synthetic data-generating processes -- making them highly attractive for practitioners who cannot afford large annotated corpora.

The Last Harness You'll Ever Build

Authors: Haebin Seong, Li Yin, Haoran Zhang

First: 2026-04-22T18:51:48+00:00 · Latest: 2026-04-28T02:33:56+00:00

Abs · PDF · Code1 · Code2

Abstract

AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge. \textbf{Each new task domain requires painstaking, expert-driven harness engineering}: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that automates this process. At the first level, the \textbf{Harness Evolution Loop} optimizes a worker agent's harness $\mathcal{H}$ for a single task: a Worker Agent $W_{\mathcal{H}}$ executes the task, an Evaluator Agent $V$ adversarially diagnoses failures and scores performance, and an Evolution Agent $E$ modifies the harness based on the full history of prior attempts. At the second level, the \textbf{Meta-Evolution Loop} optimizes the evolution blueprint $Λ= (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ itself across diverse tasks, \textbf{learning a blueprint $Λ^{(\text{best})}$ that enables rapid harness convergence on any new task -- so that adapting an agent to a novel domain requires no human harness engineering at all.} We formalize the correspondence to meta-learning and present both algorithms. The framework \textbf{shifts manual harness engineering into automated harness engineering}, and takes one step further -- \textbf{automating the design of the automation itself}.

Summary / 总结

AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge.

What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective

Authors: Guangzeng Han, Xiaolei Huang

Venue: ACL 2026

First: 2026-04-28T02:09:28+00:00 · Latest: 2026-04-28T02:09:28+00:00

Comments: ACL 2026, main conference

Abs · PDF · Code1 · Code2

Abstract

Instruction-tuning datasets often contain substantial redundancy and low-quality samples, necessitating effective data selection methods. We propose an instruction data selection framework based on weighted in-context influence (wICI), which measures how effectively each candidate example reduces instruction-following difficulty for semantically related peers. Through systematic experiments, we address three key questions: what constitutes effective instruction tuning data from an in-context perspective, whether sample difficulty correlates with in-context influence, and how in-context influence translates to instruction tuning effectiveness. Experiments across multiple models and benchmarks demonstrate that our method consistently outperforms existing baselines under constrained data budgets, while empirically showing that sample difficulty negatively correlates with in-context influence.

Summary / 总结

Instruction-tuning datasets often contain substantial redundancy and low-quality samples, necessitating effective data selection methods.

History

20260502_0401 20260501_0405 20260430_0407 20260429_0410 20260428_0403 20260427_0340 20260426_0338 20260425_0344 20260424_0403 20260423_0402 20260422_0359 20260421_0355 20260420_0336 20260419_0335 20260418_0352 20260417_0357 20260416_0358 20260415_0400 20260414_0400 20260413_0333 20260412_0329 20260411_0337 20260410_0359 20260409_0354 20260408_0353 20260407_0346 20260406_0328 20260405_0325 20260404_0333 20260403_0343 20260401_0350 20260331_0350 20260330_0328 20260328_0336 20260327_0351 20260326_0341 20260325_0349 20260324_0342 20260323_0319 20260322_0318 20260321_0332 20260320_0341 20260319_0343 20260318_0350 20260317_0353 20260316_0322 20260315_0321 20260314_0326 20260313_0341 20260312_0337 20260311_0333 20260310_0335 20260309_0318 20260308_0315 20260307_0329 20260306_0349 20260305_0332 20260304_0334 20260303_0332 20260302_0317 20260228_2322 20260228_2259 20260228_0348 20260227_0354 20260226_0402 20260225_0404 20260224_0406 20260223_0338 20260222_0339 20260221_0345 20260220_0348 20260219_0358 20260218_0358 20260217_0343 20260216_0339 20260215_0338 20260213_0401 20260212_0404 20260210_0409 20260208_0339 20260207_0349 20260206_0347 20260205_0346 20260204_0354 20260202_0337 20260201_0333 20260131_0345 20260130_0341 20260129_0344 20260128_0341 20260127_0338 20260126_0330 20260125_0329 20260124_0337 20260123_0337 20260122_0343 20260121_0424 20260119_0329 20260118_0327 20260117_0332 20260116_0339 20260115_0334 20260114_0333 20260113_0334 20260112_0331 20260111_0329 20260110_0333 20260109_0334 20260108_0335 20260107_0330 20260106_0336 20260105_0328 20260104_0328 20260103_0325 20260102_0339 20260101_0329 20251231_0333 20251230_0332 20251229_0329 20251228_0332 20251227_0329 20251226_0330 20251225_0329 20251224_0331 20251223_0332 20251222_0328 20251221_0329 20251220_0330 20251219_0330 20251218_0345 20251217_0332 20251216_0333 20251215_0333 20251214_0327 20251212_0333 20251211_0331 20251210_0332 20251209_0331 20251208_0328 20251207_0327 20251206_0330 20251205_0331 20251204_0331 20251203_0333 20251202_0335 20251201_0328 20251130_0327 20251129_0328 20251128_0327 20251127_0327 20251126_0329 20251125_0327 20251124_0327 20251123_0326 20251122_0328 20251121_0328 20251120_0329 20251119_0328 20251118_0328 20251117_0326 20251116_0325 20251115_0327 20251114_0328 20251113_0330 20251112_0329 20251111_0328 20251110_0325 20251109_0326 20251108_0328 20251107_0328 20251106_0329 20251105_0326 20251104_0327 20251103_0324 20251102_0326 20251101_0324 20251031_0328 20251030_0330 20251029_0329 20251028_0329 20251027_0322 20251026_0327 20251025_0331 20251024_0329 20251023_0329 20251022_0330 20251021_0331 20251020_0328 20251019_0321 20251018_0327 20251017_0320 20251016_0328 20251015_0328 20251014_0323 20251011_0328 20251010_0330 20251009_0321 20251008_0343 20251007_0353 20251006_0325 20251005_0350 20251004_0352 20251003_0352 20251002_0356 20251001_0321 20250925_0335 20250924_0350 20250923_0348 20250922_0346 20250921_0345 20250920_0342 20250919_0346 20250918_0342 20250917_0336 20250916_0333 20250915_0333 20250914_0328 20250913_0322 20250912_0335 20250911_0337 20250910_0338 20250909_0341 20250908_0342 20250907_0333 20250906_0350 20250905_0319 20250904_0323 20250903_0355 20250902_0325 20250901_0355 20250831_0355 20250830_0356 20250829_0355 20250828_0333 20250827_1654 20250827_1602 20250827_1557 20250827_0320 20250826_0320 20250825_1752 20250825_1709 20250825_1652 20250825_1647 20250825_1645 20250825_1631 20250825_1606 20250825_1559 20250825_1558 20250825_1556 20250825_1531 20250825_1525 20250825_1516 20250825_1450 20250825_1444 20250825_1438 20250825_1414 20250825_1413 20250825_1410 20250825_1408 20250825_1405 20250825_1401 20250825_1355 20250825_1347 20250825_1345 20250825_1344 20250825_1343 20250825_1340 20250825_1339 20250825_1333 20250825_1323 20250825_1317 20250825_1243 20250824_0342 20250823_0343 20250823_0142 20250822_2331 20250822_2308 20250822_2258 20250822_2241 20250822_2228 20250822_2206 20250822_2147 20250822_2111 20250822_1259 20250822_1233 20250822_1229 20250822_1223 20250822_1210 20250822_1201 20250822_1111 20250822_1058 20250822_1052 20250822_1045 20250822_0657 20250822_0553