Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
Authors: Jingze Ge, Yun Liu, Xue Geng, Wanqi Dong, Wang Zhe Mark, Min Wu, Xulei Yang
First: 2026-05-04T17:05:45+00:00 · Latest: 2026-05-04T17:05:45+00:00
Comments: 15 pages, 3 figures, supplementary material included
Abstract
Adapting large pretrained models to diverse tasks is now routine, yet the two dominant strategies of parameter-efficient fine-tuning (PEFT) and low-rank compression are typically composed in sequence. This decoupled practice first compresses and then fine-tunes adapters, potentially misaligning the compressed subspace with downstream objectives and squandering a global parameter budget. To overcome this limitation, we introduce JACTUS (Joint Adaptation and Compression with a Task-aware Union of Subspaces), a single framework that unifies compression and adaptation. From a small calibration set, JACTUS estimates input and pre-activation gradient covariances, forms their orthogonal union with the pretrained weight subspace, performs a projected low-rank approximation inside this union, allocates rank globally by marginal gain per parameter, and trains only a compact core matrix. This explicitly mitigates the potential misalignment between the compressed subspace and downstream objectives by coupling the directions preserved for compression with those required for adaptation, yielding a deployable low-rank model that avoids retaining full frozen weights while enabling fast and robust tuning. On vision, JACTUS attains an average 89.2% accuracy on ViT-Base across eight datasets at 80% retained parameters, surpassing strong 100% PEFT baselines (e.g., DoRA 87.9%). On language, JACTUS achieves an 80.9% average on Llama2-7B commonsense QA at the same 80% retained-parameter budget, outperforming 100% PEFT (e.g., DoRA 79.7%) and exceeding prior compress-then-finetune pipelines under the same retained-parameter budget. We will release code.
Summary / 总结
Adapting large pretrained models to diverse tasks is now routine, yet the two dominant strategies of parameter-efficient fine-tuning (PEFT) and low-rank compression are typically composed in sequence.
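The JACTUS abstract mentions allocating rank globally by marginal gain per parameter. A minimal sketch of one way such a greedy allocation could look is given here; it is not the authors' implementation. The function name `allocate_ranks`, the use of a precomputed per-rank gain list (for example squared singular values), and the cost of `rows + cols` parameters per added rank are all assumptions made for illustration.

```python
import heapq


def allocate_ranks(layers, budget):
    """Greedy global rank allocation by marginal gain per parameter.

    layers: dict name -> (rows, cols, gains). gains[k] is the assumed benefit
            of adding the (k+1)-th rank to that layer, sorted descending
            (hypothetical input, e.g. squared singular values).
    budget: total number of parameters that may be spent across all layers.
    Returns a dict name -> allocated rank.
    """
    ranks = {name: 0 for name in layers}
    heap = []
    for name, (rows, cols, gains) in layers.items():
        if gains:
            cost = rows + cols  # assumption: one rank adds one column per factor
            heapq.heappush(heap, (-gains[0] / cost, name))

    remaining = budget
    while heap:
        _, name = heapq.heappop(heap)
        rows, cols, gains = layers[name]
        cost = rows + cols
        if cost > remaining:
            continue  # this layer no longer fits; cheaper layers may still fit
        ranks[name] += 1
        remaining -= cost
        k = ranks[name]
        if k < len(gains):
            # Offer this layer's next rank with its own gain-per-parameter ratio.
            heapq.heappush(heap, (-gains[k] / cost, name))
    return ranks


example_layers = {
    "attn": (4, 4, [9.0, 4.0, 1.0]),
    "mlp": (4, 16, [16.0, 10.0, 0.5]),
}
example_ranks = allocate_ranks(example_layers, budget=40)
```

Because every candidate rank competes in one shared heap, a small layer with modest gains can win ranks over a large layer whose gains are higher in absolute terms but lower per parameter.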
Bolek: A Multimodal Language Model for Molecular Reasoning
Authors: Frederic Grabowski, Jacek Szczerbiński, Maciej Jaśkowski, Kalina Jasińska-Kobus, Paweł Dąbrowski-Tumański, Tomasz Jetka, Bartosz Topolski
First: 2026-05-04T15:46:39+00:00 · Latest: 2026-05-04T15:46:39+00:00
Abstract
Molecular property models increasingly support high-stakes drug-discovery decisions, but their outputs are often difficult to audit: classical predictors return scores without rationale, while language models can produce fluent explanations weakly grounded in the input molecule.
We introduce Bolek, a compact multimodal language model that grounds natural-language reasoning in molecular structure by injecting a Morgan fingerprint embedding into an instruction-tuned text decoder. Bolek is fine-tuned on molecular alignment tasks, including molecule description, RDKit descriptor prediction, and substructure detection, and on downstream reasoning over 15 TDC binary classification tasks using synthetic chains-of-thought anchored in concrete molecular features.
Across these tasks, Bolek outperforms its Qwen3-4B-Instruct base on all endpoints in yes/no mode and on 13 of 15 in chain-of-thought mode, raising mean ROC/PR AUC from 0.55 to 0.76. It also outperforms TxGemma-9B-Chat on 13 of 15 binary classification tasks despite being less than half its size. Bolek's explanations are more grounded than those of the baseline LLMs: it cites numerical descriptors 10-100x more often per chain-of-thought, and the cited values agree strongly with RDKit for key descriptors such as TPSA, MolLogP, and MolWt (Spearman rho = 0.87-0.91). Generalisation extends beyond the training panel: on 15 unseen TDC classification endpoints, Bolek matches TxGemma on five, and it produces non-trivial rank correlations on three held-out regression endpoints despite never seeing downstream regression during training.
These results suggest that targeted modality injection and reasoning supervision tied to verifiable molecular features can yield compact, auditable molecular reasoning models.
Summary / 总结
Molecular property models increasingly support high-stakes drug-discovery decisions, but their outputs are often difficult to audit: classical predictors return scores without rationale, while language models can produce fluent explanations weakly grounded in the input molecule.
CNNs for Vis-NIR Chemometrics: From Contradiction to Conditional Design
Authors: Dário Passos
First: 2026-05-04T14:21:02+00:00 · Latest: 2026-05-04T14:21:02+00:00
Comments: 19 pages, 1 figure, review article
Abstract
Near-infrared (NIR; a.k.a. NIRS) deep-learning studies in chemometrics increasingly report mutually inconsistent conclusions regarding convolutional neural network (CNN) design, including small versus large kernels, shallow versus deep architectures, raw spectra versus preprocessing, and single-domain training versus transfer learning. As a result, the same architecture can appear superior in one study and inferior in another, creating a practical impasse for chemometric practitioners. In this review, we argue that these contradictions are not evidence of irreconcilable methods but a structurally expected consequence of uncontrolled moderating variables. Specifically, we trace recurring disagreements to (i) the indirect nature of Vis-NIR measurement in water-dominated matrices, (ii) mismatch between effective receptive field (ERF) and the width of informative spectral structure, and (iii) validation design (including split strategy, hyperparameter tuning budget, and exposure to deployment-like shifts) acting as a hidden hyperparameter that can dominate model ranking. Building on evidence from published chemometrics and spectroscopy studies, we propose a conditional design framework that links architecture and preprocessing choices to spectral physics, dataset regime, and intended deployment scenario. Overall, the proposed perspective moves DL Chemometrics from template-driven architecture selection toward reproducible, physics-aware, and deployment-aligned model comparison.
Summary / 总结
Near-infrared (NIR; a.k.a. NIRS) deep-learning studies in chemometrics increasingly report mutually inconsistent conclusions regarding convolutional neural network (CNN) design, including small versus large kernels, shallow versus deep architectures, raw spectra versus preprocessing, and single-domain training versus transfer learning.
TRACED: In vivo imaging of extracellular intrinsic diffusivity, tortuosity, cell size distribution and cell density in human glioma patients
Authors: Joshua K. Marchant, Hong-Hsi Lee, Elizabeth R. Gerstner, Susie Y. Huang, Bruce R. Rosen
First: 2026-05-04T14:03:48+00:00 · Latest: 2026-05-04T14:03:48+00:00
Comments: 14 pages, 8 figures (main); 2 pages, 4 figures (supplementary). Submitted to Magnetic Resonance in Medicine
Abstract
The lack of analytical models describing diffusion time dependence at intermediate time scales in complex tissue microstructure limits the accurate quantification of extracellular diffusivity and tissue microstructure. We introduce TRACED, a biophysical model that incorporates diffusion time dependence in cell distributions to quantify pathologically-relevant properties in solid tumors. Neural networks were trained on Monte Carlo diffusion simulations using sphere distribution-based geometries to enable the rapid computation of time-dependent diffusion MRI signals in cell populations of variable cell size. Model sensitivity and fit performance were assessed via simulation. Diffusion data from eight mixed-grade glioma patients was fitted using the TRACED model. Data fitting was performed using a novel physics-informed transfer learning pipeline, Sim2PINN. In two patients, cell size measurements were compared directly with image-localized histology. Simulation results indicate improved parameter estimation compared to the simple two-compartment model. TRACED enabled the simultaneous in vivo quantification of intracellular volume fraction, cell size distribution, extracellular intrinsic diffusivity, and tortuosity in glioma patients. Neural network implementations of diffusion time-dependence and tortuosity showed behavior consistent with coarse-graining and effective medium theory, respectively. Future work will explore the clinical utility of TRACED parameters in additional patients.
Summary / 总结
The lack of analytical models describing diffusion time dependence at intermediate time scales in complex tissue microstructure limits the accurate quantification of extracellular diffusivity and tissue microstructure.
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
Authors: Nedjma Ousidhoum, Junho Myung, Carla Perez-Almendros, Jiho Jin, Amr Keleg, Meriem Beloucif, Yi Zhou, Rodrigo Agerri, Vladimir Araujo, Naomi Baes, James Barry, Joanne Boisson, Nancy F. Chen, Christine de Kock, Aleksandra Edwards, Joseba Fernandez de Landa, Mohamed Fazli Imam, Huda Hakami, Shu-Kai Hsieh, Joseph Marvin Imperial, Roy Ka-Wei Lee, Zhengyuan Liu, Chenyang Lyu, Younes Samih, Johan Sjons, Bryan Tan, Asahi Ushio, Weihua Zheng, Alice Oh, Jose Camacho-Collados
First: 2026-05-04T13:49:44+00:00 · Latest: 2026-05-04T13:49:44+00:00
Comments: SemEval-2026 Task Description Paper. Data and resources are available at https://github.com/BLEnD-SemEval2026/SemEval-2026-Task-7
Abstract
We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al. 2024), covering more than 30 language-culture pairs, predominantly representing low-resource languages spoken across multiple continents. As the task is designed strictly for evaluation, participants were not permitted to use the data for training, fine-tuning, few-shot learning, or any other form of model modification. Our task includes two tracks: (a) Short-Answer Questions (SAQ) and (b) Multiple-Choice Questions (MCQ). Participants were required to predict labels and were allowed to submit any NLP system and adopt diverse modelling strategies, provided that the benchmark was used solely for evaluation. The task attracted more than 140 registered participants, and we received final submissions from 62 teams, along with 19 system description papers. We report the results and present an analysis of the best-performing systems and the most commonly adopted approaches. Furthermore, we discuss shared insights into open questions and challenges related to evaluation, misalignment, and methodological perspectives on model behaviour in low-resource languages and for under-represented cultures.
Summary / 总结
We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures.
Poodle: Seamlessly Scaling Down Large Language Models with Just-in-Time Model Replacement
Authors: Nils Strassenburg, Boris Glavic, Tilmann Rabl
First: 2025-12-05T08:36:39+00:00 · Latest: 2026-05-04T08:44:53+00:00
Abstract
Businesses increasingly rely on large language models (LLMs) to automate simple repetitive tasks instead of developing custom machine learning models. LLMs require few, if any, training examples and can be utilized by users without expertise in model development. However, this comes at the cost of substantially higher resource and energy consumption compared to smaller models, which often achieve similar predictive performance for simple tasks. In this paper, we present our vision for just-in-time model replacement (JITR), where, upon identifying a recurring task in calls to an LLM, the model is replaced transparently with a cheaper alternative that performs well for this specific task. JITR retains the ease of use and low development effort of LLMs, while saving significant cost and energy. We discuss the main challenges in realizing our vision regarding the identification of recurring tasks and the creation of a custom model. Specifically, we argue that model search and transfer learning will play a crucial role in JITR to efficiently identify and fine-tune models for a recurring task. Using our JITR prototype Poodle, we achieve significant savings for exemplary tasks.
Summary / 总结
Businesses increasingly rely on large language models (LLMs) to automate simple repetitive tasks instead of developing custom machine learning models.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
Authors: Sanjiv R. Das, Harshad Khadilkar, Sukrit Mittal, Daniel Ostrov, Deep Srivastav, Hungjen Wang
Venue: The Journal of Finance and Data Science, Volume 12, 2026, 100186, ISSN 2405-9188
First: 2026-05-04T07:48:02+00:00 · Latest: 2026-05-04T07:48:02+00:00
Abstract
Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) problems. Each GBWM problem involves a multiple year scenario over which the investor looks to optimally choose an investment portfolio each year and choose to fulfill all, some, or none of the different financial goals that arise each year. These choices seek to maximize the expected total investor utility obtained from the fulfilled financial goals. By eliminating separate training and optimization for each new investor problem, the MetaRL model in inference mode produces near-optimal dynamic investment portfolio and goal-fulfilling strategies for a new GBWM problem within a few hundredths of a second. This delivers expected utilities that are, on average, 97.8% of the optimal expected utilities (determined via Dynamic Programming). These results are remarkably robust to capital market regime changes, even when training uses only one capital market regime. Further, the MetaRL approach can enable solving problems with larger state spaces where Dynamic Programming becomes computationally infeasible.
Summary / 总结
Applying concepts related to zero-shot meta-learning and pre-training of foundation models, we develop a meta reinforcement learning approach (denoted MetaRL) that is pre-trained on thousands of goals-based wealth management (GBWM) problems.
Reliability-Oriented Multilingual Orthopedic Diagnosis: A Domain-Adaptive Modeling and a Conceptual Validation Framework
Authors: Danish Ali, Li Xiaojian, Sundas Iqbal, Farrukh Zaidi
First: 2026-05-04T06:20:36+00:00 · Latest: 2026-05-04T06:20:36+00:00
Abstract
Large Language Models (LLMs) are increasingly proposed for clinical decision support, including multilingual diagnosis in low-resource settings. However, their reliability, calibration and safety characteristics remain insufficiently understood for structured, high-risk tasks. We present a system-level analysis of multilingual orthopedic diagnosis from free-text clinical notes in English, Hindi and Punjabi. We evaluate three modeling regimes: (i) task-aligned multilingual transformer encoders, (ii) a task-fine-tuned baseline (DistilBERT), and (iii) a domain-adaptive architecture tailored to orthopedic text (IndicBERT-HPA). These models are compared with zero-shot, instruction-tuned LLMs to assess suitability for structured diagnostic classification. Results indicate that while LLMs exhibit strong linguistic fluency, they show unstable calibration and reduced reliability under structured multilingual conditions, particularly in low-resource languages. These findings are specific to zero-shot evaluation and do not imply limitations of fine-tuned models. Domain-adaptive specialization substantially improves cross-lingual discrimination and confidence behavior. IndicBERT-HPA, with language-specific orthopedic adapter heads, achieves consistently strong performance across six diagnostic categories and more predictable deployment characteristics than task-only adaptation. Building on these observations, we outline a conceptual deterministic agent-based validation framework for future implementation, formalizing evidence checks, language-sensitive validation and conservative human-in-the-loop gating. Reliable multilingual clinical decision support requires specialized architecture, explicit reliability analysis, and structured validation for safety-critical systems.
Summary / 总结
Large Language Models (LLMs) are increasingly proposed for clinical decision support including multilingual diagnosis in low-resource settings.
Demographic-Aware Transfer Learning for Sleep Stage Classification in Clinical Polysomnography
Authors: S M Asif Hossain, Shruti Kshirsagar
First: 2026-05-04T05:38:20+00:00 · Latest: 2026-05-04T05:38:20+00:00
Comments: Under review at IEEE SMC 2026
Abstract
Automated sleep stage classification typically employs a single population-agnostic model, disregarding established demographic variations in sleep architecture. Sleep patterns, however, differ substantially across gender, age, and obstructive sleep apnea (OSA) severity, indicating that a one-size-fits-all approach may be suboptimal for diverse clinical populations. In this paper, we propose a two-stage training strategy based on demographic stratification and transfer learning. We first pretrain a convolutional recurrent model on the full population and then fine-tune it independently for demographic subgroups defined by gender, age, and Apnea-Hypopnea Index (AHI) severity according to the AASM clinical standard. Using the DREAMT dataset comprising 100 clinical subjects and 7 PSG channels, we evaluate 37 fine-tuned configurations across single-axis and two-way demographic combinations. Results demonstrate that 35 of the 37 fine-tuned models outperform the baseline, with Cohen's kappa improvements ranging from 0.9% to 12.9%. These findings indicate that stratified fine-tuning tailored to specific patient demographics yields substantially more accurate sleep staging than a single generalized model, offering a practical and clinically grounded paradigm for personalized sleep assessment.
Summary / 总结
Automated sleep stage classification typically employs a single population-agnostic model, disregarding established demographic variations in sleep architecture.
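The sleep-staging entry reports its gains in Cohen's kappa. As a reminder of what that metric measures, here is a small sketch of the standard unweighted definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from the label marginals. The stage labels and the function name `cohens_kappa` are illustrative only and are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa between two equally long label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)

    # Observed agreement: fraction of positions where the labels match.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n

    # Chance agreement: product of the two marginal label frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)

    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Hypothetical epoch-level labels for a short recording.
reference = ["W", "N1", "N2", "N2", "REM", "W"]
predicted = ["W", "N2", "N2", "N2", "REM", "N1"]
kappa = cohens_kappa(reference, predicted)
```

Unlike plain accuracy, kappa discounts agreement that a classifier would reach just by matching the class frequencies, which matters for sleep data where some stages dominate the night.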
InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation
Authors: Jinqi Xiao, Qing Yan, Liming Jiang, Zichuan Liu, Hao Kang, Shen Sang, Tiancheng Zhi, Jing Liu, Cheng Yang, Xin Lu, Bo Yuan
First: 2025-12-25T21:37:12+00:00 · Latest: 2026-05-03T22:31:54+00:00
Abstract
Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA. The Mixture of Low-rank Experts (MoLE) architecture offers a modular solution, but its potential is usually limited by routing policies that operate at a token level. Such local routing can conflict with the global nature of user instructions, leading to artifacts like spatial fragmentation and semantic drift in complex image generation tasks. To address these limitations, we introduce InstructMoLE, a novel framework that employs an Instruction-Guided Mixture of Low-Rank Experts. Instead of per-token routing, InstructMoLE utilizes a global routing signal, Instruction-Guided Routing (IGR), derived from the user's comprehensive instruction. This ensures that a single, coherently chosen expert council is applied uniformly across all input tokens, preserving the global semantics and structural integrity of the generation process. To complement this, we introduce an output-space orthogonality loss, which promotes expert functional diversity and mitigates representational collapse. Extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks. Our work presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, enabling superior compositional control and fidelity to user intent.
Summary / 总结
Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA.
Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
Authors: Luo Ji, Qi Qin, Ningyuan Xi, Teng Chen, Qingqing Gu, Hongyan Li
First: 2026-05-03T17:13:45+00:00 · Latest: 2026-05-03T17:13:45+00:00
Comments: Accepted by ICML2026
Abstract
Conventional LLMs may suffer from corpus heterogeneity and subtle condition changes. While finetuning can cause catastrophic forgetting, the application of meta-learning to LLMs is also limited by its complexity and scalability. In this paper, we activate the meta-signal of $β$ within the SwiGLU blocks, resulting in a meta-gating mechanism that adaptively adjusts the nonlinearity of the FFN. A hypernetwork is employed which dynamically produces $β$ from textual conditions, providing meta-controllability over LLMs. By testing on different condition types such as task, domain, persona, and style, our method outperforms finetuning and meta-learning baselines, and can generalize reasonably to unseen tasks, condition types, or instructions. Our code can be found at https://github.com/AaronJi/MeGan.
Summary / 总结
Conventional LLMs may suffer from corpus heterogeneity and subtle condition changes.
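The meta-gated LLM entry hinges on the $β$ parameter of the Swish activation inside SwiGLU. The sketch below only illustrates how $β$ changes the shape of the gate, using the usual definition Swish_β(z) = z · sigmoid(βz); it passes $β$ in by hand and does not model the hypernetwork that the paper uses to produce it. The function names are hypothetical.

```python
import math


def _sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def swish(z, beta=1.0):
    """Swish_beta(z) = z * sigmoid(beta * z); beta = 1 is the usual SiLU."""
    return z * _sigmoid(beta * z)


def swiglu_gate(gate_pre, value_pre, beta=1.0):
    """Elementwise SwiGLU gating: Swish_beta(gate_pre) * value_pre.

    gate_pre and value_pre stand for the two linear projections of the same
    input; the projections themselves are omitted in this sketch.
    """
    return [swish(g, beta) * v for g, v in zip(gate_pre, value_pre)]


gated_default = swiglu_gate([1.0, -1.0], [2.0, 2.0], beta=1.0)
gated_sharp = swiglu_gate([1.0, -1.0], [2.0, 2.0], beta=50.0)
```

With $β$ = 0 the gate is linear (z / 2), and as $β$ grows it approaches ReLU, so a single scalar moves the FFN smoothly between a nearly linear and a hard-thresholded regime.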
Flexi-LoRA with Input-Adaptive Ranks: Efficient Finetuning for Speech and Reasoning Tasks
Authors: Zongqian Li, Yixuan Su, Han Zhou, Zihao Fu, Nigel Collier
First: 2026-05-03T16:45:36+00:00 · Latest: 2026-05-03T16:45:36+00:00
Abstract
Parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) have become essential for deploying large language models, yet their static parameter allocation remains suboptimal for inputs of varying complexity. We present Flexi-LoRA, a novel framework that dynamically adjusts LoRA ranks based on input complexity during both training and inference. Through empirical analysis across question answering, mathematical reasoning, and speech tasks, we demonstrate that maintaining consistency between training and inference dynamics is important for effective adaptation, particularly for sequential reasoning tasks. Our findings reveal that input-dependent parameter allocation achieves higher performance with fewer parameters by optimally matching rank configurations to question complexity. Furthermore, task-specific dependency on rank dynamics varies, with mathematical reasoning tasks exhibiting higher dependency than QA tasks. Successful adaptation manifests not only in correctness but also in reasoning quality and instruction adherence. Flexi-LoRA consistently outperforms static LoRA while using fewer parameters, with performance gains more pronounced on tasks requiring strict reasoning chains. Our approach realizes key benefits of mixture-of-experts frameworks through a more streamlined implementation, reducing parameter redundancy while improving model capabilities. We provide comprehensive empirical studies across diverse tasks, establishing a basis for future work in input-adaptive and efficient fine-tuning approaches.
Summary / 总结
Parameter-efficient fine-tuning methods like Low-Rank Adaptation (LoRA) have become essential for deploying large language models, yet their static parameter allocation remains suboptimal for inputs of varying complexity.
GD-FPS: Growth-Driven Feedforward Parameter Selection for Efficient Fine-Tuning
Authors: Kenneth Yang, Wen-Li Wei, Jen-Chun Lin
First: 2025-10-31T10:44:16+00:00 · Latest: 2026-05-03T07:50:19+00:00
Abstract
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations. Addition-based methods, such as Adapters, introduce inference latency and engineering complexity, whereas selection-based methods like Gradient-based Parameter Selection (GPS) require a full backward pass. The reliance on gradients not only incurs massive memory usage and substantial computational latency, but also leaves the selection vulnerable to the randomness of stochastic batch sampling. To resolve this, we propose Growth-Driven Feedforward Parameter Selection (GD-FPS). Operating entirely via forward passes, this strictly gradient-free method identifies the optimal parameter subset by scaling intrinsic weight magnitudes by their relative activation growth against a pre-training anchor. Evaluated on $26$ visual tasks spanning image classification and semantic segmentation, GD-FPS achieves competitive or superior performance over state-of-the-art PEFT baselines. Crucially, compared to GPS, it reduces peak memory usage by nearly $18\times$ and accelerates execution by over $2.7\times$ during the parameter selection stage. By guaranteeing deterministic selection, GD-FPS offers a memory-efficient, fast, and robust solution for fine-tuning.
Summary / 总结
Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key strategy for adapting large-scale pre-trained models to downstream tasks, but existing approaches face notable limitations.
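The GD-FPS abstract describes a gradient-free selection rule that scales weight magnitudes by relative activation growth against a pre-training anchor. The abstract does not give the exact formula, so the following is only a loose sketch under stated assumptions: the score of weight (i, j) is taken to be |w_ij| times the ratio of a task-data activation statistic to an anchor statistic for input feature j, and the top-k scores are selected. The function name, the argument layout, and the `eps` stabilizer are all hypothetical.

```python
def select_parameters(weights, task_act, anchor_act, k, eps=1e-8):
    """Pick k weight indices by growth-scaled magnitude (forward-only sketch).

    weights:    list of rows, weights[i][j] connects input feature j to unit i.
    task_act:   per-input-feature activation statistic measured on task data.
    anchor_act: the same statistic measured on the pre-training anchor.
    Both statistics are assumed to come from forward passes only.
    """
    # Assumed form of "relative activation growth": task / anchor ratio.
    growth = [t / (a + eps) for t, a in zip(task_act, anchor_act)]

    scored = [
        (abs(w) * growth[j], (i, j))
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
    ]
    # Sort by descending score; break ties by index so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index for _, index in scored[:k]]


toy_weights = [[0.5, -1.5], [1.0, 0.1]]
selected = select_parameters(toy_weights, task_act=[4.0, 1.0], anchor_act=[1.0, 1.0], k=2)
```

In this toy case the small weight at (0, 0) outranks the larger one at (0, 1) because its input feature became four times more active on the task data, which is the kind of reordering a growth-based score is meant to produce.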
AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
Authors: Zheda Mai, Arpita Chowdhury, Zihe Wang, Sooyoung Jeon, Lemeng Wang, Jiacheng Hou, Jihyung Kil, Wei-Lun Chao
Venue: CVPR 2026
First: 2025-06-10T05:43:34+00:00 · Latest: 2026-05-03T04:37:38+00:00
Comments: Accepted by CVPR 2026. The first two authors contribute equally
Abstract
The rise of vision foundation models (VFMs) calls for systematic evaluation. A common approach pairs VFMs with large language models (LLMs) as general-purpose heads, followed by evaluation on broad Visual Question Answering (VQA) benchmarks. However, this protocol has two key blind spots: (i) the instruction tuning data may not align with VQA test distributions, meaning a wrong prediction can stem from such data mismatch rather than a VFM's visual shortcomings; (ii) VQA benchmarks often require multiple visual abilities, making it hard to tell whether errors stem from lacking all required abilities or just a single critical one. To address these gaps, we introduce AVA-Bench, the first benchmark that explicitly disentangles 14 Atomic Visual Abilities (AVAs) -- foundational skills like localization, depth estimation, and spatial understanding that collectively support complex visual reasoning tasks. By decoupling AVAs and matching training and test distributions within each, AVA-Bench pinpoints exactly where a VFM excels or falters. Applying AVA-Bench to leading VFMs thus reveals distinctive "ability fingerprints," turning VFM selection from educated guesswork into principled engineering. Notably, we find that a 0.5B LLM yields similar VFM rankings as a 7B LLM while cutting GPU hours by 8x, enabling more efficient evaluation. By offering a comprehensive and transparent benchmark, we hope AVA-Bench lays the foundation for the next generation of VFMs.
Summary / 总结
The rise of vision foundation models (VFMs) calls for systematic evaluation.
Chebyshev-Augmented One-Shot Transfer Learning for PINNs on Nonlinear Differential Equations
Authors: Yiqi Rao, Pavlos Protopapas
Venue: ICLR 2026
First: 2026-05-02T22:49:37+00:00 · Latest: 2026-05-02T22:49:37+00:00
Comments: 18 pages, 4 figures, 9 tables, accepted to ICLR 2026 Workshop on Artificial Intelligence and Partial Differential Equations
Abstract
Physics-Informed Neural Networks (PINNs) offer a flexible paradigm for solving differential equations by embedding governing laws into the training objective. A persistent limitation is instance specificity: standard PINNs typically require retraining for each new forcing term, boundary/initial condition, or parameter setting. One-shot transfer learning (OTL) addresses this bottleneck for linear operators by freezing a pretrained latent representation and computing optimal output weights in closed form, but for nonlinear problems closed-form adaptation is generally unavailable because the loss is nonconvex in the output layer.
In this paper we substantially broaden the class of nonlinearities amenable to one-shot PINN transfer by combining OTL with Chebyshev polynomial surrogates. We approximate general smooth weakly nonlinear terms by truncated Chebyshev expansions over a prescribed solution range, yielding a polynomial nonlinearity that can be handled by a perturbative decomposition into linear subproblems. A multi-head PINN learns a reusable latent space associated with the dominant linear operator; at test time, solutions to new instances are obtained via a sequence of closed-form linear solves in the output layer, without retraining the network body.
We provide a unified derivation of the framework for ODEs and PDEs and demonstrate accuracy and fast online adaptation on nonlinear benchmarks, including non-polynomial and singular ODE nonlinearities as well as a reaction-diffusion PDE with saturating kinetics, demonstrating the method's utility in many-query regimes.
Summary / 总结
Physics-Informed Neural Networks (PINNs) offer a flexible paradigm for solving differential equations by embedding governing laws into the training objective.
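The Chebyshev-augmented PINN entry relies on replacing a smooth nonlinearity with a truncated Chebyshev expansion over a prescribed solution range. The sketch below shows only that surrogate step, using the standard Chebyshev-Gauss node formula for the coefficients; the perturbative decomposition and the closed-form output-layer solves from the paper are not reproduced. The function names and the choice of `math.tanh` as a stand-in saturating nonlinearity are assumptions for illustration.

```python
import math


def chebyshev_coefficients(f, a, b, degree, num_nodes=64):
    """Coefficients c_0..c_degree of a truncated Chebyshev expansion of f on [a, b]."""
    if num_nodes <= degree:
        raise ValueError("num_nodes must exceed degree")
    thetas = [math.pi * (k + 0.5) / num_nodes for k in range(num_nodes)]
    # Sample f at the Chebyshev-Gauss nodes mapped from [-1, 1] to [a, b].
    values = [f(0.5 * (b - a) * math.cos(t) + 0.5 * (a + b)) for t in thetas]
    coeffs = []
    for j in range(degree + 1):
        s = sum(v * math.cos(j * t) for v, t in zip(values, thetas))
        coeffs.append(2.0 * s / num_nodes)
    coeffs[0] *= 0.5  # the constant term carries a factor 1/2
    return coeffs


def chebyshev_eval(coeffs, a, b, u):
    """Evaluate sum_j c_j T_j(x) at u, with x the image of u in [-1, 1]."""
    x = (2.0 * u - a - b) / (b - a)
    t_prev, t_curr = 1.0, x  # T_0 and T_1
    total = coeffs[0]
    for j in range(1, len(coeffs)):
        total += coeffs[j] * t_curr
        # Three-term recurrence: T_{j+1} = 2 x T_j - T_{j-1}.
        t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
    return total


# Polynomial surrogate of a saturating nonlinearity on the range [-2, 2].
tanh_coeffs = chebyshev_coefficients(math.tanh, -2.0, 2.0, degree=15)
max_error = max(
    abs(chebyshev_eval(tanh_coeffs, -2.0, 2.0, -2.0 + 0.04 * i) - math.tanh(-2.0 + 0.04 * i))
    for i in range(101)
)
```

The surrogate is only trustworthy inside the interval it was fitted on, which is why the abstract speaks of a prescribed solution range: outside [a, b] the polynomial can diverge quickly from the original nonlinearity.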
Alignment midtraining for animals
Authors: Jasmine Brazilek, Miles Tidmarsh
First: 2026-03-21T01:32:24+00:00 · Latest: 2026-05-02T22:28:15+00:00
Comments: 34 pages
Abstract
We investigate the robustness of value alignment via midtraining with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and publicly release Animal Norms In Moral Assessment (ANIMA), a 26-question evaluation spanning 13 ethical dimensions, publicly available as a dataset and Inspect evaluation. On ANIMA, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines.
Summary / 总结
We investigate the robustness of value alignment via midtraining with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts.
Meta-learning Structure-Preserving Dynamics
Authors: Cheng Jing, Uvini Balasuriya Mudiyanselage, Woojin Cho, Minju Jo, Anthony Gruber, Kookjin Lee
Venue: ICML 2026
First: 2025-08-15T04:30:27+00:00 · Latest: 2026-05-02T20:59:06+00:00
Comments: Accepted to ICML 2026; full camera-ready version will be updated later
Abstract
Structure-preserving approaches to dynamics discovery have demonstrated great potential for modeling physical systems due to their use of strong inductive biases, which enforce key features such as conservation laws and dissipative behavior. However, these models are typically trained on a per-configuration basis, requiring explicit knowledge of system parameters and costly retraining when these parameters vary. While meta-learning provides a potential remedy, optimization-based approaches can suffer from limited generalizability. Motivated by recent advances in modulation-based learning aimed at mitigating these drawbacks, we systematically investigate the use of modulation techniques in learning conservative dynamical systems. We study a range of existing modulation strategies alongside newly proposed variants, integrating them into a Hamiltonian learning framework without requiring an explicit system parameterization. Through extensive experiments on benchmark problems, we demonstrate that modulation-based meta-learning enables accurate few-shot adaptation, achieving robust generalization across parameter space without compromising the conservation of key invariants responsible for the dynamics.
Summary / 总结
Structure-preserving approaches to dynamics discovery have demonstrated great potential for modeling physical systems due to their use of strong inductive biases, which enforce key features such as conservation laws and dissipative behavior.
Stable Localized Conformal Prediction via Transduction
Authors: Yinjie Min, Liuhua Peng, Changliang Zou
First: 2026-05-02T14:02:16+00:00 · Latest: 2026-05-02T14:02:16+00:00
Abstract
Existing evaluations of conformal prediction, such as prediction efficiency and test-conditional coverage, are defined in expectation over the calibration data. In practice, when only one calibration set of limited size is available, prediction sets often exhibit high variability in size, especially for methods with localization. We formalize this concern as set stability, defined as the variance of the conditional expectation of the set size given the calibration data. To improve stability without requiring additional target-task labels, we propose Stable Conformal Prediction (StCP), a transfer learning approach that utilizes labeled source-task data and unlabeled target data. Theoretically, we characterize the marginal coverage and stability of StCP; empirically, it delivers more stable prediction sets than standard conformal prediction methods, especially for those with localization, when calibration data are limited.
Summary / 总结
Existing evaluations of conformal prediction, such as prediction efficiency and test-conditional coverage, are defined in expectation over the calibration data.
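To make the set-stability notion in the abstract above concrete, here is a minimal sketch in pure Python. It runs standard split conformal prediction (not StCP itself) on a toy regression problem and estimates how much the interval width varies across independently drawn calibration sets. The data model, the fixed predictor, and all function names are assumptions made for illustration only.

```python
import math
import random
import statistics

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[k - 1]

def interval_width(n_cal, alpha, rng):
    """Width of the split-conformal interval from one calibration draw.

    Assumed toy setup: y = x + standard Gaussian noise, predictor f(x) = x,
    score = |y - f(x)|. The interval is f(x) +/- q, so its width is 2q
    for every test point.
    """
    scores = [abs(rng.gauss(0.0, 1.0)) for _ in range(n_cal)]
    return 2.0 * conformal_quantile(scores, alpha)

def width_variance(n_cal, alpha=0.1, n_repeats=500, seed=0):
    """Variance of the interval width over repeated calibration draws."""
    rng = random.Random(seed)
    widths = [interval_width(n_cal, alpha, rng) for _ in range(n_repeats)]
    return statistics.pvariance(widths)
```

In this toy setup the width does not depend on the test point, so the variance returned by `width_variance` plays the role of the variance of the expected set size given the calibration data. Comparing `width_variance(20)` with `width_variance(500)` shows the instability that small calibration sets introduce.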
The Pre-Training Study of Expanded-SPLADE Models on Web Document Titles
Authors: Hiun Kim, Tae Kwan Lee, Taeryun Won
First: 2026-05-02T12:07:49+00:00 · Latest: 2026-05-02T12:07:49+00:00
Abstract
Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models prior to retrieval fine-tuning. However, studies show that MLM pre-trained models have limited readiness and transfer learning issues for fine-tuning them into Neural Bi-Encoder models. This paper studies the effect of different pre-training datasets and pre-training options on the MLM pre-trained models for retrieval fine-tuning. The study focuses on the SPLADE-style model, which uses the MLM layer also at fine-tuning time. More specifically, we experimented with Expanded-SPLADE (ESPLADE) models, a specific instance of SPLADE models, and in-house web document titles are used as datasets. Pre-training, fine-tuning, and evaluation with optional test-time pruning of sparse vectors are conducted.
Our observations are three-fold. First, the fine-tuned models with higher retrieval effectiveness in both the unpruned and the strictest pruned settings are mostly those pre-trained on a general corpus and with a higher learning rate, and they show lower MLM accuracies. Second, in the strictest pruned setting, those models show a higher retrieval cost and a higher variance in the length of the individual postings lists. Third, repeating the general pre-training dataset does not have much effect on retrieval effectiveness. The experimentation empirically identifies potential limitations in aligning MLM pre-training to ESPLADE fine-tuning. It also provides an empirical observation that, in the strictest pruned settings, retrieval effectiveness is better maintained at a higher retrieval cost, showing a trade-off between the two in our setting.
Summary / 总结
Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models prior to retrieval fine-tuning.
Addressing Data Scarcity in Bangla Fake News Detection: An LLM-Based Dataset Augmentation Approach
Authors: Ahmed Alfey Sani, Kazi Akib Zaoad, Shefayat E Shams Adib, Md Abdul Muqtadir, Ajwad Abrar
First: 2026-05-02T07:02:29+00:00 · Latest: 2026-05-02T07:02:29+00:00
Comments: Accepted in 15th ACM ICSCA, 2026 in Langkawi, Malaysia
Abstract
The growing spread of misinformation in digital media highlights the need for reliable fake news detection systems, yet progress in under-resourced languages such as Bangla is limited by small and imbalanced datasets. This study investigates whether Large Language Model (LLM)-based augmentation can effectively address this limitation and improve Bangla fake news classification. Existing datasets remain valuable but are highly imbalanced, which limits model performance, and LLM-based augmentation for Bangla has scarcely been explored. To fill this gap, we propose a systematic augmentation framework that generates synthetic Bangla news articles using the instruction-tuned Gemma 3 27B IT model, supported by semantic filtering and controlled subsampling to preserve label consistency and diversity. We compare zero-shot and few-shot prompting, evaluate multiple augmentation rates, and examine random versus similarity-based selection strategies. Our experiments show that augmenting only the minority class with a high augmentation rate and random subsampling yields the strongest gains, raising the Fake News F1 score from 0.85 to 0.88. To support reproducibility and further research in this low-resource domain, we publicly release 4,545 synthetically generated Bangla fake news samples along with our full implementation. These findings demonstrate that well-designed LLM-driven augmentation can significantly improve fake news detection in low-resource settings and provide a practical foundation for advancing multilingual misinformation research.
Summary / 总结
The growing spread of misinformation in digital media highlights the need for reliable fake news detection systems, yet progress in under-resourced languages such as Bangla is limited by small and imbalanced datasets.
Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification
Authors: David J. Richter
First: 2026-05-02T06:33:38+00:00 · Latest: 2026-05-02T06:33:38+00:00
Comments: Master's thesis
Abstract
Plants, crops and their yields are essential to our very existence, but diseases and pests cause large losses every year. It is therefore vital that diseases be spotted early and treated accordingly, stopping the spread while that is still possible. Manual and traditional methods require personnel to walk through the field and check for symptoms 'by hand'. This is laborious and time-consuming, so ML methods have been applied, with promising results. CNN models are especially efficient, as they can automatically extract features from images without any manual feature construction before feeding those features to a classifier. Datasets largely determine the final performance of the model. Despite their importance to the field, there still seems to be a discrepancy between what is publicly available and what would be required to sufficiently train fully capable models. To overcome these shortcomings, this thesis identifies open datasets for plant leaf disease classification, as well as models that can be trained on them, and carries out extensive benchmarks to assess their suitability. A new dataset was then constructed based on those findings and on the findings of an augmentation applicability study. It is used to train a new Base Model based on the DenseNet201 architecture, which outperforms the baseline model on the new dataset and in domain-specific Transfer-Learning experiments for plant leaf disease classification on another new dataset. Through Transfer-Learning (TL), this new model trains downstream models faster, more robustly, more stably, and with less data than a general model would, overcoming a large number of issues that the field still suffers from.
Summary / 总结
Plants, crops and their yields are essential to our very existence, but diseases and pests cause large losses every year.
TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models
Authors: Mihai Nadas, Laura Diosan, Andrei Piscoran, Andreea Tomescu
First: 2025-04-29T10:15:28+00:00 · Latest: 2026-05-02T05:59:29+00:00
Comments: 18 pages, 6 tables, 1 figure. v2: revised evaluation with open-weight LLM judge panel, expanded citations
Abstract
Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We present TF1-EN-3M, to our knowledge the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space.
A fully reproducible evaluation pipeline employs a panel of open-weight LLM judges from distinct model families, scoring grammar, creativity, moral clarity, and template adherence, complemented by reference-free diversity and readability metrics. Among ten open-weight generator candidates, an 8B-parameter Llama-3 variant delivers the best quality-cost trade-off, producing high-scoring fables on consumer hardware at approximately $0.135 per 1,000 fables.
We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI -- demonstrating that large-scale moral storytelling requires neither proprietary giant models nor proprietary evaluation infrastructure.
Summary / 总结
Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons.
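The six-slot scaffold and the reported generation cost can be illustrated with a small sketch. The slot vocabularies, the prompt template, and the function names below are hypothetical and are not taken from the released generation code; only the slot order and the rate of about $0.135 per 1,000 fables come from the abstract.

```python
from itertools import product

# Hypothetical slot values; the real engine's vocabularies are not given here.
SLOTS = {
    "character": ["a fox", "a young crow"],
    "trait": ["greedy", "patient"],
    "setting": ["a frozen river", "an old orchard"],
    "conflict": ["a scarce harvest"],
    "resolution": ["sharing the food"],
    "moral": ["generosity is repaid"],
}

# Hypothetical template that follows the slot order from the abstract.
TEMPLATE = (
    "Write a fable about {character} who is {trait}, set in {setting}. "
    "The conflict is {conflict}, resolved by {resolution}. "
    "End with the moral: {moral}."
)

def build_prompts(slots):
    """Yield one prompt per combination of slot values (Cartesian product)."""
    names = list(slots)
    for combo in product(*(slots[name] for name in names)):
        yield TEMPLATE.format(**dict(zip(names, combo)))

def generation_cost(n_fables, usd_per_thousand=0.135):
    """Estimated cost in USD at a flat per-1,000-fable rate."""
    return n_fables / 1000 * usd_per_thousand
```

With these toy vocabularies the product yields 2 x 2 x 2 x 1 x 1 x 1 = 8 prompts. If all three million fables were generated at the quoted rate, `generation_cost(3_000_000)` gives roughly $405.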
GIFT: Guided Fine-Tuning and Transfer for Enhancing Instruction-Tuned Language Models
Authors: Zhiwen Ruan, Yichao Du, Jianjie Zheng, Longyue Wang, Yun Chen, Peng Li, Jinsong Su, Yang Liu, Guanhua Chen
Venue: ACL 2026
First: 2026-05-02T05:27:59+00:00 · Latest: 2026-05-02T05:27:59+00:00
Abstract
A promising paradigm for adapting instruction-tuned language models is to learn task-specific updates on a pretrained base model and subsequently merge them into the instruction-tuned model. However, existing approaches typically treat the instruction-tuned model as a passive target that is only involved at the final merging stage, without guiding the training process. We propose GIFT (Guided Fine-Tuning and Transfer), a simple and efficient framework that incorporates guidance from the instruction model into task adaptation. GIFT fine-tunes a low-rank adapter on the pretrained base model using confidence signals derived from the instruction-tuned model. The learned adapter is then merged into the instruction-tuned model, yielding task-specialized models that preserve general instruction-following behavior. We evaluate GIFT on mathematical and knowledge-intensive benchmarks across multiple model families and scales. Results show that GIFT consistently outperforms direct fine-tuning and representative transfer-based baselines, while maintaining robust generalization and favorable test-time scaling behavior.
Summary / 总结
A promising paradigm for adapting instruction-tuned language models is to learn task-specific updates on a pretrained base model and subsequently merge them into the instruction-tuned model.
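The merge step described above, where a low-rank adapter learned on the base model is added to the instruction-tuned weights, can be sketched as follows. This covers only the generic low-rank merge W' = W + scale * B A. The confidence-guided training that GIFT uses to learn the adapter is not specified in the abstract and is not modeled here, and the function names and the `scale` parameter are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]

def merge_adapter(w_instruct, lora_b, lora_a, scale=1.0):
    """Return w_instruct + scale * (lora_b @ lora_a) without mutating inputs.

    w_instruct: d_out x d_in weights of the instruction-tuned model
    lora_b:     d_out x r up-projection learned on the base model
    lora_a:     r x d_in down-projection learned on the base model
    """
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_instruct, delta)
    ]
```

For example, with `lora_b = [[1.0], [2.0]]` and `lora_a = [[1.0, 0.0, 1.0]]` the update has rank 1, and only the first and third columns of the instruction-tuned weights change.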
Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning
Authors: Lingfeng He, De Cheng, Huaijie Wang, Xi Yang, Nannan Wang, Xinbo Gao
Venue: ICML 2026
First: 2026-02-27T02:31:00+00:00 · Latest: 2026-05-02T01:48:44+00:00
Comments: Accepted by ICML 2026
Abstract
Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions, since these "null bases" of old tasks can remain nearly inactive for a new task when tasks are correlated. To address this, we study the learning capability of LoRA from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on the two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods.
Summary / 总结
Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge.
The Last Harness You'll Ever Build
Authors: Haebin Seong, Li Yin, Haoran Zhang, Zhan Shi
First: 2026-04-22T18:51:48+00:00 · Latest: 2026-05-01T22:22:53+00:00
Abstract
AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge. Each new task domain requires painstaking, expert-driven harness engineering: designing the prompts, tools, orchestration logic, and evaluation criteria that make a foundation model effective. We present a two-level framework that automates this process. At the first level, the Harness Evolution Loop optimizes a worker agent's harness $\mathcal{H}$ for a single task: a Worker Agent $W_{\mathcal{H}}$ executes the task, an Evaluator Agent $V$ adversarially diagnoses failures and scores performance, and an Evolution Agent $E$ modifies the harness based on the full history of prior attempts. At the second level, the Meta-Evolution Loop optimizes the evolution blueprint $Λ = (W_{\mathcal{H}}, \mathcal{H}^{(0)}, V, E)$ itself across diverse tasks, learning a blueprint $Λ^{(\text{best})}$ that enables rapid harness convergence on any new task -- so that adapting an agent to a novel domain requires no human harness engineering at all. We formalize the correspondence to meta-learning and present both algorithms. The framework shifts manual harness engineering into automated harness engineering, and takes one step further -- automating the design of the automation itself.
Summary / 总结
AI agents are increasingly deployed on complex, domain-specific workflows -- navigating enterprise web applications that require dozens of clicks and form fills, orchestrating multi-step research pipelines that span search, extraction, and synthesis, automating code review across unfamiliar repositories, and handling customer escalations that demand nuanced domain knowledge.
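The first-level loop described in the abstract above (worker executes, evaluator scores and diagnoses, evolution agent revises the harness from the full history) can be sketched as a plain control loop. The three callables, the stopping rule, and the `max_rounds` and `target` parameters are hypothetical stand-ins, not the paper's algorithm; in practice each role would be an LLM-backed agent.

```python
def evolve_harness(harness, worker, evaluator, evolver, max_rounds=5, target=1.0):
    """Iteratively refine a harness for a single task.

    worker(harness)   -> task output produced under this harness
    evaluator(output) -> (score, diagnosis)
    evolver(history)  -> new harness, given all prior attempts
    Returns the best-scoring harness and the full attempt history.
    """
    history = []
    best_score, best_harness = float("-inf"), harness
    for _ in range(max_rounds):
        output = worker(harness)
        score, diagnosis = evaluator(output)
        history.append((harness, score, diagnosis))
        if score > best_score:
            best_score, best_harness = score, harness
        if score >= target:  # assumed stopping rule
            break
        harness = evolver(history)  # the evolver sees every prior attempt
    return best_harness, history
```

The second-level Meta-Evolution Loop would wrap this function and search over the choice of `worker`, initial `harness`, `evaluator`, and `evolver` across tasks; that outer loop is omitted here.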
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
Authors: Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou
Venue: ACL
First: 2026-05-01T21:49:20+00:00 · Latest: 2026-05-01T21:49:20+00:00
Comments: 18 pages, 6 figures, 7 tables, accepted to conference ACL-2026, BEA
Abstract
Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLM's style with a specific instructor's tone while maintaining diagnostic correctness remains challenging. We ask: how can we update an LLM for automated feedback generation to align with a target instructor's style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor's grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal Policy Optimization (PPO), while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning: it updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while retaining correctness. For example, on APPS it boosts Style Alignment Score (SAC) to 96.2% (from 34.8% for Base) with Correctness Accuracy (CA) up to 100% on Llama-3 and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone and structure).
Summary / 总结
Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLMs style with a specific instructors tone while maintaining diagnostic correctness remains challenging.
Split-on-Share: Mixture of Sparse Experts for Task-Agnostic Continual Learning
Authors: Fatema Siddika, Md Anwar Hossen, Tanwi Mallick, Ali Jannesari
First: 2026-01-24T22:39:22+00:00 · Latest: 2026-05-01T18:20:49+00:00
Comments: we are updating the paper and will release another version soon
Abstract
Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge. Existing methods typically treat parameters uniformly, failing to distinguish between specific task knowledge and shared capabilities. We introduce Mixture of Sparse Experts for Task-Agnostic Continual Learning, referred to as SETA, a framework that resolves the plasticity-stability conflict by decomposing the model into modular subspaces. Unlike standard updates, where tasks compete for the same parameters, SETA separates knowledge into unique experts, designed to isolate task-specific patterns, and shared experts, responsible for capturing common features. This structure is maintained through elastic weight anchoring, which protects critical shared knowledge and enables a unified gating network to automatically retrieve the correct expert combination for each task during inference. Extensive experiments across diverse domain-specific and general benchmarks demonstrate that SETA consistently outperforms state-of-the-art parameter-efficient fine-tuning-based continual learning methods.
Summary / 总结
Continual learning in Large Language Models (LLMs) is hindered by the plasticity-stability dilemma, where acquiring new capabilities often leads to catastrophic forgetting of previous knowledge.
Probabilistic Predictions of Process-Induced Deformation in Carbon/Epoxy Composites Using a Deep Operator Network
Authors: Elham Kiyani, Amit Makarand Deshpande, Madhura Limaye, Zhiwei Gao, Zongren Zou, Sai Aditya Pradeep, Srikanth Pilla, Gang Li, Zhen Li, George Em Karniadakis
First: 2025-12-15T03:04:45+00:00 · Latest: 2026-05-01T14:40:04+00:00
Comments: 21 pages, 13 figures
Abstract
Fiber reinforcement and polymer matrix respond differently to manufacturing conditions due to mismatch in coefficient of thermal expansion and matrix shrinkage during curing of thermosets. These heterogeneities generate residual stresses over multiple length scales, whose partial release leads to process-induced deformation (PID), requiring accurate prediction and mitigation via optimized non-isothermal cure cycles. This study considers a unidirectional AS4 carbon fiber/amine bi-functional epoxy prepreg and models PID using a two-mechanism framework that accounts for thermal expansion/shrinkage and cure shrinkage. The model is validated against manufacturing trials to identify initial and boundary conditions, then used to generate PID responses for a diverse set of non-isothermal cure cycles (time-temperature profiles). Building on this physics-based foundation, we develop a data-driven surrogate based on Deep Operator Networks (DeepONets). A DeepONet is trained on a dataset combining high-fidelity simulations with targeted experimental measurements of PID. We extend this to a Feature-wise Linear Modulation (FiLM) DeepONet, where branch-network features are modulated by external parameters, including the initial degree of cure, enabling prediction of time histories of degree of cure, viscosity, and deformation. Because experimental data are available only at limited time instances (for example, final deformation), we use transfer learning: simulation-trained trunk and branch networks are fixed and only the final layer is updated using measured final deformation. Finally, we augment the framework with Ensemble Kalman Inversion (EKI) to quantify uncertainty under experimental conditions and to support optimization of cure schedules for reduced PID in composites.
Summary / 总结
Fiber reinforcement and polymer matrix respond differently to manufacturing conditions due to mismatch in coefficient of thermal expansion and matrix shrinkage during curing of thermosets.
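The FiLM DeepONet idea in the abstract above, where branch-network features are modulated by external parameters before being combined with trunk features, can be sketched with plain lists. Feature-wise scale-and-shift and the branch-trunk dot product are the standard FiLM and DeepONet forms. The linear map from the initial degree of cure to the modulation parameters, and all names and numbers, are assumptions for illustration.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma_i * feature_i + beta_i."""
    return [g * f + b for f, g, b in zip(features, gamma, beta)]

def film_params(alpha0, w_gamma, w_beta):
    """Hypothetical linear map from a scalar external parameter (for example
    the initial degree of cure) to per-feature (gamma, beta)."""
    gamma = [1.0 + w * alpha0 for w in w_gamma]  # alpha0 = 0 leaves features unscaled
    beta = [w * alpha0 for w in w_beta]
    return gamma, beta

def deeponet_output(branch_features, trunk_features):
    """DeepONet prediction at one query point: dot product of branch
    (input-function) features and trunk (query-location) features."""
    return sum(b * t for b, t in zip(branch_features, trunk_features))
```

A usage example: `gamma, beta = film_params(0.5, [2.0, 0.0], [0.0, 4.0])` followed by `deeponet_output(film([1.0, 3.0], gamma, beta), [1.0, 2.0])` modulates the two branch features and then evaluates the operator at the query encoded by the trunk features.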
H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations
Authors: Passant Elchafei, Hossam Emam, Mohamed Alansary, Monorama Swain, Markus Schedl
First: 2026-05-01T13:07:19+00:00 · Latest: 2026-05-01T13:07:19+00:00
Abstract
We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages). Task A evaluates standalone retrieval quality, while Task C assesses end-to-end retrieval-augmented generation (RAG) in multi-turn conversational settings, requiring both accurate answer generation and faithful grounding in retrieved evidence. Our approach implements a hierarchical parent-child RAG pipeline that separates fine-grained child-level retrieval from parent-level context reconstruction during generation. Documents are segmented into overlapping sentence-based child chunks, while full documents are preserved as parent units to provide coherent context. Retrieval combines hybrid dense-sparse search, tunable weighting, and embedding-based similarity rescoring over child chunks. Retrieved evidence is aggregated at the parent level and supplied to an instruction-tuned language model for response generation. H-RAG achieves an nDCG@5 score of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C (RB_agg: 0.2488, RL_F: 0.2703, RB_llm: 0.6508), underscoring the importance of retrieval configuration and parent-level aggregation in multi-turn RAG performance.
Summary / 总结
We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages).
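The reported Task C score can be checked directly: the harmonic mean of the three component scores quoted in the abstract reproduces 0.3241. The sketch below also illustrates weighted dense-sparse fusion and parent-level aggregation of child hits. The fusion formula, the max-over-children aggregation rule, and the function names are assumptions, since the abstract does not specify them.

```python
from statistics import harmonic_mean

def hybrid_score(dense, sparse, weight=0.5):
    """Assumed fusion: convex combination of dense and sparse scores."""
    return weight * dense + (1.0 - weight) * sparse

def aggregate_to_parents(child_hits, top_k=5):
    """Rank parent documents by their best-scoring child chunk.

    child_hits: iterable of (parent_id, child_score) pairs.
    """
    best = {}
    for parent_id, score in child_hits:
        if score > best.get(parent_id, float("-inf")):
            best[parent_id] = score
    return sorted(best, key=best.get, reverse=True)[:top_k]

# Component scores quoted in the abstract: RB_agg, RL_F, RB_llm.
task_c_parts = [0.2488, 0.2703, 0.6508]
task_c_score = harmonic_mean(task_c_parts)  # about 0.3241
```

The harmonic mean is dominated by the weakest components, which is why the Task C score sits much closer to RB_agg and RL_F than to RB_llm.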
Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation
Authors: Lakshan Cooray, Deshan Sumanathilaka, Pattigadapa Venkatesh Raju
First: 2026-01-31T11:27:25+00:00 · Latest: 2026-05-01T12:18:02+00:00
Comments: Submission Accepted at Frontiers in Artificial Intelligence, Natural Language Processing Section
Abstract
Customer-service question answering (QA) systems increasingly rely on conversational language understanding. While Large Language Models (LLMs) achieve strong performance, their high computational cost and deployment constraints limit practical use in resource-constrained environments. Small Language Models (SLMs) provide a more efficient alternative, yet their effectiveness for multi-turn customer-service QA remains underexplored, particularly in scenarios requiring dialogue continuity and contextual understanding. This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA, using a history summarization strategy to preserve essential conversational state. We also introduce a conversation stage-based qualitative analysis to evaluate model behavior across different phases of customer-service interactions. Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods. Results show notable variation across SLMs, with some models demonstrating near-LLM performance, while others struggle to maintain dialogue continuity and contextual alignment. These findings highlight both the potential and current limitations of low-parameterized language models for real-world customer-service QA systems.
Summary / 总结
Customer-service question answering (QA) systems increasingly rely on conversational language understanding.