AI4Science Paper Digest

Snapshot: 20260303_0332

Reinforcement Learning from Human Feedback

Authors: Nathan Lambert

First: 2025-04-16T21:36:46+00:00 · Latest: 2026-02-27T18:22:58+00:00

Comments: 204 pages. Web-native version at https://rlhfbook.com/ (continually improving; latest version on the website)

Abstract

Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.

Summary

Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems.

Test-Time Training with KV Binding Is Secretly Linear Attention

Authors: Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li

First: 2026-02-24T18:59:30+00:00 · Latest: 2026-02-27T15:30:32+00:00

Comments: Webpage: https://research.nvidia.com/labs/sil/projects/tttla/

Abstract

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.

Summary

Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time.
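
Illustration (not from the paper): the simplest case of the claimed equivalence can be checked by hand. The sketch below assumes a linear fast-weight matrix W initialized to zero, an inner loss L(W) = -<v, W k>, and one gradient step per token; these choices, and the names ttt_recurrent and causal_linear_attention, are assumptions made for this example rather than the authors' formulation. Under them, the sequential test-time update and unnormalized causal linear attention should produce the same outputs.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def ttt_recurrent(keys, values, queries, lr=1.0):
    """Sequential view: one inner gradient step per token on fast weights W."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0.0] * d_k for _ in range(d_v)]  # assumption: W starts at zero
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # Assumed inner loss L(W) = -<v, W k>; its gradient is -v k^T,
        # so a gradient-descent step adds lr * v k^T to W.
        for i in range(d_v):
            for j in range(d_k):
                W[i][j] += lr * v[i] * k[j]
        # Read out with the updated weights: o_t = W_t q_t.
        outputs.append([dot(row, q) for row in W])
    return outputs


def causal_linear_attention(keys, values, queries, lr=1.0):
    """Attention view: o_t = lr * sum over s <= t of (k_s . q_t) v_s, no softmax."""
    outputs = []
    for t, q in enumerate(queries):
        o = [0.0] * len(values[0])
        for s in range(t + 1):
            weight = lr * dot(keys[s], q)
            for i in range(len(o)):
                o[i] += weight * values[s][i]
        outputs.append(o)
    return outputs


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]
    values = [[2.0, 3.0], [-1.0, 0.5], [0.0, 1.0]]
    queries = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.5]]
    a = ttt_recurrent(keys, values, queries, lr=0.5)
    b = causal_linear_attention(keys, values, queries, lr=0.5)
    print(all(abs(x - y) < 1e-9 for ra, rb in zip(a, b) for x, y in zip(ra, rb)))
```

With a squared-error inner loss the update would instead be a delta rule, so this sketch covers only the plainest instance; the abstract's claim concerns a broader class of TTT architectures.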

Task Complexity Matters: An Empirical Study of Reasoning in LLMs for Sentiment Analysis

Authors: Donghao Huang, Zhaoxia Wang

First: 2026-02-27T14:49:05+00:00 · Latest: 2026-02-27T14:49:05+00:00

Comments: 12 pages, 1 figure, 3 tables. Accepted at PAKDD 2026