🤗 Model | 📂 Github | ✍️ Blog

Introduction


The process of training a language model to generate responses that follow human instructions can be divided into two main steps. In the first step, the language model is trained on a large corpus of text to predict the next token, giving it a basic understanding of language. In the second step, the model is tuned on instruction data consisting of instruction-response pairs so that it can generate appropriate answers to user requests. Pre-training alone may not enable the model to respond appropriately to user requests, but a model that has undergone instruction tuning is likely to give more accurate answers to a wide range of user instructions.
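For concreteness, a hypothetical instruction-response pair for the second step might look like the following; the field names and text are illustrative only, not this project's actual data schema:

```python
# Hypothetical SFT example; field names and content are illustrative only.
sft_example = {
    "instruction": "Summarize the following article in one sentence.",
    "document": "The city council approved a budget that increases funding for public libraries ...",
    "response": "The city council passed a budget that boosts library funding.",
}

# During instruction tuning, the instruction and document form the prompt,
# and the model is trained to reproduce the reference response token by token.
prompt = f"{sft_example['instruction']}\n\n{sft_example['document']}\n\n"
target = sft_example["response"]
```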

Pre-training and instruction tuning, however, do not directly optimize for the ultimate goal of producing outputs that humans prefer. To address this, reinforcement learning from human feedback is employed: a reward model is trained to evaluate output quality using pairs of chosen and rejected answers sampled from the model, and the policy is then trained with PPO to generate content that humans prefer. However, PPO is sensitive to hyperparameters and requires multiple models (policy model, reference model, reward model, value model), which consumes significant resources.
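For reference, the reward model in this kind of setup is commonly trained with a Bradley-Terry pairwise loss on the chosen/rejected pairs. The sketch below is a generic PyTorch illustration of that loss, not this project's code:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the scalar reward of the chosen answer
    above that of the rejected answer for the same prompt."""
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative usage with dummy scalar scores from a reward head
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
loss = pairwise_reward_loss(chosen, rejected)
```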

To address this, this project compares DPO (Direct Preference Optimization), which directly compares two responses through supervised learning without a reward model, and RLOO (REINFORCE Leave-One-Out), which is relatively insensitive to hyperparameter changes. The goal is to build a robust Korean summarization model.
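For intuition about why RLOO is lighter than PPO: instead of a learned value model, each sampled response is baselined by the mean reward of the other samples for the same prompt. A minimal sketch of this leave-one-out advantage (illustrative, not this project's implementation):

```python
import torch

def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (k,) rewards for k sampled completions of the same prompt.
    Each sample is baselined by the mean reward of the other k - 1 samples."""
    k = rewards.numel()
    # Leave-one-out mean for sample i: (sum of rewards - r_i) / (k - 1)
    baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - baseline

# Example: rewards for 4 sampled summaries of one prompt
advantages = rloo_advantages(torch.tensor([0.9, 0.1, 0.5, 0.7]))
```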


Preliminaries


Supervised Instruction Tuning + NEFTune

$\mathcal{L}_{\text{SFT}}=-\sum_{i=n+1}^{N}\log\pi_\theta(x_i\mid x_{<i})$

where:

- $x_1,\dots,x_n$ are the instruction (prompt) tokens and $x_{n+1},\dots,x_N$ are the response tokens,
- $\pi_\theta$ is the language model with parameters $\theta$; the loss is computed only over the response tokens.
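In practice this is typically implemented by masking the instruction tokens out of the cross-entropy loss. Below is a minimal PyTorch-style sketch under that assumption; the variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """logits: (T, V) next-token logits, input_ids: (T,) instruction + response tokens.
    Only the response tokens (positions n+1..N) contribute to the loss."""
    # Predict token t+1 from position t
    shift_logits = logits[:-1]
    shift_labels = input_ids[1:].clone()
    # Ignore loss on the instruction (prompt) part
    shift_labels[: prompt_len - 1] = -100
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```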

The following is pseudocode for the noise-addition process in NEFTune ($\alpha=5$).


Require: dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, embedding layer $\text{emb}(\cdot)$, rest of model $f_{/\text{emb}}(\cdot)$, model parameters $\theta$, loss $\mathcal{L}_{\text{SFT}}=\text{loss}(\cdot)$, optimizer $\text{opt}(\cdot)$
Require: hyperparameter: base noise scale $\alpha \in \mathbb{R}^+$
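As a companion to the pseudocode, here is a minimal PyTorch-style sketch of the noise-addition step, assuming the embedding tensor has shape (batch, sequence length $L$, embedding dimension $d$) and ignoring padding; this is an illustration, not this project's exact code:

```python
import torch

def neftune_noise(embeds: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """embeds: (B, L, d) output of emb(x). Adds uniform noise in [-1, 1]
    scaled by alpha / sqrt(L * d), as described in NEFTune."""
    B, L, d = embeds.shape
    scale = alpha / (L * d) ** 0.5
    noise = torch.empty_like(embeds).uniform_(-1.0, 1.0) * scale
    return embeds + noise
```

The noise is added only during training; at evaluation time the embeddings are used unchanged.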