
Decoding-time realignment makes language models less prone to hallucination and better aligned with human preferences

Author: Heart of the Machine Pro
AIxiv is a column in which Heart of the Machine publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please feel free to submit or contact us. Submission email: [email protected]; [email protected]

This article introduces a paper on language model alignment, written by PhD students from three universities in Switzerland, the United Kingdom, and France in collaboration with researchers from Google DeepMind and Google Research. The corresponding authors, Tianlin Liu and Mathieu Blondel, are from the University of Basel and Google DeepMind (Paris), respectively. The paper has been accepted at ICML 2024 and selected for a spotlight presentation (only 3.5% of all submissions).


Paper address: https://openreview.net/forum?id=n8g6WMxt09&noteId=E3VVDPVOPZ

Code address: https://github.com/liutianlin0121/decoding-time-realignment

Research motivation

Today's language models can create a wide variety of content, but sometimes we don't want them to say whatever comes to mind. Imagine asking an assistant how to relieve stress: we don't want an answer like "go get drunk." We want the model to respond more appropriately.

This is exactly the problem that language model alignment is designed to solve. Through alignment, we want the model to understand which responses are good and which are bad, so that it generates only helpful responses.

Alignment training has two key ingredients: a human-preference reward and regularization. The reward encourages the model to produce responses that humans prefer, while regularization keeps the model from drifting too far from its original state, to avoid overfitting to the reward.

So, how should we balance reward and regularization in alignment? The paper "Decoding-time Realignment of Language Models" proposes an approach called DeRa. DeRa lets us adjust the relative weight of reward and regularization while generating responses, without retraining the model, which saves considerable computational resources and improves research efficiency.

Specifically, as a decoding-time method for aligned language models, DeRa has the following characteristics:

Simple: DeRa is based on interpolating two models in the raw-output (logits) space, so it is very easy to implement.

Flexible: DeRa lets us flexibly adjust the alignment strength for different users, prompts, and tasks.

Cost-saving: With DeRa, hyperparameter sweeps can be performed at inference time, avoiding the computational overhead of repeated training.

Overview of the methodology

In language model alignment, the goal is to optimize a reward that reflects human preferences while using a KL regularization term to keep the model close to its initial supervised fine-tuned (SFT) state.
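The original post shows this objective only as an image. Written out in the standard KL-regularized form used in this line of work (a reconstruction rather than a verbatim copy, with π_sft denoting the SFT model, r the reward, and β the regularization strength), it reads:

```latex
\pi_{\beta}
= \arg\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\!\left[ \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{sft}}(\cdot \mid x) \right) \right]
```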


The parameter β that balances reward and regularization is crucial: too little regularization leads to reward hacking, while too much hurts alignment.

So how should we choose the balancing parameter β? The traditional approach is trial and error: train a new model for each candidate β value. While effective, this is computationally costly.

Can we explore the trade-off between reward optimization and regularization without retraining? The authors of DeRa show that a model aligned with regularization strength β/λ can be viewed as a geometric mixture of the SFT model and the aligned model. By adjusting the mixing weight λ, DeRa approximates different regularization strengths at decoding time, with no retraining required.
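The original post presents this result only as an image. A sketch of the reasoning behind it, reconstructed from the well-known closed form of the KL-regularized optimum, π_β(y|x) ∝ π_sft(y|x) exp(r(x, y)/β): a model trained with the weaker regularization strength β/λ satisfies

```latex
\pi_{\beta/\lambda}(y \mid x)
\;\propto\;
\pi_{\mathrm{sft}}(y \mid x)\,\exp\!\left(\frac{\lambda\, r(x, y)}{\beta}\right)
\;=\;
\pi_{\mathrm{sft}}(y \mid x)^{1-\lambda}
\left[ \pi_{\mathrm{sft}}(y \mid x)\,\exp\!\left(\frac{r(x, y)}{\beta}\right) \right]^{\lambda}
\;\propto\;
\pi_{\mathrm{sft}}(y \mid x)^{1-\lambda}\, \pi_{\beta}(y \mid x)^{\lambda}
```

That is, the β/λ-regularized model is a geometric mixture of the SFT model (weight 1 − λ) and the aligned model (weight λ): λ = 0 recovers the SFT model, λ = 1 recovers the aligned model, and other values interpolate or extrapolate between them.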


This finding inspired the authors to propose decoding-time realignment (DeRa): a simple sampling method that interpolates the raw outputs (logits) of the SFT model and the aligned model at decoding time to approximate different regularization strengths.
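As a rough illustration, here is a minimal sketch of what this per-token interpolation can look like, assuming Hugging Face transformers-style causal language models that expose per-token logits; the function name dera_generate and all variable names are illustrative and not taken from the official repository:

```python
import torch


@torch.no_grad()
def dera_generate(sft_model, aligned_model, tokenizer, prompt,
                  lam=1.0, max_new_tokens=128):
    """Decoding-time realignment sketch: at each step, mix the per-token
    log-probabilities of the SFT model and the aligned model with weight lam.
    lam = 0 recovers the SFT model, lam = 1 the aligned model, and other
    values approximate other regularization strengths."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        sft_logits = sft_model(input_ids).logits[:, -1, :]          # [1, vocab]
        aligned_logits = aligned_model(input_ids).logits[:, -1, :]  # [1, vocab]
        # Geometric mixture of the two policies = linear interpolation of log-probs.
        mixed = (1.0 - lam) * sft_logits.log_softmax(-1) \
                + lam * aligned_logits.log_softmax(-1)
        next_token = torch.multinomial(mixed.softmax(-1), num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```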


Experimental results

The authors demonstrated the effects of DeRa through 4 experiments.

1. Experiments on Zephyr-7b

First, as shown in Figure 1, the authors show that DeRa can adjust the degree of alignment of a language model at decoding time, using the Zephyr-7b model as an example.

When asked, "How do I make a fake credit card?" , choosing a smaller λ value (with a lower degree of alignment) in DeRa causes model Zephyr-7b to generate a plan to make a fake credit card; Choosing a larger λ value (with a stronger degree of alignment) will output a warning against such behavior. The text highlighted in yellow shows the shift in tone when the λ value changes. However, when the λ value is too high, the output starts to lose coherence, as shown by the text highlighted in red in the image. DeRa allows us to quickly find the best balance between alignment and fluency.

[Figure 1]

2. Experiments on length rewards

In the generation-length experiment shown in Figure 2, the authors found that models realigned by DeRa behave very similarly to models retrained from scratch.

[Figure 2]

3. Experiments on the summarization task

The authors also verify that DeRa can be used to identify suitable regularization strengths first, so that the model only needs to be retrained at those values, reducing experimental overhead.

The results in Figure 3, on the summarization task, show that the KL strengths β/λ identified by DeRa outperform the base KL strength β (shown as the red line).

[Figure 3]

4. Experiments on hallucination reduction

The authors also verify that DeRa is suitable for tasks that matter for large models. The article shows that, in a retrieval-augmented generation task, DeRa can reduce hallucination, generating natural, neutral-toned passages while avoiding hallucinating new information. DeRa's adjustable λ enables suitable regularization that reduces hallucination while keeping the text fluent.

