Written by Kaili

With the development of large-scale language models, ChatGPT has led the emergence of various applications such as ChatPDF, BingGPT, NotionAI and so on. Much attention has been paid to the rapid progress of model generation, but little attention has been paid to the indispensable Embedding model that underpins the implementation of many large language models. This article will mainly introduce why Embedding models are very important in large language models, the current mainstream Embedding training methods, and some of our thoughts on the preliminary exploration of Embedding models.

First, Embedding technology introduction and historical summary

In machine learning and natural language processing, Embedding models refer to the process of mapping high-dimensional data (e.g., text, pictures, videos) to low-dimensional space. In simple terms, an embedding vector is an N-dimensional real-valued vector that represents the input data as points in a continuous numeric space. This article focuses on text embedding.

Why are Embedding models important in big language models?

Embedding is important because it can represent the semantics of a word or statement. Real-valued vector embedding can represent the semantics of words, mainly because these embedding vectors are learned based on the pattern of occurrence of words in the context of language. For example, if a word often appears with another word in some context, then the embedding vectors of the two words will have similar positions in the vector space. This means that they have similar meanings and semantics.

The concept of Embedding dates back to the mid-20th century, when Harris proposed a theory of distributed semantics. By the 80s of the 20th century, people began to try to use neural networks to learn embedding representations of words. Since 2010, with the development of deep learning technology, static vector Embedding represented by Word2Vec, GloVe, FastText and the use of ELMo, GPT, and BERT as representatives to generate context-sensitive dynamic vector embedding, the latter can better capture the semantic and contextual information of words.

Second, the value of Embedding in the big model

As mentioned earlier, and it is also well known to us, embedding vectors contain semantic information, and the more similar the meaning of words, the closer the position of embedding vectors in space. Real-valued vector embedding can perform vector operations and share and transfer in different natural language processing tasks by learning the semantic and contextual information of words from large amounts of data.

However, this was the value before Embedding. In the era of big language models, what is the new value of Embedding?

This starts with the flaws of the ChatGPT-like model. Despite their capabilities, there are still the following problems:

The training data is not real-time (for example, ChatGPT is based on data training before September 2021), and the retraining cost is too high and unrealistic
There is a limit to the length of input text, usually between a few thousand and tens of thousands of tokens
You can't access documents that can't be made public

In response, OpenAI published a document explaining how to solve the problem that GPT cannot process long text and up-to-date data based on embedding using a two-step search method. Two-step search, that is, first search the text library to find the relevant text part, and then add the retrieved text part to the input of the ChatGPT-like model to get the reply.

To illustrate a representative application, when we want the large model to respond to questions based on our given PDF document, we can chunk the very long PDF, get the embedding of each chunked content, and use vector database storage. Next, when you ask the question "How is xxx implemented in the documentation?" , you can use your problem embedding to search the database to get the PDF content block embedding that is most similar to the problem embedding. Finally, the retrieved PDF content block is entered into the model together with the problem to solve the problem of new knowledge and ultra-long text input.

Therefore, although the current discussion is not hot, the exploration of embedding models is indispensable for the landing of large language models.

Third, mainstream Embedding training methods

As mentioned earlier, OpenAI has long proposed Embedding-based search solutions to solve the problem of long text input and up-to-date data. Naturally, OpenAI also has an Embedding scheme with undisclosed training details: text-embedding-ada-002. This is OpenAI's second-generation Embedding model, which uses only one model to complete three downstream tasks simultaneously: text search, text similarity, and code search. Compared with the first-generation model, which is divided into five models to complete the above three tasks, the second-generation model is simplified to a single model and shows better performance in both Chinese and English tasks.

In this chapter, we will tease out some of the mainstream Embedding training methods. In recent years, much of Sentence Embedding's work has been based on BERT-like models. Taking Embedding from a model based on the Decoder structure, there is only a few researched and disclosed code. The training details of OpenAI's Embedding paper are also unclear. Therefore, in this chapter, we mainly tease out some representative methods of Sentence Embedding based on BERT-like models. Reflections on the exploration of acquiring Embedding based on the Decoder structural model will be discussed in Chapter 4.

In the pre-BERT era, word embedding trained by word2vec was generally used in combination with pooling strategy for sentence vector representation. In the BERT era, people took advantage of the inherent advantages of pre-trained language models, first using the [CLS] vector of the BERT model as a sentence vector representation, and then Sentence-BERT cleverly used the framework of the twin network model to obtain sentence vectors, and then BERT-Flow, BERT-Whitening, SimCSE, R-Drop, ESimCSE and other work. Among them, the more well-known are BERT-whitening and SimCSE, and then a lot of work has been based on comparative learning, and the data level and training level of constructing positive and negative sample pairs have been improved. This section provides a brief overview of this class of methods.

Since most of the recent Sentence Embedding work revolves around contrastive learning, let's first recall the basics of contrastive learning.

Compare learning contexts

Comparative learning is "to effectively learn data representations with the goal of bringing similar data closer and farther away from similar data". Given a pair of sample sets, where sum is similar samples, the optimization objective generally uses the cross-entropy loss function via in-batch negetives, as follows:

where the sum is the sentence vector of the sum, N is the size of the batch during training, the vector and cosine similarity, and the temperature hyperparameter.

Classic method

In recent years, since the birth of SimCSE, the field of sentence embedding has also caused a small wave of research boom. In this section, we mainly summarize the three works of SimCSE in a similar period (SimCSE, ESimCSE, CoSENT) in relative detail, and briefly summarize the subsequent part of the representative work.

SimCSE

SimCSE is one of the most out-of-the-loop jobs in the field of sentence embedding.

It is divided into two versions:

Unsupervised version of SimCSE: positive samples come from two similar representations resulting from applying different dropout masks to the same sentence, negative examples use in-batch negatives;
The supervised version of SimCSE constructs positive and negative samples based on the NLI dataset, with positive examples taking sentence pairs with relationships, negative examples taking sentence pairs with contradictory relationships (difficult negative examples) and in-batch negatives.

The above is the core idea of SimCSE, which is simple and effective, and at the same time very enlightening, leading a wave of research on sentence embedding technology.

ESimCSE

ESimCSE improves SimCSE from the perspective of positive and negative sample construction, respectively.

(1) Construction method of positive example pair:

Since SimCSE is constructed by adjusting the dropout rate, the positive case pairs are the same length, and the negative examples are of different lengths, which makes the model tend to judge sentences of the same or similar length to be more similar in expression.

To alleviate this problem, ESimCSE chooses to repeat some words randomly in sentences, which can change the length of the sentence without changing its semantics.

(2) Construction method of negative example pairs:

In contrastive learning, theoretically, the more negative pairs, the better the comparison between pairs. ESimCSE also followed this idea, but instead of directly forcing the large batch size, it maintained a queue, reused the coding embedding immediately preceding the mini-batch to extend the negative pair, and used a momentum encoder. The specific method is: since the queued sentence embedding comes from the previous mini-batch, you can take the moving average of its parameters to maintain the momentum update model, and use the momentum model to generate the queued sentence embedding. When using a momentum encoder, turn off dropout to close the gap between training and prediction. The parameters of the encoder and the momentum update The parameters of the encoder are updated, calculated according to the following formula:

is the momentum coefficient parameter. Note that only parameters are updated through backpropagation. We introduce here to generate sentence embeddings for queues, because momentum updates can make the ratio more stable than evolution. Therefore, although embeddings in the queue are encoded by different encoders (in different "steps" during training), the differences between these encoders can be minimal.

CoSENT

Early Sentence-BERT had problems with training, prediction inconsistencies, and tuning difficulties. However, if the prediction target COS value is directly optimized, the effect is often particularly poor. Isn't there a way to directly optimize the COS value?

Fortunately, the answer is no. Professor Su Jianlin proposed a CoSENT scheme, a loss function that optimizes the cos value:

Denoted that all sets of positive sample pairs and all sets of negative sample pairs , then we want to have both positive and negative pair pairs

where are their respective sentence vectors. To put it bluntly, we only want the similarity of positive sample pairs to be greater than the similarity of negative sample pairs, and as for how much larger, the model decides for itself. In fact, Spearman, a common measure of semantic similarity, is the same, which relies only on the relative order of predicted results, not on specific values.

For such needs, you can use the formula in Circle Loss theory as a solution:

Simply put, if you want to achieve it eventually, then add an item to the log. Corresponding to our scenario here, we can get the loss function:

where is a hyperparameter. The above formula is essentially a loss function designed for sorting, and the same applies to multi-categorical data, written in a more general form:

That is, as long as we think that the true similarity of the sample to (i,j) should be greater than the true similarity of (k,l), we can add it to the log; In other words, as long as we can design the order for the sample pair, we can use the CoSENT scheme.

For NLI data, it has three labels of "containment", "neutrality" and "contradiction", we can naturally think that the similarity of two "implied" sentences is greater than that of two "neutral" sentences, and the similarity of two "neutral" sentences is greater than that of two "contradictory" sentences, so that based on these three labels, the NLI sentences can be sorted in pairs. With this sorting, NLI data can also be trained with CoSENT. Similarly, data such as STS-B, which is itself scoring, is more applicable to CoSENT, because the scoring label itself is sorting information.

Summary of the work of the next part

SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples

Aiming at the problem that the model "cannot distinguish between text similarity and semantic similarity, and prefers to have similar texts without considering the actual semantic differences", a scheme of "explicitly adding negative words to generate soft negative samples" combined with "two-way marginal loss" is proposed.

EASE: Entity-Aware Contrastive Learning of Sentence Embedding

Emphasize the importance of entities in sentence vector representation. At the data level, positive and negative entities are used instead of positive and negative samples.

CLAIF：Improving Contrastive Learning of Sentence Embeddings from AI Feedback

Aiming at the lack of fine-grained supervised signals in the training process, that is, without considering the similarity differences between positive sample pairs, AI feedback from LLM is introduced to construct sample pairs with different similarity, and fine-grained similarity scores are given to these sample pairs as supervised signals to help the learning of text representation.

PromptBERT

PromptBERT is another classic in the field of sentence embedding after SimCSE.

At the heart of this work is the idea of using Prompt to generate sentence representations. The author believes that the poor performance of native BERT is mainly due to bias caused by tokens such as word frequency, capitalization, and subword, and BERT itself has not corrected this problem at all layers of the transformer. By using prompt, the knowledge in each layer of BERT can be used more effectively, and if embedding is expressed by [MASK], it is possible to avoid averaging various tokens as before, thus avoiding the bias of token introduction.

The core idea of this working method is also relatively simple, divided into two steps:

Use Prompt to generate sentence representations, such as [X] means [MASK], [X] is the input sentence, [MASK] is the output representation, use this as a sentence representation
Use different Prompt templates to generate contrastive learning angles, and continue to train in a self-supervised way

Instrcutor Embedding

According to OpenAI's paper "Text and Code Embeddings by Contrastive Pre-Training," text similarity and semantic retrieval are two different tasks, and training goals may conflict. As training progresses, if the model is likely to perform better on the semantic search task, the worse it may perform on the sentence similarity task. At the same time, existing Embedding models often perform poorly when faced with new tasks and new areas.

And our ideal Embedding should obviously have multiple capabilities at the same time. How can the Embedding model be adapted to multiple tasks at the same time and generalize in new areas?

Instrcutor Embedding has devised a new method of text embedding based on instruction fine-tuning: stitching instructions (containing task and domain information) that explain use cases before text input. Instrctor Embedding hand-wrote task instructions for 330 text embedding datasets during training and evaluated INSTRUCTOR on 70 embedding evaluation tasks (64 of which were not seen during training), ranging from classification and information retrieval to semantic text similarity and text generation evaluation, achieving a good performance overall.

Fourth, Embedding related exploration and thinking

The previous chapter teased out the representative work of Sentence Embedding based on BERT-like models. In fact, it seems reasonable that BERT-like models that use the two-way attention mechanism are good at content understanding tasks. But the good effect of the OpenAI Embedding model and OpenAI's insistence on the Decoder-Only architecture model, as well as the rapid development of large models in the past six months, make us wonder whether the large model of Decoder-only is possible to surprise us in the Embedding task?

In this regard, we have made some attempts to explore classes. In the process of exploration, we hope to clarify two problems:

Are BERT-like models really better suited to Embedding tasks than Decoder-Only models?
Is a bigger model better for Embedding tasks?

Finally, after our exploration of the decoder-only model padding mode, pooling method, and the degree of anisotropy of different layers, the final conclusion is relatively consistent with the current partial public conclusion.

In response to the first question, "How Contextual are Contextualized Word Representations?" Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings" paper carried out an exploration experiment comparing the effects of different layers of BERT and GPT, and the experimental effects are as follows:

Based on the table above, you can find:

At different layers, the overall effect of BERT is significantly better than GPT
GPT-2 last layer is more anisotropic, and the middle or lower layer is more suitable for similarity tasks than the top layer

In response to the second problem, the Instructor Embedding paper also gives the effect comparison experiment of different parameter quantity models, as shown in the following table: