
Generate Learned Sparse Embeddings With BGE-M3


BGE-M3 is an ML model that creates learned sparse embeddings, combining the precision of keyword matching with the semantic richness of dense embeddings for advanced natural language processing.

Translated from Generate Learned Sparse Embeddings With BGE-M3, by Stephen Batifol.

When building retrieval for LLM applications, developers face a choice between two embedding methods: traditional sparse embeddings and dense embeddings. Sparse embeddings are well suited to keyword matching. They are common in natural language processing (NLP): high-dimensional vectors in which most values are zero. Each dimension corresponds to a token from one (or more) languages, and a non-zero value indicates how relevant that token is to a particular document.
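To make the contrast concrete, here is a minimal sketch of how the two representations are typically stored; the token indices and weights below are made up for illustration:

```python
# A sparse embedding is mostly zeros, so it is usually stored as a
# {dimension_index: weight} map rather than a full vector.
# The indices and weights here are hypothetical.
sparse_embedding = {
    1012: 0.83,  # e.g., the token "vector"
    4521: 0.41,  # e.g., the token "database"
    9801: 0.12,  # e.g., the token "search"
}  # every other dimension (often 30,000+) is implicitly 0.0

# A dense embedding has far fewer dimensions, all of them populated.
dense_embedding = [0.021, -0.173, 0.094, 0.311, -0.045]  # typically 384-1024 dims
```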

Dense embeddings, on the other hand, have far fewer dimensions, but none of their values are zero. As the name suggests, dense embeddings pack information into every dimension. This makes them ideal for semantic search tasks, where the goal is to match the "meaning" of a query rather than an exact string.

BGE-M3 is a machine learning model used to create an advanced type of embedding called a "learned sparse embedding." The advantage of learned sparse embeddings is that they combine the precision of sparse embeddings with the semantic richness of dense embeddings. Starting from the tokens in a sparse embedding, the model learns which other tokens are likely to be relevant or correlated, even if they never appeared in the original search string. The result is an embedding packed with relevant information.

Understanding BERT

Bidirectional Encoder Representations from Transformers (BERT) is more than a surface-level technique. It is the underlying architecture that powers advanced machine learning models such as BGE-M3 and SPLADE.

BERT processes text differently from traditional models. Instead of reading a text string strictly in order, it examines everything at once and considers the relationships between all of the components. BERT accomplishes this with a two-pronged approach: two separate pre-training tasks whose outputs work together to enrich the meaning of the input.

  1. Masked Language Modeling (MLM): First, BERT randomly hides a portion of the input tokens. It then asks the model to figure out which candidates make sense for the hidden portion. Doing so requires understanding not only the relationships implied by word order, but also how that order affects meaning. (A minimal sketch of MLM follows this list.)
  2. Next Sentence Prediction (NSP): While MLM works mostly at the sentence level, NSP zooms out further. This task checks whether sentences and paragraphs flow logically, so the model learns to predict what makes sense in these broader contexts.
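As a minimal sketch of MLM in action, the Hugging Face transformers library exposes a fill-mask pipeline for pre-trained BERT. Here we mask one token of our example sentence and let the model propose candidates; the exact scores and suggestions will vary:

```python
from transformers import pipeline

# Load a pre-trained BERT model for masked-token prediction.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Hide one token and ask BERT which words plausibly fill the gap.
predictions = unmasker(
    "Milvus is a vector [MASK] built for scalable similarity search."
)

for p in predictions:
    print(f"{p['token_str']!r}: {p['score']:.3f}")
# Plausible completions such as 'database' or 'engine' should rank highly.
```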

When a BERT model analyzes a query, the input passes through a stack of encoder layers. Each layer produces its own refined representation, building on the output of the layer before it. The result is a richer, more robust representation of the input.

It is important to understand BERT's capabilities because BGE-M3 is built on top of it. The following example illustrates how BERT works.

BERT in action

Let's take a basic query as an example and see how BERT creates embeddings from it:

Milvus is a vector database built for scalable similarity search.

The first step is to convert the words in the query string into tokens.

[Figure: the query split into tokens, with [CLS] and [SEP] added]

You'll notice that the model adds [CLS] at the beginning of the token sequence and [SEP] at the end. These are special tokens that mark the start of the sequence and the sentence boundary, respectively.
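A minimal sketch of this step with the Hugging Face transformers tokenizer; the exact sub-word splits below are indicative and depend on the vocabulary used:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "Milvus is a vector database built for scalable similarity search."
)

# Map the numeric token IDs back to readable tokens.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Example output (out-of-vocabulary words get split into sub-words):
# ['[CLS]', 'mil', '##vus', 'is', 'a', 'vector', 'database', 'built',
#  'for', ..., 'similarity', 'search', '.', '[SEP]']
```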

Next, the tokens need to be converted into embeddings.

[Figure: token, positional, and segment embeddings summed for each token]

The first part of this process is token embedding, where an embedding matrix converts each token into a vector. Next, BERT adds positional embeddings, because word order matters and these embeddings preserve each token's relative position. Finally, segment embeddings simply track the boundaries between sentences.
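In the transformers library, these three components are visible on the model itself. A minimal sketch; the attribute names are specific to that library's BERT implementation:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(
    "Milvus is a vector database built for scalable similarity search.",
    return_tensors="pt",
)

emb = model.embeddings
token_vecs = emb.word_embeddings(inputs["input_ids"])  # token embeddings
positions = torch.arange(inputs["input_ids"].shape[1]).unsqueeze(0)
pos_vecs = emb.position_embeddings(positions)  # positional embeddings
seg_vecs = emb.token_type_embeddings(inputs["token_type_ids"])  # segment embeddings

# BERT sums the three (then applies LayerNorm and dropout internally).
combined = token_vecs + pos_vecs + seg_vecs
print(combined.shape)  # (1, sequence_length, 768) for bert-base
```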

At this stage, the embeddings are still shallow, context-free representations. To enrich them with context, they pass through multiple encoder layers. Just as the pre-training tasks identified above each contribute on their own, each encoder layer contributes its own refinement: the embeddings are modified at every pass, and the other tokens in the sequence provide important context for the representation each encoder produces.

Once this process is complete, the final output is far denser with contextual information than the pre-encoder output. This is especially true when a single token (such as [CLS]) is used for further processing or for tasks that require a single dense representation.
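A minimal sketch of the full pass, again with transformers:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer(
    "Milvus is a vector database built for scalable similarity search.",
    return_tensors="pt",
)

# Run the full encoder stack to obtain contextualized token embeddings.
with torch.no_grad():
    outputs = model(**inputs)

contextual = outputs.last_hidden_state
print(contextual.shape)  # (1, sequence_length, 768): one dense vector per token

# The [CLS] vector at position 0 is often used as a single dense
# representation of the entire input.
cls_vector = contextual[0, 0]
```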

BGE-M3 enters the chat

BERT provides us with dense embeddings, but the goal here is to generate learned sparse embeddings. So now we can finally get our hands on the BGE-M3 model.

BGE-M3 is essentially an advanced machine learning model that pushes BERT further by focusing on multi-functionality, multi-linguality, and multi-granularity in its text representations. In other words, it doesn't just create dense embeddings: it generates learned sparse embeddings that offer the best of both worlds, word meaning and precise word choice.
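The model's reference implementation ships in the FlagEmbedding package. Below is a minimal sketch of generating both embedding types in one pass, following that package's documented encode interface; output shapes are indicative:

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentences = ["Milvus is a vector database built for scalable similarity search."]

# Request both dense and learned sparse outputs in a single call.
output = model.encode(sentences, return_dense=True, return_sparse=True)

print(output["dense_vecs"][0].shape)  # e.g., (1024,): the dense vector
print(output["lexical_weights"][0])   # {token_id: weight}: the learned sparse map
```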

Practical application of BGE-M3

Let's start with the same query we used to understand BERT. Running it produces the same sequence of contextualized embeddings seen above. We can call this output Q.

[Figure: the contextualized embedding sequence Q produced by the encoder]

The BGE-M3 model digs into these embeddings and attempts to understand the importance of each token at a finer-grained level. There are several aspects to this, with a sketch of the computation after the list.

  • Token importance estimation: BGE-M3 does not treat the [CLS] token representation Q[0] as the only possible representation. It also evaluates the contextualized embedding Q[i] of every token in the sequence.
  • Linear transformation: The model takes the BERT output and applies a linear layer to create importance weights for each token. We can call the set of weights BGE-M3 learns W_{lex}.
  • Activation function: BGE-M3 then applies the Rectified Linear Unit (ReLU) activation to the product of W_{lex} and Q[i] to compute the term weight w_{t} for each token: w_{t} = ReLU(W_{lex} · Q[i]). Using ReLU ensures that term weights are non-negative, which contributes to the sparsity of the embedding.
  • Learned sparse embedding: The final output is a sparse embedding in which each token carries a weight indicating its importance to the original input string.
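Here is a minimal numerical sketch of that computation using NumPy; the dimensions and values are made up, and the real model learns W_{lex} during training:

```python
import numpy as np

rng = np.random.default_rng(0)

hidden_dim = 768  # size of each contextualized embedding Q[i]
seq_len = 15      # number of tokens in the sequence

Q = rng.normal(size=(seq_len, hidden_dim))  # stand-in for BERT's output
W_lex = rng.normal(size=(hidden_dim,))      # stand-in for the learned linear layer

# w_t = ReLU(W_lex . Q[i]) for each token i.
raw_scores = Q @ W_lex
term_weights = np.maximum(raw_scores, 0.0)  # ReLU keeps weights non-negative

# Negative scores become exactly 0, which is what makes the result sparse.
print(term_weights)
```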

BGE-M3 in real-world applications

Applying the BGE-M3 model to real-world use cases helps demonstrate its value. These are areas where organizations can benefit from the model's ability to understand linguistic nuance across large amounts of textual data.

Customer support automation - chatbots and virtual assistants

You can use BGE-M3 to power chatbots and virtual assistants, significantly enhancing your customer support services. These chatbots can handle a variety of customer queries, provide instant responses, and understand complex questions and contextual information. They can also learn from interactions and improve over time.

Advantages:

  • Round-the-clock availability: Customers get support at any hour.
  • Cost-effective: Reduces the need for a large customer support team.
  • Improved customer experience: Fast, accurate responses increase customer satisfaction.
  • Scalability: Handles large volumes of queries simultaneously, ensuring consistent service during peak hours.

Content generation and management for marketing and media

You can utilize BGE-M3 to generate high-quality content for blogs, social media, advertising, and more. It can create articles, social media posts, and even full reports based on the desired tone, style, and context. You can also use this model to summarize long-form documents, create summaries, and generate product descriptions.

Advantages:

  • Efficiency: Generate large amounts of content quickly.
  • Consistency: Maintain a consistent tone and style across different pieces of content.
  • Reduce costs: Reduce the need for large content creation teams.
  • Creativity: Helps in brainstorming and generating creative content ideas.

Medical Data Analysis - Clinical Documentation and Analysis

Developers in healthcare can use BGE-M3 to analyze clinical documents and patient records, extract relevant information, and help generate comprehensive medical reports. It can also help identify trends and insights from large volumes of medical data to support better patient care and research.

Advantages:

  • Save time: Reduce the amount of time healthcare professionals spend on documentation.
  • Accuracy: Improve the accuracy of medical records and reports.
  • Insight generation: Identify patterns and trends that can inform better clinical decision-making.
  • Compliance: Helps ensure that documentation meets regulatory standards.

Conclusion

The BGE-M3 model offers a high degree of versatility and advanced natural language processing capabilities. Applied across industries and sectors, it can significantly improve operational efficiency and quality of service.
