From pre-training to fine-tuning: Nextdoor's path to effective embedding applications

Background

Most of Nextdoor's ML models are driven by a large number of features, mostly continuous or categorical. Personalization features are typically derived from historical or real-time aggregations of interaction signals captured through logged tracking events. However, representing content through a deep understanding of the information (text/images) behind it is essential for modeling subtle user signals and better personalizing the complex user behavior across many of our products. In the rapidly evolving field of NLP, leveraging Transformer models for effective and efficient representation learning is increasingly important for understanding users and improving their product experience.

To do this, we've built a number of entity embedding models covering entities such as posts, comments, users, search queries, and taxonomy information. We start from a deep understanding of content, then derive embeddings for meta-entities such as users based on what each user has interacted with in the past. These powerful representations are essential for extracting meaningful features for some of the largest ML ranking systems at Nextdoor, such as notification scoring and feed ranking. By making them readily available and building them at scale, we can reliably drive the adoption of state-of-the-art techniques and put them in the hands of ML engineers to quickly build high-performing models across the company.

This blog focuses on how we iteratively developed embedding models, how we featurized them and applied them to various product applications at scale, and some of the challenges encountered along the way. We have organized the work in three parts. Part 1 focuses on leveraging state-of-the-art pre-trained models to quickly evaluate the value of embedding models as feature extractors. Part 2 describes how we fine-tune embeddings for certain products using unlabeled data, while Part 3 demonstrates how we fine-tune embeddings using labeled data for better task prediction. This work is driven by Nextdoor's Knowledge Graph team, a horizontal team that works closely with the product ML teams as well as the ML Platform team, which owns the ML training and serving platform and the FeatureStore service that powers Nextdoor's ML models.

1. Leveraging pre-trained models

The first generation of embeddings was built from a pre-trained language model using the Sentence-BERT paradigm (https://www.sbert.net/). SBERT is known to produce better embedding representations than the original BERT model [1]. The main goal here was to experiment with embeddings as features and realize their value in the product as quickly as possible. The text of content entities (i.e., Nextdoor posts and comments) is extracted from the subject and body of the post and from the comment text, respectively, and fed into a multilingual text embedding model to derive the corresponding entity embeddings for all the countries where Nextdoor operates. For a given user, the embeddings of the posts they have historically interacted with are aggregated with weights based on the type of interaction to form the user's (interaction) embedding. For example, active interactions such as post creation, comments, and clicks carry higher weight than passive interactions such as impressions. These signals are aggregated across both online (feed) and offline (email) product surfaces to represent the user embedding as a whole, and are updated daily for all users on the platform.
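
As a rough illustration, the interaction-weighted aggregation could look like the sketch below. The weight values and interaction types are illustrative assumptions, not Nextdoor's actual configuration:

```python
import numpy as np

# Hypothetical weights: active signals (creates, comments, clicks) count
# more than passive impressions. The real weights are not published.
INTERACTION_WEIGHTS = {"create": 5.0, "comment": 4.0, "click": 2.0, "impression": 0.5}

def user_embedding(interactions: list[tuple[str, np.ndarray]]) -> np.ndarray:
    """Weighted average of the embeddings of posts the user interacted with."""
    weights = np.array([INTERACTION_WEIGHTS[kind] for kind, _ in interactions])
    posts = np.stack([emb for _, emb in interactions])
    agg = (weights[:, None] * posts).sum(axis=0) / weights.sum()
    return agg / np.linalg.norm(agg)  # L2-normalize for cosine similarity downstream
```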

These features turned out to be among the most important in several ranking models and significantly improved key product OKR metrics in notifications and feeds when they launched in early 2022. The pre-trained model also served as a good proof of concept for building reliable feature extraction pipelines and a monitoring system that catches potential feature drift and outages. This helped form a robust playbook for deploying the next generations of embedding features.

2. Fine-tuning embeddings with unlabeled data

The next generation of embeddings involved training custom models that improve on the pre-trained versions through fine-tuning. The signals previously used to generate embeddings came directly or indirectly from users' interactions with the Notifications and Home Activity products. In contrast, this section details a use case for representation learning from unlabeled data to improve the user search experience.

Our neighbors use Nextdoor search to find useful local information by explicitly expressing intent. We try to capture both long-term and short-term intent to identify and meet users' long-term needs (e.g., home maintenance) as well as short-term needs (e.g., lost and found). Search queries, while rich in intent, are short and noisy by nature. Searchers may try several variations of a query in a row to best express their intent. Additionally, given the nature of local search, relying on labeled feedback on search results may not fully capture user intent due to limited liquidity.

To fully capture user intent signals, we rely on a self-supervised training strategy to learn a fine-tuned representation of any given query. Specifically, we first build an SBERT-powered query embedding model that learns to embed search queries in a low-dimensional space. We then aggregate the embeddings of a user's queries over different time windows (weekly/monthly/quarterly) to generate multiple user (intent) embeddings. The same model also extracts the intent of posts to generate corresponding post embeddings. The resulting user, post, and query embeddings are transformed and featurized as described in later sections to improve the performance of ranking models.
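
A minimal sketch of the per-window aggregation, assuming each user's queries arrive as (timestamp, embedding) pairs; the window lengths and L2 normalization are illustrative assumptions:

```python
import numpy as np
from datetime import timedelta

# Lookback windows from the post; the exact day counts are illustrative.
WINDOWS_DAYS = {"weekly": 7, "monthly": 30, "quarterly": 90}

def intent_embeddings(query_log, now):
    """query_log: list of (timestamp, embedding) pairs for one user's queries.
    Returns one aggregated intent embedding per lookback window (None if empty)."""
    result = {}
    for name, days in WINDOWS_DAYS.items():
        cutoff = now - timedelta(days=days)
        recent = [emb for ts, emb in query_log if ts >= cutoff]
        if not recent:
            result[name] = None
            continue
        mean = np.mean(recent, axis=0)
        result[name] = mean / np.linalg.norm(mean)  # L2-normalize
    return result
```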

The query embedding model was originally built to power contextual query expansion in the Nextdoor search pipeline [2]. This sentence-transformer model is trained on historical search queries to best learn query representations. We started by collecting search logs consisting of sequences of in-session search queries from all searchers over a period of time. These are then preprocessed using traditional NLP methods such as lemmatization, spell checking, and deduplication to form a clean token corpus consisting of n-grams (n=1,2,3) and whole queries. To generate the training dataset, we created positive pairs from tokens that appear within the same user search session and negative pairs from tokens sampled randomly across sessions. Contrastive learning with a cosine similarity loss is used to train the underlying model.
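
With the sentence-transformers library, the contrastive fine-tuning described above can be sketched as follows. The base checkpoint, the example pairs, and the hyperparameters are assumptions, since the post doesn't name them:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Multilingual base model (assumed; the actual checkpoint is not named).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Positive pairs: tokens/queries co-occurring in the same search session.
# Negative pairs: tokens sampled randomly across sessions. The label is the
# target cosine similarity for the contrastive objective.
train_examples = [
    InputExample(texts=["gutter cleaning", "roof repair"], label=1.0),  # same session
    InputExample(texts=["gutter cleaning", "lost cat"], label=0.0),    # random pair
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=100)
```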

For the query expansion use case, this model improves recall by identifying relevant candidates, resulting in better contextual search results. Not only did this improve key search metrics for content search and For Sale & Free product search, it also significantly reduced the null query rate compared to the previous word embedding model. We also use the approximate nearest neighbor library hnswlib [3] to serve this deep-learning-based query expansion, which improved expansion latency by more than 10x. For the notification and feed use cases, the intent features generated from the post and user embedding transformations had a significant positive impact on our topline engagement metrics. Although these features can only be computed for searchers and overall coverage is low, this explicit signal proved very useful for improving the overall search experience.
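
A minimal hnswlib sketch of the nearest-neighbor lookup behind this kind of query expansion; the dimensionality and index parameters (M, ef_construction, ef) are illustrative, not Nextdoor's tuned values:

```python
import hnswlib
import numpy as np

dim = 384  # embedding dimensionality (illustrative)
query_embs = np.random.rand(10_000, dim).astype("float32")  # historical query embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=len(query_embs), ef_construction=200, M=16)
index.add_items(query_embs, np.arange(len(query_embs)))
index.set_ef(50)  # trades recall for latency at query time

# Expand an incoming query with its nearest historical queries.
labels, distances = index.knn_query(query_embs[0:1], k=10)
```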

3. Fine-tuning embeddings with labeled feedback

In the next evolution of embeddings, we also use user feedback to further fine-tune the models. Pre-trained entity embeddings had been working for us for over a year, but they are out-of-the-box models trained on public benchmark datasets. As a result, their semantics differ fundamentally from the Nextdoor domain. In addition, their high but fixed dimensionality results in significant storage and serving costs, especially when user embeddings are updated daily for all Nextdoor neighbors. To address these issues, we built a two-tower framework that uses user feedback gathered on Nextdoor surfaces to fine-tune the embeddings while reducing their dimensionality, customizing them to our domain and improving cost-effectiveness.

The fine-tuned models are developed and trained in phases of increasing complexity. In the first phase, the inputs to the post and user towers are pre-trained embeddings, which are transformed through multiple fully connected (FC) layers that reduce the dimensionality at each step. Standard cross-entropy is used as the loss function to predict the notification click task for a given user and post. To generate the training dataset, we sample from random exploration logs to reduce selection bias, the same process used for the downstream ranking models. Once the model is fully trained, the final layer yields the fine-tuned representations of users and posts.
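
A PyTorch sketch of this first-phase two-tower setup, assuming a dot-product scoring head and illustrative layer sizes (the post specifies FC layers with stepwise dimensionality reduction and a cross-entropy click objective, but not the exact architecture):

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """FC stack stepping a pre-trained embedding down in dimension."""
    def __init__(self, dims=(768, 256, 64)):  # sizes are assumptions
        super().__init__()
        layers = []
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        self.net = nn.Sequential(*layers[:-1])  # no activation on the output layer

    def forward(self, x):
        return self.net(x)

class TwoTower(nn.Module):
    def __init__(self):
        super().__init__()
        self.user_tower = Tower()
        self.post_tower = Tower()

    def forward(self, user_emb, post_emb):
        u = self.user_tower(user_emb)
        p = self.post_tower(post_emb)
        return (u * p).sum(dim=-1)  # click logit

model = TwoTower()
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy over click labels
logits = model(torch.randn(8, 768), torch.randn(8, 768))
loss = loss_fn(logits, torch.randint(0, 2, (8,)).float())
```

The final tower outputs (u and p above) are the fine-tuned user and post representations that get cached as features.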

These PyTorch models are trained on millions of records using SageMaker GPU instances with different hyperparameters, and the model with the best offline performance is selected to generate the fine-tuned embeddings and store them in the FeatureStore. Offline and online feature pipelines are built and monitored following the playbook described earlier. Feeding these cached features to downstream models produced a welcome boost in all engagement metrics (CTR/sessions/contribution/DAU/WAU) while keeping neutral the guardrail metrics that measure the distribution of harmful content across the platform.

In the next phase, we feed the text extracted from the post entity directly into the post tower so that we can fine-tune the parameters of the SBERT model itself. The test AUC score is used as the benchmark for trying different training scenarios, determining how many layers and transformer blocks to unfreeze, and optimizing the usual DNN hyperparameters. The best model also improved the user-post cosine similarity of the fine-tuned embeddings by 16% over the corresponding pre-trained version, another evaluation criterion characterizing the intrinsic quality of the improvements. It's also worth noting that this quality improvement was achieved with a more than 10x reduction in dimensionality!
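
Unfreezing only the top transformer blocks of an SBERT model can be done along these lines; the checkpoint and the number of unfrozen blocks are assumptions to be tuned against test AUC, as described above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint
bert = model[0].auto_model  # underlying Hugging Face transformer

# Freeze everything, then unfreeze only the top N encoder blocks.
for param in bert.parameters():
    param.requires_grad = False
N = 2  # unfreeze depth: a hyperparameter swept against test AUC
for layer in bert.encoder.layer[-N:]:
    for param in layer.parameters():
        param.requires_grad = True
```

In the two-tower setup, this partially unfrozen SBERT would serve as the post tower's text encoder, with the unfreeze depth swept alongside the usual DNN hyperparameters.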

In the most recent phase, we expanded the model into a multi-task learning (MTL) setup that includes both notification clicks and feed actions to jointly optimize the fine-tuned embeddings. Again, these objectives closely mimic the downstream rankers to ensure the learned embeddings directly optimize the downstream tasks. Another advantage of the MTL model is that a single model can be learned across multiple product surfaces, reducing operational burden and maintenance costs while leveraging knowledge transfer across the shared tasks for better representations. The feed and notification surfaces are highly related, as clicking an email notification takes the user directly to the post's permalink view in the newsfeed. In addition, most of the actions on the home feed are used as features in the notification ranker, making these tasks closely connected.
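
One plausible shape for this MTL extension is a shared pair of towers with one prediction head per task; the task names, layer sizes, and unweighted loss sum below are assumptions:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    def __init__(self, d_in=768, d_out=64):  # sizes are assumptions
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(), nn.Linear(256, d_out))
    def forward(self, x):
        return self.net(x)

class MultiTaskTwoTower(nn.Module):
    """Shared user/post towers with one head per engagement task."""
    def __init__(self, tasks=("notification_click", "feed_action")):
        super().__init__()
        self.user_tower, self.post_tower = Tower(), Tower()
        self.heads = nn.ModuleDict({t: nn.Linear(128, 1) for t in tasks})

    def forward(self, user_emb, post_emb):
        joint = torch.cat([self.user_tower(user_emb), self.post_tower(post_emb)], dim=-1)
        return {task: head(joint).squeeze(-1) for task, head in self.heads.items()}

# Total loss: sum of per-task binary cross-entropies (weighting is a design choice).
model = MultiTaskTwoTower()
outs = model(torch.randn(8, 768), torch.randn(8, 768))
loss = sum(nn.functional.binary_cross_entropy_with_logits(
    logits, torch.randint(0, 2, (8,)).float()) for logits in outs.values())
```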

Using embeddings in ML models

Since most of our downstream production models are tree-based, they cannot directly consume vector features such as deep neural network embeddings. We therefore mainly use the embedding models as feature extractors for downstream models. Specifically, we rely on transformations such as cosine similarity and dot product between entity embeddings to generate meaningful affinity features. While a transition to neural network systems is currently underway, these vector transformations provide a convenient way to integrate embedding-based features into existing models and allow rapid experiments to evaluate the performance gains from new deep features.
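
The derived affinity features reduce to simple scalar transformations; the embeddings below are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
viewer_emb, post_emb = rng.standard_normal(64), rng.standard_normal(64)  # stand-ins

def affinity_features(a: np.ndarray, b: np.ndarray) -> dict:
    """Scalar affinity features a tree-based ranker can consume directly."""
    dot = float(np.dot(a, b))
    cos = dot / float(np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return {"dot_product": dot, "cosine_similarity": cos}

# e.g., viewer-to-post affinity for feed ranking
features = affinity_features(viewer_emb, post_emb)
```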

We start by creating a schema and declaring the feature groups corresponding to each embedding hosted in our internal FeatureStore. Content-based embedding features are then ingested into the FeatureStore in near real time by task worker jobs whenever content is created or updated. For users, daily scheduled Airflow jobs compute the embedding aggregations over a pre-specified lookback window, weighting the various interaction types, and bulk-ingest them into the FeatureStore. Once the system is set up to ingest all relevant embeddings with appropriate TTLs, we add feature-logging code to compute and record derived features such as user-to-post and user-to-user cosine similarity and dot product. Concretely, in the feed these features represent post-to-viewer and viewer-to-author affinities, while in the notification world they represent post-to-recipient and sender-to-recipient affinities. Similarly, we compute affinities between commenting entities to inform activity-based ranking in the newsfeed. The data captured by feature logging is used to train the downstream ML ranking models to avoid online/offline skew, and the models that best improve offline performance using the new features are promoted to online A/B test evaluation, further enhancing the member experience.
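
A hypothetical Airflow sketch of the daily user aggregation job; the DAG id, schedule, and FeatureStore write step are placeholders, since the actual pipeline and FeatureStore API are internal to Nextdoor:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def aggregate_user_embeddings(**context):
    # 1. Read each user's interactions within the lookback window.
    # 2. Weight post embeddings by interaction type and average (see Part 1).
    # 3. Bulk-write the aggregated user embeddings to the FeatureStore with a TTL.
    ...

with DAG(
    dag_id="user_embedding_daily_aggregation",  # placeholder name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="aggregate_and_ingest",
        python_callable=aggregate_user_embeddings,
    )
```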

Challenges and the future

Multiple entity embeddings (i.e., users, posts, comments, queries, etc.) have been successfully integrated into Nextdoor's various product surfaces at scale. In the past, comment-based embedding models have helped facilitate and foster kinder conversations, lifting platform vitality metrics [4]. More recently, contextual topic embeddings have been developed using BERTopic [5] to enable coarser-grained content personalization for neighbors while informing us about the popularity of content categories and types on the platform. We are also experimenting with image embeddings using CLIP [6] to take advantage of the image/video information behind content.

In addition, as an extension of the labeled fine-tuning, we plan to further improve the representations along two dimensions. One is to leverage multimodal and dense signals by concatenating the representations with additional features such as image embeddings and existing interaction features. The other is to extend the tasks to other surfaces, such as Ads and For Sale & Free (marketplace), to make the representations more comprehensive across products. Once the downstream models are fully modernized to DNN-based approaches, embeddings can be integrated directly into the models without losing any information to the transformation computations.

As we build more and more embeddings to capture different signals, we also need to be mindful of the additional costs that come with new features. Performing embedding transformations directly in the FeatureStore, rather than passing embeddings between microservices, alleviated some of the initial challenges of serving high-dimensional vectors at inference time, minimizing network bandwidth and scaling costs. This works well for tree-based models, but serving embeddings directly to DNN models in the future may increase costs. Caching and serving fine-tuned embeddings helps keep the dimensionality under control while incorporating domain-specific knowledge. This allows us to experiment quickly and assess ROI at a smaller scale to justify the overall cost. From an infrastructure perspective, we found that optimizing the payload format of embedding features and batching the calls that read/write from the FeatureStore can significantly reduce overall cost.

Acknowledgments

This work would not have been possible without close collaboration with our various ML product partners (Notifications/Feeds/Search/Vitality) and the strong support of the ML Platform team behind the ML platform and FeatureStore services. I would like to take this opportunity to express my sincere gratitude to all of the Nextdoor employees on these teams who contributed to this work.

Nextdoor is building the world's largest Local Knowledge Graph (LKG). The local knowledge graph inherent in our communities is proprietary data unique to Nextdoor that can be used to enable personalized community and neighborhood experiences. The Knowledge Graph team focuses on understanding neighbors and content by using state-of-the-art machine learning methods to create standardized neighbor/content data.

Third-party large language models (LLMs), such as GPT, and the conversational applications built on top of them, such as ChatGPT, do not have access to Nextdoor's specific local knowledge. As a result, they cannot provide location-based services to users the way we would like. It was therefore critical for us to develop in-house custom LLMs that leverage our unique local knowledge graph. We're building our own LLMs, grounded in Nextdoor's raw content and structured knowledge graph, to power multiple products.

If you're interested in learning more, please contact us – we're hiring!

References

[1] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. https://arxiv.org/abs/1908.10084

[2] https://engblog.nextdoor.com/modernizing-our-search-stack-6a56ab87db4e

[3] https://github.com/nmslib/hnswlib

[4] https://engblog.nextdoor.com/using-predictive-technology-to-foster-constructive-conversations-4af437942bd4

[5] https://maartengr.github.io/BERTopic/api/bertopic.html

[6] https://openai.com/research/clip

Author: Kartiq Jayasurya

Source: https://engblog.nextdoor.com/from-pre-trained-to-fine-tuned-nextdoors-path-to-effective-embedding-applications-3a13b56d91aa
