As a platform that integrates content sharing, community interaction, and e-commerce, Xiaohongshu has made search an important channel through which people obtain information and make consumption decisions. With the diversification of user needs and the rapid growth of advertising material, Xiaohongshu's search ad recall system faces multiple business and technical challenges.
This article examines Xiaohongshu's search-ad recall practice and thinking during its business growth stage: how the team, through recall water-level analysis and strategy evolution, achieves efficient ad distribution while safeguarding user experience, building data loops, and refining complex recall models.
In addition, large-model techniques such as common-sense reasoning over complex queries and LLM-based representations have brought new breakthroughs to the recall system. In the era of large models, a transformation of the search technology stack is inevitable, and the Xiaohongshu search-ad recall team, in collaboration with the computing engine team, is preparing for that future.
Xiaohongshu is a platform that combines high-quality content sharing, a diverse and lively community atmosphere, and a path from product seeding ("grass planting") to e-commerce purchase; for many young consumers it has become an "encyclopedia of life" and a "portal for consumption decisions." Its built-in search is a general-purpose search engine that accepts a wide variety of query terms and imposes high requirements on relevance, content quality, and user experience. In monetizing search, Xiaohongshu does not treat commercialization as a tax on user experience or content quality. Rather, grounded in an understanding of users' life needs and the satisfaction and matching of merchants' marketing demands, it promotes continuous growth of both supply and demand through a good experience for users and merchants, driving platform revenue upward over the long run.
As the search commercialization recall team, we sit upstream in the serving chain: the recall stage finds, from a huge candidate pool, the set of ads most relevant and efficient for each user search. By solving the technical problems of retrieval, rewriting, expansion, and keyword suggestion between user queries and ad material, it matches rapidly growing user search intent with advertisers' marketing intent, improves the distribution quality and monetization efficiency of commercial content, and balances the controllability and fairness of advertisers' delivery in the auction environment.
Figure 1 Several business technical issues of search ad recall
The business characteristics of Xiaohongshu's search advertising recall are mainly reflected in the following three aspects:
- "Strong semantic constraints" Xiaohongshu search is a general search engine that attaches great importance to the natural language understanding of user queries and content, and strictly restricts semantic relevance and user experience in the process of advertising recall.
- "Rapid growth of materials" Xiaohongshu search advertising has a rapid growth in the scale of materials at this stage, and the requirements for the timeliness and coverage of recalls of new and high-quality materials, and the sequencing and mechanism changes of the post-link, so as to maintain business acumen for recall supply in the growth stage;
- "Coexistence of multiple advertising goals" The coexistence of advertising plans for multiple advertising goals such as clicks, transactions, and lead retention in Xiaohongshu search ads makes it difficult to statically describe the advertising value for recall, and recall should be optimized for the conversion goals delivered by advertisers under semantic and experience constraints to maximize the distribution efficiency of the platform.
Water-level metrics that guide recall iteration
Given the rapid material growth in Xiaohongshu search ads, we clearly defined recall targets and water levels, so that the remaining headroom of the recall and the extent of the Matthew effect are visible at a glance:
- How to observe recall water levels and headroom: to decouple inventory growth from the downstream ranking stages and observe the recall algorithm's capability independently, we focus on three aspects: "recall what should be recalled," "noise correction," and "instant recall."
- "Call to the call": Split the commercial value traffic into three sections: head/waist/tail, and the supply of materials that meet the relevant advertising on the head pan-demand traffic is sufficient, and we focus on effect indicators to strengthen the data cycle of high-value advertising; Because the query terms on the waist and tail are more specific and accurate, the flow rate of materials that can meet the high relevance is greatly reduced compared with the head, which is a strong constraint on the relevance of precise intent search, and the focus is on whether the recall of these highly relevant materials is sufficient;
- "Noise correction": In the stage when the data cycle of the advertising system is not fully established, there are still problems in only responding to the call, and the inaccurate recall will introduce indistinguishable noise to the downstream probabilistic model and reduce the allocation efficiency of the entire advertising system, so the more in the development stage of the post-link model, the more it is necessary to pay attention to the badcase in the recall link, especially the part of these badcases that is not filtered and selected by rough arrangement will directly affect the number of entries in the fine arrangement. It is necessary to do more selection and deviation correction in the recall link before the rough arrangement;
- "Instant recall": at the same time, seeding ads and live-stream ads for new products have strong timeliness requirements, so recall cold-start for new ads must not be slower than the material-testing life cycle, letting advertisers "ramp up quickly at launch, and ramp up quickly again when bids rise."
- "Recall Water Level Dashboard": From the above definition, we get a concise and informative recall water level board. By layering the sampling query by frequency and the offline correlation score of the full inventory, we can calculate the coverage of the highly relevant inventory and the actual recall, the proportion of noise in the recall results that do not meet the correlation, the water level difference between the inventory and the PVR of the actual recall K under different queries, and the recall success rate of the cold-start inventory. This allows us to clearly measure whether the efficiency of the recall algorithm is in terms of material inventory or recall capacity, how much water level is still far from the ideal state, and whether the recall of highly relevant inventory is fair, rather than focusing on the hot advertisements with the strongest Matthew effect. Based on the above recall analysis, we determined the performance improvement goals of supplementing high-relevance ads on the waist-tail query and focusing high-value ads on the head-to-head query.
(The figures above are illustrative only, not real Xiaohongshu data.)
Positioning semantics and efficiency during the material-growth period
Recall affects the ad system through three goals. First, the experience-oriented semantic goal: recall strictly by relevance, from high to low, without considering bids or expected click-through rate, excluding irrelevant ads in advance so that the pre-ranking and fine-ranking stages score with less error. Second, the efficiency goal: as the number of relevance-qualified ads grows far beyond the recall quota, semantics on pan-intent traffic shifts from an optimization objective to a constraint; recall must then also optimize personalization and platform efficiency, excluding in advance low-value candidates with weak bids that are unlikely to bring clicks or conversions, thereby raising auction intensity and traffic monetization efficiency. Finally, the exploration goal of the bidding ecosystem: the ability to find potential ads that match the semantics of the user's search intent but have not yet accumulated many clicks. Especially in a period of rapid material growth, a diverse portfolio of recall strategies is essential to keep the ad auction an open, contestable game.
- "Semantic Exploration" and "Efficiency Undertaking": Xiaohongshu's search advertisers are growing rapidly, and the pool of new ad candidates is expanding rapidly. In addition, when conversion offer ads are dominant, the ad offer is also strongly coupled with the advertising system itself, which may make the system more prone to the Matthew effect. In order to solve the above problems, our solution is to build a data loop, in semantic recall, in a wide range of waist and tail traffic, without considering efficiency factors, so that the relevant advertising candidates are recalled first, and user feedback signal exploration and accumulation; In the efficiency model oriented to clicks and conversions, with the goal of maximizing the value of platform allocation, the data cycle of personalized, high-value advertisements and high-link pass rate advertisements is strengthened on the head traffic of pan-demand searches. In this way, the rapidly expanding new ads can also precipitate data at the waist and tail without long-term support to occupy the competition quota, and in the advertising system link in the growth period, we try to avoid long-term reservation of support quota, support coefficient and other competitive factors, so as to pursue the balance between recall relevance and efficiency;
- "Semantics and Efficiency-Pareto Surface": Semantics and Efficiency-Pareto Surface refers to finding a set of optimal candidate sets that meet all the objective constraints in the case of semantics and efficiency. As shown in the figure, the ads on the Pareto surface are the result of our recall, filtering ads that are below the relevance red line standard, and filtering ads with low conversions and low bids on pan-intent search to get to the bottom of the experience red line and the exploration line. At the system level, we designed a multi-objective fusion and truncation strategy to distinguish the differences between pan-intent and precise intent, and the quota ratio of the recall channel on the precision intent traffic at the waist and tail and the high-relevance advertising retrieval strategy were combined with the high-relevance advertising retrieval strategy to guide the high-relevance but low-value ads to accumulate feedback data on the waist-tail traffic, actively change the exposure distribution, and drive the entire advertising system to optimize in the direction of Better Distribution rather than the Matthew effect.
Fig.2 "Semantic exploration" and "efficiency inheritance" strategies
As water has no constant form, tactics have no constant pattern
Unlike the fine-grained effect iteration of a mature business, Xiaohongshu search advertising is still in its growth period. Especially at growth inflection points, we must ride the trend, match technology choices to the material scale and development stage, and iterate flexibly and nimbly to produce more business value.
- "Trend Value": Some strategies that are not effective may have significant effects as the stage of business development changes, which requires us to look at problems in a trend-oriented way, retain long-term AB experiments, capture changes in the situation in a data-driven way, and re-evaluate value in a timely manner. For example, the introduction of too much e-commerce advertising inventory in the early stage of the recall will reduce the average bidding of the participating queue, resulting in a negative market effect, but with the continuous improvement of e-commerce advertising bidding ability, the problem of bidding water disadvantage disappears, and the introduction of full e-commerce inventory will even bring a positive market effect;
- "Optimization in the time domain dimension": Daily iterative optimization is mainly focused on the AB perspective, and at the same time, optimization opportunities in the time domain cannot be ignored, for example, in the time window of 618 and Double 11, the recall strategy of commodity planting and grass is strengthened, and the recall strategy of travel is strengthened in the time window of May 11 and 11, so as to provide a strong leverage for technology iteration to amplify platform revenue;
- "Progressive refactoring": Driven by business problems, we first use simple solutions to quickly obtain most of the benefits, and then gradually iterate the model capabilities and technical systems to the ideal state and the frontier with business development and technical infrastructure upgrades. For example, in the early stage of the recall, the literal rules, whitelist fetching and blacklist filtering strategies are given priority to quickly solve the problem of rough fine extraction and link blockage in the recall, and then the model is used to solve the deep semantic matching problem. In addition, with the change of the ability of coarse, fine and correlation of the backlink, the effect space of the recall has also shifted from noise correction of the recall results to underfilling and excessive post-link misfiltering, and the strategy has also shifted from accuracy assurance to recall rate and delivery strategy.
Algorithm and computing power in synergy: optimizing both model performance and effect
What makes the recall model unique is that it must optimize retrieval quality under strictly limited response time. It therefore requires close co-optimization of computing power and algorithms with the engineering engine team, improving the model's distributed training speed and the index's retrieval efficiency and freshness. Meanwhile, under LLM-era infrastructure, high-performance GPU sequence operators and low-cost inference have become technical dividends, giving us the opportunity to transition the technology stack gradually, so that large models can bring more interpretable signals, such as knowledge-based reasoning and image aesthetics and style, into the ad system, while generative retrieval and LLM representations bring the promise of scaling laws to recall. In the era of large models, we will continue to cooperate with the computing engine team for mutual benefit, building better computing-power optimization support and a collaborative atmosphere for subsequent technology iteration.
The following outlines the efficiency gains and technical evolution of Xiaohongshu's search-ad recall at different stages of development:
Figure 3 Relationship between recall effect gains and business development stage
- Stage 1, "advertisers express delivery intent themselves": early in the business, brand advertising dominated and relatively little material matched each query. This stage relied mainly on advertisers buying large numbers of keywords themselves, giving highly controllable, highly explainable commercial traffic acquisition. The first version of the recall channel centered on two-stage recall, query rewriting plus inverted-index retrieval: building an inverted index over bidwords, and expanding queries semantically to match more bidwords. This formed the cornerstone of Xiaohongshu's search advertising system;
- Phase 2 "Decoupling the Buying Relationship": With the entry of small and medium-sized businesses, the recall ability is increasingly limited by the insufficient ability of small and medium-sized advertisers to buy words, so it is necessary to break through the strong dependence on advertisers to buy words independently, fill the commercial depression, increase the bidding depth, and move from the user's search behavior directly to the ability of advertising recall.
- In this stage we introduced a vector model targeting relevance retrieval, with BERT as the base: queries and notes are mapped onto the same semantic hypersphere, and recall scores are computed by brute-force matrix multiplication over the (then still small) ad pool. This supplements effective material that matches in relevance but whose advertiser did not buy the corresponding words, driving a wave of gains in auction fill rate and PVR.
- We also optimized keyword suggestion: not only accurately translating advertisers' marketing demands into commercially valuable words, but also improving suggestion efficiency, actively creating matching supply between query rewrites and ad-side suggested words.
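The Stage-2 two-tower vector recall can be sketched minimally as follows, under the assumption (stated above) that queries and ads live on the same unit hypersphere, so full-pool scoring reduces to an inner product. The embeddings here are toy values, not outputs of a real BERT tower.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def recall_topk(query_vec, ad_vecs, k=2):
    q = normalize(query_vec)
    # cosine similarity equals the inner product once both sides are unit-normalized
    scores = {ad: sum(a * b for a, b in zip(q, normalize(v)))
              for ad, v in ad_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

ads = {"camping": [1.0, 0.1], "lipstick": [0.0, 1.0], "tent": [0.9, 0.2]}
print(recall_topk([1.0, 0.0], ads))
```

In production the sorted scan is replaced by an approximate nearest-neighbor index, but the scoring function is the same.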
- Three-stage "multi-objective recall": With the abundance of materials to meet the search relevance of various industries, especially the increase in advertisers using conversion bidding (leads, e-commerce), the number of ad candidates that meet the semantic constraints far exceeds the recall quota, so it is necessary to optimize the ads with high click-through rate (CTR) and high effective cost per thousand impressions (ECPM) from the semantic matching ad set, so as to pursue the rich supply of high-relevance, high-conversion rate and high-value recall, and accelerate the establishment of data cycle of high-quality advertising. At the same time, with the increase in the budget of conversion offer-type advertising, the conversion rate for traffic and final high conversion value (CTCVR) recall has also brought new recall increments.
- The single recall objective of relevance was expanded to relevance, click/conversion, and auction value, realized as two families of recall channels: semantic models handle mid- and long-tail precise-intent queries and cold-start material, while efficiency models handle retargeting, funnel pass rate, and platform revenue efficiency. The efficiency models also need short model-switch intervals to capture recent shifts in the data distribution, so we moved recall models and indexes from daily switching to multiple switches per day.
- The efficiency models cover not only eCPM and CTCVR for final traffic value but also the ranking and strategy preferences of different stages in the ad system, for example correcting for the bias of relevance-admission rules and multi-stage ranking pass rates, and introducing a CTCVR * CPA channel for high-conversion recall, analogous to high-eCPM recall, to improve the consistency of allocation across the ad-system funnel.
- Multi-objective, multi-model ensemble recall increments were completed according to the development stage of different inventory and downstream models. Meanwhile, the amount of material meeting delivery standards varies greatly across industries: a single global static recall quota leads to under-recall where the quota is too small and, where it is too large, introduces indistinguishable noise for the downstream discriminative models.
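As an illustration of the Stage-3 ideas, here is a hedged sketch of the two scoring objectives (eCPM for click bids, CTCVR * CPA for conversion bids) together with a supply-aware per-channel quota in place of one global static quota. All numbers and channel names are hypothetical.

```python
def ecpm(ctr, bid):
    # click-bid ads: expected cost per mille impressions
    return ctr * bid * 1000

def conv_value(ctr, cvr, cpa):
    # conversion-bid ads: CTCVR * CPA, the expected conversion value per impression
    return ctr * cvr * cpa

def dynamic_quota(supply_by_channel, total_quota):
    # split the recall quota in proportion to each channel's qualified supply
    total = sum(supply_by_channel.values())
    return {ch: max(1, round(total_quota * s / total))
            for ch, s in supply_by_channel.items()}

print(ecpm(0.02, 1.5))                      # 2% CTR at a 1.5 bid
print(round(conv_value(0.02, 0.1, 50), 3))  # 2% CTR, 10% CVR, CPA 50
print(dynamic_quota({"semantic": 300, "efficiency": 100}, 40))
```

Tying the quota to qualified supply is one simple way to avoid both under-recall and quota-padding noise described above.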
- Four-stage "Enhanced Retrieval Model and Index": With the rapid ramp-up of the number of materials, the increase in the scale of recall candidates caused serious selection bias problems in the retrieval model at that time, and it was impossible to reliably sort the tail and unseen advertisements, resulting in the recall of some badcases;
- Improving the quality of what the retrieval model learns, especially personalized modeling in the efficiency recall model, brought significant gains, which we detail later.
- To break the limit that the lack of feature interaction places on the expressiveness of inner-product models, we upgraded the recall base model from two towers to multi-layer MLP plus Target Attention, paired with a hierarchical HNSW index, approximating exhaustive full-pool retrieval with far fewer scoring calls. This broke the ceiling of the original vector model and moved model iteration into a new stage;
- The inverted index can also absorb the light-computation advantage of vector retrieval: by quantizing semantic vectors into index terms, the coverage of relevance-qualified material in the inverted channel can be greatly improved at minimal storage and computation cost.
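A toy illustration of that last idea, with a hand-picked "codebook" standing in for a learned quantizer: semantic vectors are snapped to their nearest centroid, and the centroid id becomes an inverted-index key shared by queries and ads.

```python
def quantize(vec, centroids):
    # assign the vector to its nearest centroid id (a one-level "semantic code")
    dists = {cid: sum((a - b) ** 2 for a, b in zip(vec, c))
             for cid, c in centroids.items()}
    return min(dists, key=dists.get)

centroids = {"travel": [1.0, 0.0], "beauty": [0.0, 1.0]}

# build the inverted index: semantic code -> ads
inverted = {}
for ad, vec in {"tent": [0.9, 0.1], "lipstick": [0.1, 0.9], "hotel": [0.8, 0.3]}.items():
    inverted.setdefault(quantize(vec, centroids), []).append(ad)

# the query is quantized into the same code space, then looked up like a term
query_code = quantize([0.7, 0.2], centroids)
print(sorted(inverted.get(query_code, [])))
```

A real system would use a learned codebook with many more levels and centroids, but the lookup mechanics are the same as for any inverted term.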
- Five-stage "Search Paradigm Revolution in the AIGX Era": The autoregressive + instruction alignment scheme of large models breaks through many cognitions in the supervised learning era, and allows algorithms to have outstanding capabilities in inference, multimodal understanding and scaling law. In the search and advertising industry, the most direct face of the subversion of the search form by the large model, the application of large model technology is the lifeline to maintain the future market share of the search business, we have seen the rapid change of cutting-edge work, we have also made some effective attempts;
- Common-sense reasoning to complete query intent: the semantic expansion and annotation capabilities of large models are applied to the rewrite task, using chain-of-thought (CoT) to accurately understand complex query intent and expanding semantic associations to unlock incremental demand, for example expanding POI information for nearby-travel queries using location common sense, or associating education and training needs with a child's age;
- Extracting advertisers' core selling points: some of Xiaohongshu's commercial notes promote products or services implicitly, expressing advertising intent subtly. Through instruction tuning, a large model can extract marketing selling points and structured product descriptions from advertorial text and image modalities, help advertisers reach relevant search intent more accurately via large-model "suggestions," and feed the extracted information into retrieval-model features and the synthesis of long-tail semantic samples;
- Using large models as better encoders: after upgrading the representation encoder from BERT to an LLM, inputs and outputs change from structured features to free-form, unstructured, even natural-language descriptions, and vector retrieval quality follows a scaling law in model parameters and data scale, which is very exciting. Through prompt tuning and parameter-efficient fine-tuning, we align a deeply semantic LLM to Xiaohongshu's user behavioral-interest space and ad-relevance criteria, turning it into a marketing-content encoder aligned with behavioral preferences, and we enhance semantic retrieval on long-tail queries and low-frequency material through Semantic IDs and I2I expansion.
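One common way to derive the Semantic IDs mentioned above is residual quantization over the encoder's embedding (in the spirit of RQ-VAE); the following is a hedged toy sketch with two fixed codebook levels, not the production quantizer.

```python
def rq_semantic_id(vec, codebooks):
    """Residual quantization: each level picks the nearest code, then the
    residual is passed to the next level, yielding a hierarchical id tuple."""
    sid, residual = [], list(vec)
    for book in codebooks:  # one codebook per level, coarse to fine
        cid = min(book, key=lambda c: sum((r - x) ** 2
                                          for r, x in zip(residual, book[c])))
        sid.append(cid)
        residual = [r - x for r, x in zip(residual, book[cid])]
    return tuple(sid)

codebooks = [
    {0: [1.0, 0.0], 1: [0.0, 1.0]},   # coarse level
    {0: [0.2, 0.0], 1: [-0.2, 0.0]},  # refinement level
]
print(rq_semantic_id([1.1, 0.1], codebooks))
```

The resulting discrete id tuple can then serve as an index key or a token for generative retrieval, which is what makes it attractive for long-tail queries and low-frequency material.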
Over the past year we have iterated continuously in stages four and five, shipping five launches in total and contributing a total of 1+5% CPM to revenue efficiency; the efficiency channels now cover 80% of online clicks and 60% of conversions. The following chapters present the practices, the pitfalls we stepped on, and the thinking behind stages four and five.
Retrieval technology applies to relationship modeling across any modalities, text-to-text, behavior-to-behavior, text-to-image, image-to-image, and can be used both for supervised learning on specific retrieval tasks and for self-supervised mining of deep semantic associations. A solid retrieval-algorithm foundation is therefore essential for long-term, efficient recall iteration.
To settle on our recall technical approach, we keep returning to the following essential questions:
- We need a high-performance, uniformly maintained retrieval framework with a high ceiling on model retrieval capability that, amid GPU scarcity, runs on the readily available A10, T4, L40S, and L20 cards: on the one hand to improve training iteration efficiency and timeliness, cut retrieval-model scoring time, and enlarge the scale of scored candidates; on the other, to build with the engineering engine team a better foundation for subsequent large-model fine-tuning and inference-technology iteration.
- "Liberating the upper limit of the recall model": As the supply and demand matching problems that need to be solved become more and more complex (such as deep transformation targets), inspired by the work of SL2G and binary foils, we found that using multi-layer MLP and Target Attention instead of the inner product distance metric can greatly improve the model recall rate on the deep transformation target task, and still maintain a good distance measurement property.
- Whether the recall model is based on BERT or a DNN, the two-tower paradigm retains advantages in compute and storage efficiency, but its distance metric caps the model's fitting ability. A nonlinear distance metric from a shallow DNN, with Target Attention introduced between the two sides' features, lets the model interact two-sided information at an early stage and approximate the fitting power of full feature crossing, better solving disambiguation and distance computation on sparse data. Attention also makes the recall stage more sensitive to semantic sequences and prior behaviors, greatly raising the effect ceiling of the recall model.
Figure 4 Hierarchical retrieval with a complex model (cf. the "two-dimensional foil" architecture)
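The learned distance described above can be sketched in miniature: instead of an inner product, the two sides interact early (here an elementwise product stands in for Target Attention) and a tiny MLP produces the score. The weights below are toy constants, not a trained model.

```python
import math

def learned_score(u, a, w_hidden, w_out):
    # early two-sided interaction: concatenate both vectors and their product
    feats = u + a + [x * y for x, y in zip(u, a)]
    # one ReLU hidden layer ...
    hidden = [max(0.0, sum(w * f for w, f in zip(row, feats))) for row in w_hidden]
    # ... then a sigmoid output as the "distance" score
    return 1 / (1 + math.exp(-sum(w * h for w, h in zip(w_out, hidden))))

u, a = [0.5, 1.0], [1.0, 0.0]
w_hidden = [[0.1] * 6, [0.2, -0.1, 0.3, 0.0, 0.1, 0.2]]
w_out = [1.0, -0.5]
print(round(learned_score(u, a, w_hidden, w_out), 3))
```

Because this score is no longer an inner product, exhaustive scoring is infeasible at scale, which is exactly why the article pairs it with a hierarchical HNSW-style index that limits how many candidates the model must score.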
- "Focus on the core optimization points": There are many excellent search efficiency improvement methods in academia and industry, and after our practice, some of them are the core optimization points to be solved by Xiaohongshu in the current development stage and material structure, and some are relatively insignificant in our scenario, which are defined as marginal optimization points. From our practice, the core optimization points and marginal optimization points are listed as follows:
Partial-order samples and bias correction
Sample-selection bias skews the ad system toward particular ads: popularity bias, for instance, makes the model favor popular samples or behave almost randomly on samples it has never seen, affecting exposure opportunity and ad effectiveness. Correcting the model through sampling and sample construction is therefore an important direction of iteration.
Full-space partial-order relationships: random negative sampling from the whole sample space alone fails to describe the funnel's selection partial order. Our strategy is in-batch negative sampling mixed with hard negatives: sample organization changes from a full-space shuffle to a request-granularity shuffle, with the competing ads under the same request stored contiguously, so that both global negatives and hard negatives can be sampled and reused within one batch. Combining global and popularity-weighted negative sampling, the model captures the full-space partial order random < filtered < competed < clicked through an order-oriented listwise loss, correcting the modeling target from expected click volume to expected click-through rate and funnel pass rate.
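An illustrative (not production) listwise loss over that partial order: every pair that violates the ordering random < filtered < competed < clicked contributes a hinge penalty. Scores and grades are toy values for one request's in-batch list.

```python
def listwise_loss(scores, grades):
    # pairwise hinge over every ordered pair: if grades[i] > grades[j],
    # the model should score i above j by at least a margin of 1
    loss, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if grades[i] > grades[j]:
                loss += max(0.0, 1.0 - (scores[i] - scores[j]))
                pairs += 1
    return loss / pairs

# one request's list: clicked(3) > competed(2) > filtered(1) > random(0)
scores = [2.5, 2.0, 0.5, -1.0]
grades = [3, 2, 1, 0]
print(round(listwise_loss(scores, grades), 3))
```

Only the clicked-vs-competed pair is inside the margin here, so the averaged loss is small; a production implementation would vectorize this and typically use a softmax-based listwise objective instead of a hinge.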
- Beware Batch Norm: Batch Norm proves crucial in self-supervised tasks such as BYOL, but in pairwise/listwise tasks whose inputs mix positives and negatives, its information-leakage problem must be guarded against; whether at the feature level or in interaction layers, BN creates hidden risks.
Figure 5 In-batch negative sampling with hard negatives
Proportion of hard negatives: negatives that are too hard degrade performance or make the model overfit outright, so we use segmented sampling of downstream auction-stage ads as hard negatives, that is, ads ranked low in fine ranking and ads filtered out by the relevance-admission strategy, with the hard-negative share empirically held at 1% of the negatives.
- A contrastive-learning study of negative-sample difficulty and function shows that the 5% of negatives nearest the decision boundary are the most useful, while the hardest 0.1% are actively harmful (https://arxiv.org/abs/2010.06682).
- Multi-stage learning: to optimize selection-bias correction and decision boundaries at once, some curriculum-learning schemes train with easy negatives in a first stage and then with hard-negative triplets in a second. In our practice this performed about the same as a single training run with mixed negatives, while multi-stage training adds iteration complexity, so we did not adopt it.
More negatives: it is generally accepted that contrastive learning benefits from larger-scale negative sampling, and the more negatives there are, the less the result depends on negative quality. Following this idea we greatly expanded the negative-sampling scale (to thousands for the inner-product distance model; to 128 for the Attention+MLP complex-distance model, limited by training speed) and enlarged the batch of negative-sampling candidates so that the candidate space reaches the ten-thousand level, reducing the probability of sampling collisions. Expanding negatives slows training, but by optimizing the in-batch block negative-sampling operator we raised training efficiency so much that doubling the negative scale actually increased, rather than decreased, effective training speed.
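The collision issue above refers to a negative candidate accidentally sharing the positive's ad id. A minimal sketch of scaled in-batch sampling with collision masking, using toy ids and a fixed seed:

```python
import random

def sample_negatives(pool_ids, positive_id, k, seed=0):
    rng = random.Random(seed)
    # mask collisions: drop any candidate with the same id as the positive
    valid = [i for i in pool_ids if i != positive_id]
    return rng.sample(valid, k)

# a ten-thousand-level candidate pool; "ad42" appears twice (once as the positive)
pool = [f"ad{i}" for i in range(10_000)] + ["ad42"]
negs = sample_negatives(pool, "ad42", k=128)
print(len(negs), "ad42" in negs)
```

With a 10,000-candidate pool, the chance of such collisions is small but nonzero, which is why masking (rather than relying on luck) is the standard practice.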
- Cross-batch: on our dataset, in-batch sampling over ten-thousand-level candidates performs well in both training speed and effect; considering that the new hyperparameters and development complexity introduced by queue caching hinder rapid iteration, we do not use queue caching, cross-batch, or similar methods to expand the negative candidate pool further.
- Dynamic negative-sampling ratio: a semi-quantitative theoretical framework studies the optimal negative-sample size for InfoNCE across tasks (https://arxiv.org/abs/2105.13003).
Positive sample augmentation: Existing feedback data is sparse on long-tail traffic, so the posterior model does not learn the long tail sufficiently, yet long-tail traffic is the main battlefield for recall. Drawing on related work (https://arxiv.org/abs/2405.06932) and PVR-improvement work (https://arxiv.org/abs/2201.12086), we synthesized a batch of semantically consistent long-tail queries for the advertising materials.
- Self-supervised tasks: To further strengthen the model's grasp of deep semantics in the Xiaohongshu scene, mapping long-tail queries to synonymous head and torso queries and long-tail ads to similar head and torso ads so as to improve rewriting and semantic retrieval, we introduce a contrastive self-supervised task on top of BERT's masked cloze pre-training paradigm. Specifically, two copies of the query and ad text descriptions are constructed via element substitution, noise injection, and encoder dropout, and trained so that the two copies are close in representation space and far from other representations. The same self-supervised method is applied to behavior sequences: part of the behavior information is masked to improve the robustness of short-term and long-term user-interest representation extraction (cf. BERT4Rec, https://arxiv.org/abs/1904.06690).
- Although the Xiaohongshu community and commercialization share the same note pool, the community emphasizes ecosystem governance while commercialization emphasizes marketing value, so their exposure structures differ considerably. This makes directly importing global user feedback behavior as positive samples for search ads unsatisfactory. Instead, we bring the semantic and style information of community exposure results into ad recall through a multimodal semantic-representation I2I extension, used as supplementary information rather than as positive samples, which yielded a measurable gain.
Feature & Sequence Modeling:
- ID memorization and transfer learning: Enriching ID features of different granularities on the efficiency-oriented model significantly improved its personalization ability. Adding structured features such as predicted categories and quality scores on top of the raw text input of the semantics-oriented BERT model also significantly improved its recall rate. We further migrated category/ID embeddings trained in the community scenario into the feature layer of the ad recall models; the transferred representations showed little gain for recall-set ordering, and since the external data dependency hinders rapid iteration, transfer learning was ultimately applied only to the conversion-oriented model.
- Behavior-sequence features: Behavior sequences implicitly capture user preferences and habits from a higher-dimensional interest perspective. Especially in the recommendation-guided-search scenario, a user's behavior in the Xiaohongshu community feed effectively supplements search intent and brings significant improvement to the recall task.
TopK selection problem: While recalling effective inventory, recall must also avoid injecting noise that downstream stages cannot distinguish, which gives rise to topK and threshold-truncation problems:
- Co-occurrence and uniqueness analysis plus incremental-value measurement across recall channels: we track the recall increment, noise, downstream pass rate, and quota utilization brought by each recall strategy, as the basis for adjusting iteration direction and quota policy.
- Whether to introduce a bid factor in recall: For conversion-bid ads, the bid has changed from a static private-valuation expression into a cost-performance threshold on CVR/cost. This makes averaged bids nearly meaningless, makes accurate bids hard to introduce at the recall stage, and means over-reliance on price adjustment destabilizes competition and produces more cost overruns. Two approaches exist: one borrows the importance-sampling idea from reinforcement learning, weighting samples by bid and ranking in an approximate eCPM manner; the other estimates the conversion rate (CVR) ahead of time in the recall stage, but such predictions suffer strong selection bias on the recall link. Our solution is therefore to recall on the CTCVR ordering and fuse it with each advertiser's reported conversion cost (CPA) in a combined ranking that approximates eCPM-ordered recall.
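A toy sketch of the CTCVR-times-CPA fusion ranking described above (the field names and the simple multiplicative fusion are illustrative assumptions, not the production scoring formula):

```python
def ecpm_fusion_rank(candidates, top_k=2):
    """Rank recall candidates by CTCVR x advertiser-reported CPA,
    approximating an eCPM ordering without a recall-stage bid model."""
    scored = [(ad["ctcvr"] * ad["cpa"], ad["id"]) for ad in candidates]
    scored.sort(reverse=True)
    return [ad_id for _, ad_id in scored[:top_k]]

ads = [
    {"id": "a", "ctcvr": 0.020, "cpa": 30.0},  # score 0.60
    {"id": "b", "ctcvr": 0.050, "cpa": 10.0},  # score 0.50
    {"id": "c", "ctcvr": 0.015, "cpa": 50.0},  # score 0.75
]
assert ecpm_fusion_rank(ads) == ["c", "a"]
```

Because CPA is reported by the advertiser rather than predicted on the recall link, this fusion avoids the selection-bias problem of estimating CVR-derived value inside recall.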
Contrastive learning
The NCE loss used in contrastive learning is an order-oriented noise-contrastive-estimation loss. By increasing the mutual information of positive pairs and reducing that of negative pairs, it improves the uniformity of the representation distribution, helping the model learn correlation and structural information in the data; it therefore handles large-scale multi-class problems effectively and copes better with data sparsity and imbalance. In the contrastive-learning field, most work achieves excellent self-supervised performance with the NCE loss; some work obtains good results by replacing NCE with a BCE loss on supervised tasks (SigLIP, https://arxiv.org/abs/2303.15343), and in the ranking-model field some work combines BCE and NCE so that pointwise scale calibration is preserved under GAUC-oriented learning (RCR, https://arxiv.org/abs/2211.01494).
Temperature and regularization of InfoNCE: Fine-tuning the temperature parameter can yield significant gains in contrastive learning. A smoother distribution (higher temperature) keeps training from converging prematurely to local optima and is more robust to noisy data, but makes it harder for the model to distinguish difficult negatives. Conversely, a more polarized distribution (lower temperature) widens the gap between positive and negative samples and focuses on difficult negatives, but may exacerbate popularity bias and leave the model more susceptible to noise. Meanwhile, temperatures that are too large or too small both raise the risk of vanishing gradients: when too large, the logits approach a uniform distribution and gradients approach 0 as negatives increase; when too small, the logits approach a one-hot distribution and gradients likewise approach 0. An appropriate regularization strategy is therefore needed to avoid vanishing gradients during training.
- SimCSE's analysis: InfoNCE essentially "flattens" the singular spectrum of the embedding space, alleviating representation degeneration and improving the uniformity of sentence embeddings (https://arxiv.org/abs/2104.08821);
- Adaptive temperature: For users with much noisy feedback, over-attending to difficult negatives is unwise, and the temperature should be raised; for users with clear, sufficient feedback, lowering the temperature improves convergence and discrimination. Concretely, the larger a user's cumulative loss, the larger the temperature, and vice versa (https://arxiv.org/abs/2302.04775);
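The temperature trade-off discussed above can be seen numerically: the same similarity scores become near-one-hot at low temperature and near-uniform at high temperature (a small illustrative check, not the production setting):

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax with max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

sims = [0.9, 0.5, 0.4, 0.1]            # positive first, then negatives
hot = softmax(sims, temperature=0.05)  # polarized: positive dominates
cold = softmax(sims, temperature=5.0)  # smooth: close to uniform
assert hot[0] > 0.99
assert max(cold) - min(cold) < 0.05
```

At temperature 0.05 virtually all probability mass sits on the hardest comparison, which is exactly why a small temperature sharpens hard-negative mining but amplifies any pseudo-negative noise.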
Mitigating sensitivity to pseudo-negatives: When the temperature is small, InfoNCE essentially maximizes the margin between the positive and the hardest negatives, s(q, v+) > max(s(q, v-_1), ..., s(q, v-_n)), which gives InfoNCE a self-discovery advantage on difficult negatives (https://arxiv.org/abs/2012.09740). But when the negatives contain pseudo-negative noise, the model imposes a heavy gradient penalty on potential positives, to the point of hurting convergence and even causing singular-value decay and representation collapse; the training symptom is that negative-sample logits keep rising until AUC falls to 0.5. To address InfoNCE's sensitivity to outliers, on the one hand we expand the batch size to reduce the probability of negative-sampling collisions; on the other hand we add a regularization term on the negative-sample logits that forcibly pushes negatives away, enhancing distribution uniformity and preventing representation collapse. We also refer to the following solutions:
- External-model pseudo-negative recognition: A self-supervised SimCSE model identifies pseudo-negatives during training, filtering out negative samples that are too close to the positives (https://arxiv.org/abs/2205.00656);
- Negative-sample re-weighting: A re-weighting strategy distributes negative-sample weights over a "more reasonable region" instead of fixating only on the hardest samples (https://arxiv.org/abs/2310.11048), or learns global scalars alpha and rho that control the temperature as temperature = alpha * (1 - cos) + rho (https://aclanthology.org/2023.emnlp-industry.72.pdf);
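The similarity-conditioned temperature in the last bullet can be written directly; the alpha and rho values below are placeholders, not the learned scalars:

```python
def adaptive_temperature(cos_sim, alpha=0.2, rho=0.05):
    """Temperature as a function of pair similarity:
    tau = alpha * (1 - cos) + rho. Similar pairs (high cos) get a low
    temperature and a sharper distribution; dissimilar or noisy pairs get
    a higher temperature and softer gradients."""
    return alpha * (1.0 - cos_sim) + rho

assert adaptive_temperature(1.0) == 0.05   # identical pair: tau = rho
assert abs(adaptive_temperature(0.0) - 0.25) < 1e-12  # orthogonal pair
assert adaptive_temperature(0.9) < adaptive_temperature(0.1)
```

The rho term floors the temperature so it never reaches zero, keeping gradients finite even for perfectly aligned pairs.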
Index-affine robust representation: We use the L2 distance between ad representations to build an HNSW (Hierarchical Navigable Small World) hierarchical approximate-nearest-neighbor index. By the geometric properties of the Delaunay graph, even though the retrieval distance-metric function is an Attention+MLP model, it does not break the index distance properties established by the L2-based representation space (https://dl.acm.org/doi/10.1145/3336191.3371830). Meanwhile, to make the complex model retrieve better on the HNSW index, we add small perturbations to the ad representation during training and constrain the distance metric to be consistent before and after perturbation, improving the affinity between the complex model's distance metric and the index's L2 distance.
- After weighing model capability against computing cost, we chose a 128-dimensional embedding as the two-tower representation, since higher dimensions bring diminishing marginal returns. OpenAI's representation approach uses multi-scale representation spaces (MRL, Matryoshka embeddings, https://openai.com/index/new-embedding-models-and-api-updates/), showing that retrieval quality degrades when a representation is simply enlarged, but can increase monotonically with dimension when multiple nested dimension representations are trained simultaneously.
- Adversarial training: Perturbing samples with Gaussian noise along the loss-maximizing direction teaches the model to resist contamination and achieves a degree of representation robustness (https://spaces.ac.cn/archives/7234). In our practice, however, gradient-based adversarial perturbation parameters proved quite sensitive, and the gain in retrieval accuracy on our data was limited.
- Multi-task learning: Prediction tasks for ad industry and category information are added on the item side to constrain the cohesion of representations in semantic space (Que2Search). On the CTCVR conversion target, CTR click data is jointly trained to alleviate data sparsity.
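Returning to the index-affinity point above, the perturbation-consistency constraint can be sketched as an auxiliary loss term (the noise scale and the toy inner-product metric are illustrative assumptions, not the production Attention+MLP metric):

```python
import numpy as np

def consistency_loss(metric_fn, ad_vec, query_vec, sigma=0.01, rng=None):
    """Penalize changes in the learned distance metric when the ad
    representation is slightly perturbed, pulling the complex metric
    toward the smooth behavior of the index's L2 geometry."""
    rng = rng or np.random.default_rng(0)
    noisy = ad_vec + rng.normal(scale=sigma, size=ad_vec.shape)
    return (metric_fn(query_vec, ad_vec) - metric_fn(query_vec, noisy)) ** 2

# With a plain inner-product metric, a small perturbation changes the
# score only slightly, so the consistency penalty is tiny.
dot = lambda q, a: float(q @ a)
q = np.ones(8) / np.sqrt(8)
a = np.ones(8) / np.sqrt(8)
loss = consistency_loss(dot, a, q)
assert loss < 1e-2
```

In training, a term like this would be added to the main retrieval loss so that a metric learned by a deep network stays locally Lipschitz with respect to the ad embedding, which is what HNSW's L2-based neighbor graph implicitly assumes.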
High-performance recall retrieval engine
Referring to the open-source "binary foil" code, we implemented a retrieval framework with a high performance ceiling and easy iteration on top of the TensorFlow computing engine. Adhering to the TF computation-graph philosophy, the retrieval process is fully expressed in the graph, so it integrates naturally with TensorFlow's optimization methods.
Figure 6 Overview of the high-performance recall retrieval engine
Fully graphed retrieval
- On top of TF native operators, the retrieval logic implements several custom operators to solve performance problems in the retrieval process. The three-layer neighbor-diffusion retrieval procedure is fully embedded in the TF computation graph, making retrieval-strategy iteration very flexible.
- The open-source code does not solve real-time indexing, so we designed dedicated data-manipulation operators and an independent index-update link to achieve minute-level, time-sensitive index switching, effectively supporting the rapid growth of the index.
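A language-agnostic sketch of the three-layer neighbor diffusion mentioned above (the toy graph and fan-out are made up; in production this runs as TF custom ops inside the computation graph):

```python
def neighbor_diffusion(graph, seeds, layers=3, fanout=2):
    """Expand a seed set through `layers` rounds of neighbor lookup,
    keeping at most `fanout` neighbors per node per round."""
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(layers):
        nxt = []
        for node in frontier:
            for nb in graph.get(node, [])[:fanout]:
                if nb not in visited:
                    visited.add(nb)
                    nxt.append(nb)
        frontier = nxt
    return visited

graph = {0: [1, 2], 1: [3], 2: [4], 3: [5], 4: [], 5: [6]}
# Node 6 is four hops away, beyond the three diffusion layers.
assert neighbor_diffusion(graph, [0]) == {0, 1, 2, 3, 4, 5}
```

Embedding this loop in the graph means the layer count, fan-out, and scoring at each hop can all be changed by editing graph ops rather than engine code, which is the flexibility the bullet above refers to.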
GPU computing performance optimization
- At the computation-graph level, computing performance and resource-utilization efficiency are greatly improved via TF optimization methods, including but not limited to:
- Placement adjustment, mixed precision, bitmap operator fusion, GPU computation-logic tuning, compilation and operator fusion, etc.
- At the framework level, relying on the industry-leading DeepRec framework, underlying resources can be reused through configuration alone, greatly improving system QPS.
Recall computing infrastructure in the era of large models
- Real-time LLM keyword suggestion: In the B-side advertiser console, we implemented real-time LLM extraction of selling points and keyword suggestions for marketing appeals, accelerating time-to-first-token for batched CoT inference via prefix caching (https://arxiv.org/abs/2402.05099). We are also exploring model miniaturization, such as accelerating 7B-model inference through speculative sampling with a 1B draft model (https://arxiv.org/abs/2302.01318);
- To align a general-purpose large model to the semantic space of ad materials and to ad-relevance standards at low cost, we built multi-GPU training infrastructure for low-cost LLM fine-tuning with the help of open-source libraries, adopting lightweight LoRA and DPO methods that are simpler and place lower demands on the training system.
Figure 7 Overview of the large model training and inference engine
This article has introduced the evolution, practice, and thinking behind Xiaohongshu's search ad recall under a business background of strong semantic constraints, rapidly growing ad materials, and multiple coexisting advertising objectives. First, we clearly defined water-level indicators for recall-algorithm capability and set separate efficiency-improvement targets for head and tail traffic. Next, we evolved from a single semantic objective to joint semantic-efficiency modeling and implemented the data-loop strategy of "semantic exploration, efficiency take-up". Finally, we ran the MLP+Attention complex model on the GPU high-performance retrieval engine and applied it to the three recall models for semantics, clicks, and conversions; online, the efficiency channels cover 80% of clicks and 60% of conversions, and after five phases of optimization, platform revenue (CPM1) grew by +5%.
As large-model capabilities keep improving and inference costs keep falling, it is foreseeable that the current search technology stack will become one of the RAG pathways for large models, and search interaction will shift toward directly providing accurate answers through multi-agent, multi-turn dialogue that revises answers in real time. The reasoning and emergent capabilities of large models not only revolutionize how humans acquire knowledge but also open up the next generation of natural-language human-computer interaction, bringing opportunities for both science and industry. In this change, the search business faces the disruption of large-model technology most directly, and applying large-model technology is the lifeline for maintaining search's future market share.
Looking at the present era of AI-generated everything (AIGX), aerospace advances, and energy revolution, we seem to see the twilight of the traditional search-and-advertising technology stack and its efficiency-improvement paradigm. Rather than waiting for new technology to rescue us, we have moved from hesitation and doubt toward accepting and rebuilding around the large-model revolution, shaking off inertia, growing vigorously, and standing energetically at the center of this new stage of the era.
Kuang time
Algorithm architect for Xiaohongshu search advertising, responsible for the design and development of ad recall strategy, marketing-scenario models, and keyword recommendation in search advertising scenarios.
Jiang Zhe
Head of the recall & keyword-suggestion direction of Xiaohongshu's search advertising algorithm, responsible for the design and development of ad recall strategy, keyword recommendation, and relevance strategy in search advertising scenarios.
Source: WeChat public account "Xiaohongshu Technology REDtech"
Source: https://mp.weixin.qq.com/s/h-zChStPhB7-11YtV5J9fg