Background

As of August this year, Zhihu has more than 200 million registered users, and managing spam has become a greater challenge. In the past, we achieved very good results by continuously upgrading the strategy engine of "Wukong", which combines behavior, environment, resource, text and other dimensions. Recently, we introduced deep learning to recognize spam text, which has taken Wukong's ability to govern spam to a new level.

Problem analysis

We reviewed the spam text currently on the site and found four main forms:

Diversion content: Content that steers readers to off-site contacts. It accounts for about 70%-80% of the spam text in the community. Typical examples include training institutions, beauty services, insurance, and purchasing agents. Diversion content carries contact details such as QQ numbers, mobile phone numbers, WeChat IDs, URLs and even landline numbers. Special kinds of spam also appear around particular events such as the World Cup, Double Eleven and Double Twelve, which are all good opportunities for the black industry to make a lot of money.
Branded content: This type of content has typical SEO characteristics. It generally carries no obvious diversion marker, and the cheating takes the form of a question-and-answer pair: the question asks, for example, which brand is good or which training school is good, and the corresponding answer then makes a recommendation.
Fraudulent content: This kind of content generally impersonates celebrities or institutions, for example in bicycle refund scams, and provides fake customer service numbers in order to defraud readers.
Harassing content: For example, bulk-posted inducement and survey content, which seriously affects the experience of Zhihu users.

The core payoff of this spam is twofold: on the one hand it targets readers within the site, and on the other hand it targets search engines for SEO purposes.

Introduction to algorithms

From an algorithmic perspective, this can be treated as a text classification problem that divides on-site content into two classes: spam text and normal text. There are many commonly used text classification algorithms. Rather than describing each of them in detail, we share some of the questions we ran into when solving the practical problem.

The first question was whether to use RNNs or CNNs. In general, CNNs are hierarchical architectures and RNNs are sequential architectures. CNNs suit tasks that are decided by a few keywords; RNNs suit sequence modeling tasks, such as language modeling, that require flexible modeling based on an understanding of context. This conclusion seems obvious, but there is no supporting article in the current NLP literature. In addition, CNNs are generally faster than RNNs in both training and prediction. Given the main forms of spam described above, keywords appear in both diversion and branded content, and spam detection has relatively strict speed requirements, so we chose a CNN. A typical CNN text classification model is shown in the following figure.
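To make the keyword argument concrete, here is a minimal pure-Python sketch of one convolution filter followed by max-over-time pooling, the core step of a CNN text classifier. It is only an illustration: the three-character alphabet, the one-hot vectors and the hand-set filter are assumptions made for the example, whereas a real model learns its embeddings and filters.

```python
ALPHABET = "abq"  # toy alphabet, assumed for illustration only


def one_hot(ch):
    """Map a character to a one-hot vector over ALPHABET."""
    return [1.0 if ch == a else 0.0 for a in ALPHABET]


def conv_max_pool(embedded, filt):
    """Slide one filter over the sequence, apply ReLU, keep the maximum."""
    k = len(filt)
    responses = []
    for start in range(len(embedded) - k + 1):
        r = sum(
            w * x
            for j in range(k)
            for w, x in zip(filt[j], embedded[start + j])
        )
        responses.append(max(r, 0.0))
    # Texts shorter than the filter produce no window at all.
    return max(responses, default=0.0)


def keyword_score(text, filt):
    """Score a text with a single filter over one-hot character vectors."""
    return conv_max_pool([one_hot(ch) for ch in text], filt)


# Hand-set filter that responds most strongly to the bigram "qq".
QQ_FILTER = [one_hot("q"), one_hot("q")]

print(keyword_score("abqqab", QQ_FILTER))  # 2.0
print(keyword_score("qqabab", QQ_FILTER))  # 2.0
print(keyword_score("ababab", QQ_FILTER))  # 0.0
```

Because only the maximum response is kept, the score is the same wherever the keyword occurs in the text, which is why this kind of model fits content that is decided by a few keywords.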

Next, we had to decide whether to use characters or words as input. Words have a higher level of abstraction and richer meaning than characters. However, the QQ numbers, mobile phone numbers, WeChat IDs, URLs, landline numbers and so on in diversion content usually do not appear in an existing vocabulary, and brand names behave similarly: they are generally out-of-vocabulary words. Diversion content also often uses obfuscated variants, and word-level input does not capture the features shared by such variants well. So we ended up using characters as input.
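The out-of-vocabulary argument can be sketched in a few lines of Python. The training texts, the helper names and the `<pad>`/`<unk>` tokens below are hypothetical and chosen only for illustration. The point is that a contact number never seen in training is unknown as a word, while each of its characters is already in a character vocabulary.

```python
def char_tokens(text):
    """Split text into single characters, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def build_char_vocab(texts, pad="<pad>", unk="<unk>"):
    """Assign an integer id to every character seen in the training texts."""
    vocab = {pad: 0, unk: 1}
    for text in texts:
        for ch in char_tokens(text):
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode_chars(text, vocab, unk="<unk>"):
    """Turn text into character ids, falling back to the unknown id."""
    return [vocab.get(ch, vocab[unk]) for ch in char_tokens(text)]


# Hypothetical training texts and a new text with an unseen number.
train_texts = ["add me qq 1234567890", "wechat 0987654321 consult"]
new_text = "add me qq 2839825539"

char_vocab = build_char_vocab(train_texts)
char_ids = encode_chars(new_text, char_vocab)
char_unknown = char_ids.count(char_vocab["<unk>"])

word_vocab = {w for text in train_texts for w in text.split()}
word_unknown = sum(1 for w in new_text.split() if w not in word_vocab)

print(char_unknown)  # 0: every digit was already seen as a character
print(word_unknown)  # 1: the whole number is an out-of-vocabulary word
```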

After deciding to use characters as input, we had to consider whether to initialize the model's Embedding layer with character vectors pre-trained on the Zhihu site corpus, or to generate the initial vectors randomly inside the classification model. Our consideration was that the data distribution of spam text differs considerably from that of the site's text as a whole, and spam is a relatively specialized domain compared with normal text on the site. So we use randomly initialized character vectors.
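A randomly initialized embedding table is simply a matrix of small random numbers with one row per token. The sketch below assumes a vocabulary of 5000 entries, 64 dimensions and a uniform range of ±0.05; these values are illustrative and do not come from the article. In a real model the rows are then updated by backpropagation together with the rest of the classifier.

```python
import random


def init_embedding(vocab_size, dim, scale=0.05, seed=0):
    """Return a vocab_size x dim table of small uniform random values."""
    rng = random.Random(seed)
    return [
        [rng.uniform(-scale, scale) for _ in range(dim)]
        for _ in range(vocab_size)
    ]


def lookup(table, ids):
    """Replace each token id with its embedding row."""
    return [table[i] for i in ids]


# Hypothetical sizes; the ids stand for an already encoded text.
table = init_embedding(vocab_size=5000, dim=64)
vectors = lookup(table, [3, 17, 17, 42])
```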

After deciding to use character vectors, we observed that key information such as "those who are interested in adding me to consult: 2839825539" and "looking for Beijing, He, He, Tian, Xia" is usually very long when measured in characters. The CNN therefore needs a larger receptive field to extract the relevant text features, but simply enlarging the convolution kernel increases the number of parameters. We therefore considered using dilated convolutions, which enlarge the receptive field of the convolution without increasing the number of network parameters. A typical dilated convolution is shown in the figure below.
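The trade-off can be checked with a small pure-Python sketch of a 1-D dilated convolution. The function names, the kernel size of 3 and the dilation rates 1, 2, 4 are assumptions made for the example; the article does not state the values used in production.

```python
def dilated_conv1d(seq, kernel, dilation=1):
    """Valid 1-D convolution (cross-correlation) with gaps between taps."""
    k = len(kernel)
    span = (k - 1) * dilation + 1  # input positions covered by one output
    out = []
    for start in range(len(seq) - span + 1):
        out.append(
            sum(kernel[j] * seq[start + j * dilation] for j in range(k))
        )
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 layers, one per dilation rate."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf


# Three taps with dilation 2 read positions 0, 2 and 4 of each window.
print(dilated_conv1d([1, 2, 3, 4, 5, 6, 7], [1, 1, 1], dilation=2))
# [9, 12, 15]

# Both stacks have three layers of kernel size 3, so the same weight count.
print(receptive_field(3, [1, 1, 1]))  # 7
print(receptive_field(3, [1, 2, 4]))  # 15
```

With the same number of weights, the dilated stack sees 15 characters per output position instead of 7, which is the property the paragraph above relies on.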

In addition, we observed that the spam text to be identified is not all short; some of it is long. Because of the text length, if the output of the convolutional layer is simply averaged and passed to the fully connected layer, the key features that determine whether a text is spam are likely to be drowned out by other features, making it hard to improve the model's accuracy. We therefore added an Attention layer that gives more weight to the key features. The Attention calculation method is shown in the figure below.
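The difference between averaging and attention-weighted pooling can be sketched as follows. This uses a generic dot-product attention with a single query vector, which is an assumption: the article's own formula is given only in its figure, and in a trained model the query would be learned. The feature values are made up so that one position out of ten carries a strong signal.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(features):
    """Average the feature vectors over all positions."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n for d in range(dim)]


def attention_pool(features, query):
    """Weight each position by softmax(feature . query), then sum."""
    scores = [sum(x * q for x, q in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(query)
    pooled = [
        sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)
    ]
    return pooled, weights


# Nine uninformative positions and one strong "spam" feature at the end.
features = [[0.0, 0.0] for _ in range(9)] + [[5.0, 0.0]]
query = [1.0, 0.0]  # hypothetical; learned in a real model

averaged = mean_pool(features)
pooled, weights = attention_pool(features, query)

print(averaged[0])              # 0.5
print(round(pooled[0], 2))      # 4.71
print(round(weights[-1], 2))    # 0.94
```

Averaging dilutes the single strong feature to a tenth of its value, while the attention weights keep most of it, which matches the motivation given above for long texts.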

Based on the above analysis, we finally adopted the model structure shown in the figure below.

Structure of the spam text classification algorithm

Model results

At present, the spam text model scores all content on Zhihu, outputting a score between 0 and 1, and the system then handles the high-scoring content.

The performance of the model score in some of Zhihu's business lines

Text type    Accuracy (score >= 0.9)    Accuracy (score >= 0.8)    Accuracy (score >= 0.7)
Answers      100.0%                     99.8%                      95.6%
Questions    100.0%                     99.1%                      97.7%
Comments     100.0%                     99.6%                      98.0%
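A figure like those in the table can be computed from human-reviewed samples as the share of true spam among all content at or above a score threshold. The sketch below assumes that this is what "accuracy" means here (that is, precision among flagged content). The sample data is invented for illustration and is not Zhihu's.

```python
def precision_at_threshold(scored, threshold):
    """Share of true spam among items whose score is >= threshold.

    scored is a list of (score, is_spam) pairs with reviewed labels.
    Returns None when nothing reaches the threshold.
    """
    flagged = [is_spam for score, is_spam in scored if score >= threshold]
    if not flagged:
        return None
    return sum(flagged) / len(flagged)


# Invented sample: (model score, reviewer says it is spam).
reviewed = [
    (0.95, True),
    (0.92, True),
    (0.85, True),
    (0.81, False),
    (0.75, True),
    (0.30, False),
]

for threshold in (0.9, 0.8, 0.7):
    print(threshold, precision_at_threshold(reviewed, threshold))
```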

Currently, combined with other anti-fraud dimensions, the model can be used to delete content with a spam score of 0.5 or higher at an accuracy of more than 97%. Since launch, it has deleted thousands of pieces of spam every day.
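One way to combine the model score with other signals is a two-tier rule: act on very high scores alone, and act on medium scores only when another anti-fraud dimension also fires. The sketch below is hypothetical. The function name, the `other_signal_hit` flag and the 0.9 cut-off for acting on the score alone are assumptions; only the 0.5 threshold for the combined case comes from the text above.

```python
def should_delete(spam_score, other_signal_hit, high=0.9, combined=0.5):
    """Hypothetical deletion rule combining score and other signals."""
    if not 0.0 <= spam_score <= 1.0:
        raise ValueError("spam_score must be between 0 and 1")
    if spam_score >= high:
        return True  # the score alone is treated as sufficient
    # Medium scores need support from another anti-fraud dimension.
    return spam_score >= combined and other_signal_hit


print(should_delete(0.95, other_signal_hit=False))  # True
print(should_delete(0.60, other_signal_hit=True))   # True
print(should_delete(0.60, other_signal_hit=False))  # False
print(should_delete(0.30, other_signal_hit=True))   # False
```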

Models are processed in real time

In addition, it is worth mentioning that during the Dragon Boat Festival a wave of illegal spam appeared on Zhihu. The spam text model covered more than 98% of that content, so this wave of attacks lasted only about 1000 before stopping.

Dragon Boat Festival spam attack

Follow-up plans

Spam text recognition is a long-term contest between attack and defense. Spam on the site will keep evolving over time, and the effectiveness of the existing model will change accordingly. To meet this challenge, we will keep collecting bad cases to further optimize the model.

Author: Shi Le

Source: https://zhuanlan.zhihu.com/p/46877662

Zhihu anti-cheat spam text recognition
