天天看點

Millisecond Marketing with Trillions of User Tags Using PostgreSQL

Millisecond Marketing with Trillions of User Tags Using PostgreSQL

<b>By taking advantage of just two small PostgreSQL functions, we are able to solve one of the biggest long-term problems plaguing the industry.</b>

The recommendation system is the backbone of any advertising platform. A proper recommendation system must be precise, real-time, and highly efficient.

But which advertisement platform is the best one? Which advertisement platform has the strongest core? To have an effective recommendation system, it needs to have three fundamental features: precise tagging, real-time tag updating, and efficient tagging.

Precision refers to the ability to draw distinct, precise user portraits based on massive amounts of user behavior data. This profiling system is also called a tag system. Effective recommendation often hinges on accurate tagging. For example, you likely would not market a pair of reading glasses to the average young man, unless his purchase intention tags demonstrate that he might wish to purchase a pair of reading glasses.

Tags must be updated in real time. Many tags are extremely time sensitive, for example marketing to certain customers may only be effective within a certain time window. Likewise, a certain product may be appropriate one minute and less so the next. If your tags are generated every other day or longer, you may miss the best recommendation opportunity. Therefore, it is critical for a recommendation system to operate in real time.

Efficiency refers to the system's ability to concurrently react and adapt to user behavior. After all, those of us in the advertisement industry need to obtain information as quickly as possible. If you have several clients purchasing advertisements, your system's capacity for concurrency will be tested before long.

An advertisement platform may only be competitive if it possesses the above three core capabilities. Of course, cost is also a concern, particularly hardware, development, and maintenance costs.

We will use an E-commerce recommendation system as an example to demonstrate database design and optimization techniques for recommendation systems.

For example, how does a store discovers target customers?

To answer this question, we first need to collect some data. For example:

1.Users who browse and make purchases at this and other similar stores. Browsing online stores generates behavior records, for instance the stores and commodities we browse, the products we purchase, the stores where we make those purchases, and the times at which all of this happens. Then, for each store, a sample of users and browsed and purchased there can be extracted.

2.After obtaining these user groups, you can filter them for similar purchase intention or other attributes. You can analyze the attributes of these groups to form a larger target group and market to them.

These are just two simple deductions that can be made concerning E-commerce recommendation systems.

The number of e-shoppers may reach several billion worldwide, and stores could reach the order of tens of millions.

The number of available products (divided into subcategories like bar codes) might even reach several hundred million.

Store tags for a single user may include the stores the user visited, the time of visit, products viewed and number of times they were viewed, and finally the products purchased by the user. Hundreds of such tags can be generated for a person over time.

When the number of user tags reaches several hundred of thousand, we get a clearer picture of the user and his/her habits.

Estimates show that we could see trillions of user_tags in the near future (several billion users, each with hundreds of tags related to stores/products).

First, we need to organize key data.

Data include user ID, browsed stores and times, browsed products and times, and purchased products and quantity (times/quantity can be set to represent a range. For example, 0 can represent any quantity less than 100, 1 can be a quantity between 100 and 500, etc.).

These data points can be easily generated from user behavior data, and in turn combined to obtain a target customer group for store or product specific advertisement. For example, we may use this data to define a group of users who browsed haircare related products ten times in one week.

1.Generate IDs for stores and products.

2.Track browsing times and quantity of purchased products.

Each user may browse and purchase multiple products over a certain period of time. If a record is generated for each store and product, we will wind up with a massive number of records. This can be a waste of space and negatively impact query efficiency.

Arrays can be used in PostgreSQL to make these tasks possible while saving storage space. It also allows for array indexes, which improves query efficiency.

3.The table structure is as follows:

3.1Range table, including dozens or even hundreds of records

Field and description

3.2User tag table

3.3Step-wise times

Browsing times and purchase quantity are continuous values. It is suggested that they be step-wise for better mining.

According to the design in item 3.1, 1-10,000 constitutes a step, and 10,001-100,000 constitutes the next step.

Example

3.4Combine product and store IDs and the step into a new value - method 2

The value is stored in text[], for example store ID:OFFSET. A text array (text[]) is less efficient than an integer array (INT[]), and it requires more storage space.

However, if you can handle the downsides, text[] does require slightly less development effort.

3.5Combine product and store IDs and the step into a new value - method 2

The first method uses a text array, while the second method uses a more efficient int/int8 array.

A formula must be used to make the int/int8 value express two meanings (the original store, product ID, and step).

The formula design (formula and constant) is as follows:

Take the number of visits to the store (s1) field as an example:

Example (step is 19, and new_start_val is 1)

4.Shard

For example, it is suggested that each 5 million or 10 million users be organized into a partition. Query efficiency can be improved using parallel query.

You are recommended to use parallel query (with plproxy, a connection is given for each partition to carry out parallel query) to quickly retrieve all users.

If you want to obtain users quickly, and need stream return, you should use inheritance (if it is a multi-node, postgres_fdw+pg_pathman or postgres_fdw+inheritance is used) to return via a cursor.

Browsed stores and products, purchased products, and the billions of users tracked over the specified time interval are all aggregated into a tag array to generate over a trillion user tags.

What kind of performance can we expect when parsing billions of user groups based on tags? Since we are using indexing, we can keep the processing time to around 10 ms as long as we also use stream return. Is it starting to feel like the analytics industry is going to be measured in milliseconds?

If you're not familiar with PostgreSQL, you may find these results surprising, but you'll quickly get used to it as you come into contact with PostgreSQL more frequently. It provides a whole host of functions to help you solve a wide variety of problems

We've already explained how to efficiently obtain user data, now let's look at how we can update user tags in real time.

The goal of stream processing is to update user tags in real time. For example if a user generates hundreds of bytes of browsing records every second, then these records need to be merged into the appropriate tags.

With a hundred million or more active users, these update streams need to process up to a trillion updates a day, so how can we update the database in real time? A lot of users probably use the T+1 method and simply give up on real-time responsiveness.

However in reality it isn't an unreasonable thing to ask. For example we can use the stream processing function in PostgreSQL to update super high traffic streams.

You may be wondering whether the database is capable of processing such a stream. How is the data updated in real time in the database?

An open-source product called PipelineDB (based on PostgreSQL and compatible with PostgreSQL) in PostgreSQL is designed to do just this. It helps you merge data in real time (you can set the time interval or the number of accumulated updated rows), and carries out persistent actions after reaching the threshold; otherwise, it first updates continuously in the memory.

There are two articles available for reference

<a href="https://github.com/digoal/blog/blob/master/201512/20151215_01.md">Application of Stream Processing for the "Internet of Things (IoT)" - Use PostgreSQL for Real-time Processing (Trillions Each Day)</a>

<a href="https://github.com/digoal/blog/blob/master/201612/20161220_01.md">Ever-present StreamCompute - Combining PostgreSQL and PipelineDB for IoT</a>

Of course, if it isn't necessary to process data in real time, and T+1 meets your requirements, then it's not necessary to use Pipeline DB.

We know that target customer discovery can only be completed when the tag data has been submitted to the database, so if we don't use stream calculation, rather use frames like JSTROM, then a layer of updates will be ignored. For example, 100 billion updates could become only 100 million.

However, external stream processing does bring with it a number of challenges.

1.JSTROM requires additional computing resources without adding parallel efficiency compared to PipelineDB.

2.Furthermore, it does not provide faster queries than StreamCompute when directly inserting data into the database.

3.It also increases development costs.

For the stress testing phase I chose a machine with 32 cores, 2 SSD cards, and 512GB memory.

It stores 3.2 hundred million users. Each user includes 4 array fields. Each field includes 1,000 elements, i.e., 4,000*3,200 million = 1,280,billion user_tags.

There are 10 tables, each with 10 million users, and 4 tag fields stored with tsvector.

Use a rum index.

Generate the test data script

A tag is generated by the process of combining 5 million unique IDs and 20 unique IDs, and each tsvector stores 1,000 such combinations.

There are 10 tables, each with 10 million users, and 4 tag fields stored in a text[].

Here we use a GIN index. Other factors are the same as in example 1.

There are 64 partition tables, each with 5 million records. An int array is used to store a total of 4 million tags, and each user has 4,000 random tags to ensure that there are enough target customers.

We also use a GIN index for target customer discovery.

Script for generating test data

Begin generating test data

Once the data has been inserted, merge the pending list.

Execute vacuum analyze or gin_clean_pending_list. Refer to

<a href="https://www.postgresql.org/docs/9.6/static/functions-admin.html#FUNCTIONS-ADMIN-INDEX">https://www.postgresql.org/docs/9.6/static/functions-admin.html#FUNCTIONS-ADMIN-INDEX</a>

<a href="https://www.postgresql.org/docs/9.6/static/sql-vacuum.html">https://www.postgresql.org/docs/9.6/static/sql-vacuum.html</a>

<a href="https://www.postgresql.org/docs/9.6/static/gin-implementation.html#GIN-FAST-UPDATE">https://www.postgresql.org/docs/9.6/static/gin-implementation.html#GIN-FAST-UPDATE</a>

Carry out pressure testing on example 3

1.Complete target customer discovery in 10 ms.

For example, search for groups in which s1 includes 3 and s2 includes 4.

We only found a few groups and they are not representative. We need to find more groups.

2.Paging via a cursor, as mentioned above.

3.Stream return, as mentioned above.

4.Parallel batch return

In this step, the plug-in plproxy can be used to specify a parallel for each partition to implement parallel batch return. What are the best results we can expect?

For serial query, all shard tables are queried sequentially, so the operation can take quite a while, 113 ms for 15,221 target customer discoveries, for example.

Parallel query performance is equivalent to the time required for a single partition, usually about 1ms.

Therefore, parallel query can greatly improve overall performance.

References

<a href="https://github.com/digoal/blog/blob/master/201005/20100511_01.md">Use Plproxy to Design a PostgreSQL Distributed Database</a>

<a href="https://github.com/digoal/blog/blob/master/201110/20111025_01.md">A Smart PostgreSQL Extension plproxy 2.2 Practices</a>

<a href="https://github.com/digoal/blog/blob/master/201608/20160824_02.md">PostgreSQL Best Practices - Vertical Database Shard (Based on plproxy)</a>

However parallel queries are unnecessary if you intend to use stream return.

When we are dealing with several billions of users, we can shard them by user ID and then use multiple hosts.

Of course, there's no need to use multiple hosts if your machine has enough space and CPU cores to meet your needs.

What methods can we use to implement multiple hosts? Refer to the following articles for a few simple and quick solutions.

You can think of the node postgres_fdw as the TDDL or DRDS of MySQL. It supports cross-node JOIN, and condition, order, and aggregation push down, making it just as convenient as TDDL/DRDS.

postgres_fdw is stateless, and only stores a structure (a distribution rule), making it easy to horizontally expand the node postgres_fdw.

<a href="https://github.com/digoal/blog/blob/master/201610/20161004_01.md">PostgreSQL 9.6 Modularization, Sharding (Based on postgres_fdw) - Supporting Forward Pass on the Kernel Layer</a>

<a href="https://github.com/digoal/blog/blob/master/201610/20161005_01.md">PostgreSQL 9.6 Sharding + Modularization (Based on postgres_fdw) Best Practices - Design and Practices for General Database Sharding</a>

<a href="https://github.com/digoal/blog/blob/master/201610/20161027_01.md">PostgreSQL 9.6 Sharding Based on FDW &amp; pg_pathman</a>

<a href="https://github.com/digoal/blog/blob/master/201610/20161024_01.md">PostgreSQL 9.5+ Efficient Implementation of Table Partitions - pg_pathman</a>

This requirement falls into the realm of PostgreSQL. Actually, it is simply a location-based KNN query. PostgreSQL can easily meet this requirement via GiST index.

After data sharding, PostgreSQL can still obtain results quickly via Merge Sort.

For example,

<a href="https://yq.aliyun.com/articles/2999">Performance of PostgreSQL for Querying Neighboring Geographical Locations Among Tens of Billions of Data Points</a>

A 12-core machine returns each request in about 0.8 ms. There are about 80,000 TPSs.

Let's get back to three core questions of the target customer system.

We did not go over precision in this article as it belongs more to the realm of the data mining system. We will address precision in a future article (Using the PostgreSQL, Greenplum MADlib Machine Learning Library).

The real-time requirement refers to the ability to update tags in real-time. The solution provided in this article is based on stream processing in the database, which, compared to external stream processing, takes up fewer resources, reduces development costs, improves development efficiency, and is more responsive.

As for efficiency, the solution provided here utilizes PostgreSQL along with GIN indexes of array data to achieve target customer discovery in a matter of milliseconds, even when dealing with trillions of user tags.

<a href="https://github.com/digoal/blog/blob/master/201606/20160621_01.md">For the Tribe - How to Leverage PostgreSQL to Pair Genes to Improve Future Generations</a>

<a href="https://github.com/digoal/blog/blob/master/201612/20161216_01.md">Secrets to Analysis Acceleration Engines - LLVM, Column-Store, Multi-core Parallel, Operator Multiplex Integrate - Open the Treasure Chest of PostgreSQL</a>

<a href="https://github.com/digoal/blog/blob/master/201612/20161213_01.md">Analysis on Requirements for Finance Risk Control, Police Criminal Detection, Social Relation, Networking Analysis, and Database Implementation - PostgreSQL Graph Database Scenario Application</a>

<a href="https://github.com/digoal/blog/blob/master/201612/20161205_02.md">Real-time Data Exchange Platform - BottledWater-pg with Confluent</a>

<a href="https://github.com/digoal/blog/blob/master/201611/20161126_01.md">Application of PostgreSQL in Video and Image De-duplication and Image Search Services</a>

<a href="https://github.com/digoal/blog/blob/master/201610/20161021_01.md">Create Real-time User Portrait Recommendation System Based on Alibaba Cloud RDS PostgreSQL</a>

<a href="https://github.com/digoal/blog/blob/master/201611/20161124_02.md">PostgreSQL and Thoughts on Winning Train Tickets at 12306</a>

<a href="https://github.com/digoal/blog/blob/master/201611/20161124_01.md">Access Control Advertisement Sales System Requirement Analysis and PostgreSQL Database Implementation</a>

<a href="https://github.com/digoal/blog/blob/master/201611/20161114_01.md">Technology used in Double 11 Shopping Events - Logistics and Dynamic Path Planning</a>

<a href="https://github.com/digoal/blog/blob/master/201611/20161115_01.md">Technology used in Double 11 Shopping Events - Word Division and Search</a>

<a href="https://github.com/digoal/blog/blob/master/201611/20161117_01.md">Technology used in Double 11 Shopping Events - Sniping Technology, Down to the Second</a>

<a href="https://github.com/digoal/blog/blob/master/201611/20161118_01.md">Technology used in Double 11 Shopping Events - Millisecond Word Division, Test Regularity and Similarity</a>

<a href="https://github.com/digoal/blog/blob/master/201610/20161001_01.md">PostgreSQL 9.6 Guides Open-source Database to Tackle Multi-core Parallel Computing Puzzles</a>

<a href="https://github.com/digoal/blog/blob/master/201609/20160929_02.md">PostgreSQL's Past and Present</a>

<a href="https://github.com/digoal/blog/blob/master/201609/20160906_01.md">How to Create a GIS Test Environment - Import OpenStreetMap Sample Data into PostgreSQL PostGIS Library</a>

<a href="https://github.com/digoal/blog/blob/master/201506/20150601_01.md">PostgreSQL Database Security Guide</a>

<a href="https://github.com/digoal/blog/blob/master/201605/20160523_01.md">Secrets of PostgreSQL 9.6 – A Single Bloom Algorithm Index to Support Query for Any Column Combination</a>

<a href="https://github.com/digoal/blog/blob/master/201607/20160725_01.md">PostgreSQL Uses Recursive SQL to Find Dependency among Database Objects</a>

<a href="https://github.com/digoal/blog/blob/master/201612/20161203_01.md">Use PostgreSQL to Describe One's Life's Climax, Urine Point, and Low Point - Window/Frame or Slope/Derivative/Curvature/Calculus?</a>

<a href="https://github.com/digoal/blog/blob/master/201612/20161201_01.md">Use PostgreSQL to Make up 618 s of Lost Time - Convergence of Recursive Optimization</a>

<a href="https://yq.aliyun.com/articles/50922">Advantages of PostGIS in O2O Application</a>

繼續閱讀