Reading guide: In the digital era, traffic data has become the key for enterprises to gain insight into user behavior, optimize product decisions, and improve operational efficiency. With its advantages in real-time performance, high throughput, ease of use, scalability, and security, Alibaba Cloud Hologres can help enterprises fully exploit the value of traffic data, achieve refined operations and product decision optimization, and improve overall operational efficiency and market competitiveness.
In this article, Mr. Yu Wenbing (Jingwen) from Alibaba Cloud Computing Platform Business Unit shares the application of Alibaba Cloud Hologres in traffic analysis scenarios, revealing how it can help enterprises achieve refined operations and product decision-making optimization through high-performance real-time data warehouse technology.
The main content includes the following parts:
1. Introduction to traffic analysis scenarios and pain points: the importance of analyzing traffic data and the challenges faced by enterprises in traffic analysis, including development efficiency, O&M efficiency, and business response speed.
2. Introduction to the core capabilities of Alibaba Cloud Hologres in traffic scenarios: introduces the core technical advantages of Alibaba Cloud Hologres, including multiple traffic analysis functions, powerful real-time analysis performance, flexible storage and computing capabilities, and efficient point query capabilities.
3. Best practices for traffic analysis: shares best practices for traffic analysis based on Hologres, covering data collection, real-time processing, data storage, and visualization, and discusses in detail the application and advantages of Hologres in scenarios such as funnel analysis, retention analysis, path analysis, and attribution models.
Speaker|Alibaba Cloud Computing Platform Business Division-Yu Wenbing (Jingwen)
Edited by Wenlin
Content proofreading|Li Yao
Produced by DataFun
01
Introduction to traffic analysis scenarios and pain points
1. Introduction to traffic analysis scenarios
There are many dimensions that can be analyzed in traffic data, including user behavior analysis, traffic conversion analysis, advertising effect analysis, advertising attribution analysis, AB experimental analysis, algorithm model analysis, real-time risk control analysis, etc.
The most common data tracking event model is a data model built based on the various operation behaviors of users in the product, the core of which is to abstract user behavior into Event entities and comprehensively describe these events through five key elements: Who, When, Where, What, and How.
2. Data warehouse model in traffic analysis scenarios
In traffic analysis data warehouse modeling, there are two main model methods to adapt to different data processing and query requirements:
Strong schema model (wide table pattern): It is characterized by fixed columns and types, and the data undergoes a strict ETL (extraction, transformation, loading) cleaning process to match the wide table model of the data warehouse. It is suitable for scenarios that require strict data consistency and integrity, and require frequent complex queries.
Weak schema model: data is loaded into the data warehouse in its original form and the processing pipeline is relatively simple, similar to the ELT (extract, load, transform) mode. It is suitable for scenarios where data schemas change frequently, or where raw data needs to be retained for subsequent analysis.
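To make the two modeling approaches concrete, here is a minimal, hypothetical sketch (the event fields and column names are invented for illustration) of the same tracking event stored under a strong schema versus a weak schema:

```python
import json

# One hypothetical click event from the tracking SDK.
raw_event = {
    "user_id": "u_1001", "event": "click", "ts": "2024-05-01 10:00:00",
    "page": "home", "button": "buy_now", "ab_group": "B",
}

# Strong schema (wide table): every attribute is a dedicated, typed column,
# so the ETL step must map each key to a known column (or drop it).
WIDE_COLUMNS = ["user_id", "event", "ts", "page", "button", "ab_group"]
wide_row = tuple(raw_event.get(col) for col in WIDE_COLUMNS)

# Weak schema: only the core dimensions become columns; everything else is
# kept verbatim in a JSON "properties" field (ELT style), so newly added
# tracking attributes need no DDL or pipeline change.
CORE_COLUMNS = ["user_id", "event", "ts"]
weak_row = {
    **{col: raw_event[col] for col in CORE_COLUMNS},
    "properties": json.dumps(
        {k: v for k, v in raw_event.items() if k not in CORE_COLUMNS}
    ),
}
```

The trade-off is visible even in this toy form: the wide row is cheap to query but brittle to schema changes, while the weak-schema row absorbs new fields at the cost of extracting them at query time.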
3. Challenges in traffic analysis scenarios
The data warehouse team works with a unified event-tracking platform to process data: data engineers typically flatten the data in Flink and store it as wide, strong-schema tables in Doris, ClickHouse, or similar data warehouses to serve queries and analysis. The main challenges on the business side boil down to three aspects:
- Low development efficiency
- Tracking-event testing, Flink code, and data warehouse development all need to define fields, data types, lengths, constraints, and so on for a very long list of columns.
- O&M efficiency is low
- Upstream tracking schemas change periodically, with fields added and deleted, forcing frequent code adjustments in Flink jobs and the data warehouse.
- Slow business response
- New business requirements involve the data warehouse team, the R&D team, and coordinated code releases, so the business response cycle is long.
In addition to the business side, traffic tracking analysis also brings many challenges to the real-time data warehouse platform technology, mainly focusing on five aspects: data warehouse model, storage expansion capability, write capability, query capability, and high availability capability.
- Data warehouse model
- Traffic tracking logs contain many fields, requiring efficient and flexible handling all the way from weak-schema ingestion through processing, storage, and analysis.
- Storage scalability
- In the traffic tracking scenario, log data reaches the petabyte level and single-table partitions hold billions or tens of billions of rows, requiring separation of storage and compute.
- Write capacity
- The data volume is extremely large, requiring strong high-throughput real-time write and offline import capabilities.
- Query capability
- Performance for arbitrary multi-dimensional OLAP queries, UV calculation over arbitrary date ranges, funnel analysis, retention analysis, and path analysis.
- High availability
- In scenarios with large data volumes, an isolation mechanism between different business queries and high availability are needed to ensure service stability.
02
Alibaba Cloud Hologres traffic analysis core capabilities
Hologres is a one-stop real-time data warehouse product developed by Alibaba Cloud, providing high-performance real-time OLAP analysis, flexible storage and computing, efficient point queries, advanced data processing features, convenient interactive data lake analysis, and efficient data synchronization and ecosystem compatibility. Its core technologies and capabilities include:
- High-performance, real-time OLAP analytics
It supports high-performance real-time writes and updates, with data queryable immediately after it is written, significantly improving the timeliness of data discovery and mining.
Provides multiple storage modes, including column-store, row-column coexistence, and rich indexing policies to meet the needs of different business scenarios.
Distributed storage and parallelized execution are used.
Primary-key updates and partial (local) updates are supported.
It provides high-performance queries through a vectorized engine, lightweight coroutines, and other optimizations.
- Equipped with a variety of online services
It supports row-column coexistence and high-QPS KV point queries, so a single copy of data serves both multi-dimensional analysis and KV point lookups.
It supports vector retrieval (such as Proxima) and implements read/write splitting, isolating high-performance, high-QPS serving queries from OLAP analysis.
- Supports interactive analysis of data lakes and data warehouses
Hologres can interactively query data lake and data warehouse tables (such as MaxCompute and OSS) in seconds, eliminating data movement and enabling rapid data acceleration.
It synchronizes millions of rows of data per second, automatically discovers metadata, and improves users' production and development efficiency.
- Rich ecosystem compatible
Relying on the PostgreSQL ecosystem, it is compatible with mainstream BI tools and supports query, analysis, visualization, and other operations on the data warehouse.
It supports PostgreSQL development syntax, provides standard SQL capabilities and rich BI tool extensibility, and has strong ecosystem capabilities.
1. Fixed Plan high-performance writes and updates
Hologres has significantly improved its real-time write capabilities, especially through the Fixed Plan mode for efficient data writes. This mode deeply optimizes the data write path and writes batches directly to the storage engine.
On the 128C instance, we tested four different write scenarios, and the specific performance is as follows:
- Blind writes (append-only mode): up to 2.3 million RPS.
- Writes with a primary key in insert-or-ignore mode: up to 2 million RPS.
- Upsert by primary key: up to 800,000 RPS.
- Upsert by primary key when the data already exists (update-heavy, large data volume): up to 700,000 RPS.
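The semantics of these write modes can be sketched with a toy in-memory model (this illustrates the conflict-handling behavior only, not Hologres' implementation; names are invented):

```python
def write(store, key, value, mode):
    """Toy model of keyed write modes on a dict standing in for a table.

    'insert_or_ignore' keeps the existing row on key conflict; 'upsert'
    replaces the whole row; 'partial_upsert' merges only the provided
    columns (analogous to a primary-key local/partial update).
    """
    if mode == "insert_or_ignore":
        store.setdefault(key, dict(value))
    elif mode == "upsert":
        store[key] = dict(value)
    elif mode == "partial_upsert":
        store.setdefault(key, {}).update(value)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return store

table = {}
write(table, 1, {"pv": 1, "uv": 1}, "insert_or_ignore")
write(table, 1, {"pv": 99, "uv": 99}, "insert_or_ignore")  # ignored: key exists
write(table, 1, {"pv": 5}, "partial_upsert")               # only pv changes
```

Append-only mode is fastest precisely because it skips this conflict check entirely, which is why its RPS ceiling is the highest of the four.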
These write capabilities demonstrate that Hologres can meet the requirements of high-performance data writing in current traffic scenarios.
2. Multiple performance optimizations: V2.2 test results improved by 100%
Hologres achieved the world's No. 1 result in the TPC-H standard test, about 23% ahead of the second place, demonstrating Hologres' technical prowess and competitive advantage in this area.
In V2.2, the query optimizer and query engine were improved: the total TPC-H 1 TB query time using 96 CU was 223.08 seconds in V1.1, versus 111.53 seconds in V2.2, a 100% performance improvement.
3. Warehouse mode for elasticity and high availability
Since version 2.0, Hologres has been continuously optimized to improve user convenience and system stability. The current compute group (warehouse) mode builds on the read/write isolation of 1.x, further supporting write/write isolation in addition to read/read isolation. Users can freely partition resources by workload, such as offline import, ETL processing, and Flink real-time writes, and dedicate compute groups to OLAP queries and high-QPS point lookups. All compute groups share storage at the bottom layer, support multi-instance deployment within a region, and replicate data automatically, improving cost efficiency and fault isolation. Resources are physically isolated and can be flexibly scaled up or down when performance bottlenecks or resource shortages are encountered. In addition, if a warehouse becomes fully loaded due to complex SQL or similar causes and threatens the production environment, the affected queries can be quickly switched to another compute group so that business stability is not affected.
4. Stream-batch unified Dynamic Table
The upcoming Hologres Dynamic Table supports unified stream and batch processing, which will greatly improve the integration of real-time and offline data processing, especially in traffic analysis fields such as ad attribution. At its core, Hologres Dynamic Table enables real-time computing and batch processing over the same SQL and the same source tables: a single engine processes both stream and batch data, reducing component dependencies and improving O&M and computing efficiency. In traffic data processing, Dynamic Table supports automatic layered pre-aggregation in the warehouse, from the DW detail layer to the DWS light-aggregation layer to the ADS application layer, significantly improving development efficiency, manageability, and cost control.
5. Multiple traffic analysis scenario functions
Hologres excels in traffic analysis scenarios, especially funnel analysis, retention analysis, tag/portrait analysis, and user behavior tag analysis. Compared with open-source products such as ClickHouse and Doris, Hologres has significant advantages in funnel, retention, and path analysis as well as tag/portrait analysis, and provides rich function support, such as retention analysis with overall summaries, day-by-day expansion, and natural-day windows. In addition, Hologres supports Roaring Bitmap functions to accelerate large-scale computations such as UV counting, and provides powerful BSI function support, further improving convenience and efficiency for users.
03
Best practices for traffic analysis scenarios
Let's share some of our best practices for traffic analysis scenarios.
1. A typical real-time data processing architecture based on Hologres
Data collection: Collect user behavior data through tracking technology.
Message queue: Sends collected data to a message queue (such as Kafka or Alibaba Cloud's DataHub) for asynchronous processing and scaling.
Real-time processing: You can use real-time stream processing frameworks such as Flink to consume data in queues for real-time processing.
Data storage: The processed data is stored in real-time data warehouse products such as Hologres to support fast query and data analysis.
Data services and visualization: based on the real-time data warehouse, it serves data queries and supports data visualization, helping users understand data more intuitively.
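The five stages above can be sketched end to end with in-process stand-ins (all names are hypothetical: a deque plays the message queue, a function plays the Flink job, and a list plays the warehouse table):

```python
from collections import deque

queue = deque()   # stand-in for Kafka / DataHub
warehouse = []    # stand-in for a Hologres table

def collect(event):
    """1. Data collection: the tracking SDK emits an event,
    2. which is published to the message queue."""
    queue.append(event)

def process_and_store():
    """3. Real-time processing (the 'Flink job'): consume the queue,
    apply a trivial cleansing step, and
    4. store the result in the warehouse."""
    while queue:
        event = queue.popleft()
        event["event"] = event["event"].lower()  # normalize event names
        warehouse.append(event)

collect({"user_id": "u1", "event": "Click"})
collect({"user_id": "u2", "event": "PAY"})
process_and_store()

# 5. Data services / visualization: a metric the BI layer could chart.
daily_pv = len(warehouse)
```

In a real deployment each stand-in is a separate system, and the value of the queue is exactly the decoupling this sketch glosses over: collection keeps running even when processing lags.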
- Data warehouse model design - weak schema pattern
In a Hologres-based traffic analysis data warehouse model, the weak schema pattern is strongly recommended. Under this model definition, commonly used columns are kept as dedicated high-frequency filter fields, while less frequently used columns are consolidated into a JSONB column. This flexibility helps improve development efficiency and responsiveness to data changes.
Due to the large amount of traffic data, the query is usually not performed on a full basis, but on a periodic basis (for example, the last 1 day, 7 days, 30 days, etc.). Therefore, we recommend that you make full use of the cold storage capabilities of Hologres and store data with low query frequency in cold storage to reduce storage costs. At the same time, cold storage can also ensure that a large amount of data can be effectively managed without affecting query performance.
Since most tracking data is event-based, the event name is a good filter. We recommend creating a clustering index in Hologres on the event name to improve query efficiency. At the same time, store the data in column-store format and combine it with dynamic partition management (such as TTL settings) to optimize query performance and data management.
- JSONB virtual widening of semi-structured data
For JSONB data, we have implemented an optimization feature to improve the efficiency of development, O&M, and business iteration, while solving the shortcomings of JSONB management. The main features include:
Virtual widening of JSONB data: the JSONB data is virtually widened daily to generate a view. This view dynamically lists all fields inside the JSONB, so users can query the view directly without worrying about the JSONB's internal structure.
With the view generated by virtual widening, users can query JSONB data in the same way as a traditional wide table, with little to no modification of existing query logic. This feature is especially important when replacing a data storage system, such as migrating from an open-source product to Hologres, because it significantly reduces migration costs. During migration, even if the old system stored 400 columns, the new system may need only 60 columns in the main table, with the remaining fields still flexibly managed through JSONB and views, ensuring query continuity and efficiency.
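A minimal sketch of the virtual-widening idea, implemented here in plain Python over JSON strings rather than Hologres JSONB (the field names are invented):

```python
import json

def widen(rows, json_col="properties"):
    """'Virtually widen' a table: scan the JSON column, discover every key
    that appears anywhere, and expose the rows as if they were a wide table.
    Keys missing from a given row surface as None, like NULL columns."""
    keys = sorted({k for r in rows for k in json.loads(r[json_col])})
    view = []
    for r in rows:
        props = json.loads(r[json_col])
        flat = {k: v for k, v in r.items() if k != json_col}
        flat.update({k: props.get(k) for k in keys})
        view.append(flat)
    return view

rows = [
    {"user_id": "u1", "properties": json.dumps({"page": "home"})},
    {"user_id": "u2", "properties": json.dumps({"page": "cart", "coupon": "A5"})},
]
view = widen(rows)
```

The point of doing this as a regenerated view rather than a physical table is that a newly appearing key simply shows up as one more column the next day, with no DDL and no backfill.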
2. The use of Hologres in specific analytics scenarios
- User event analysis
Next, let's look at a few typical scenarios. In event analysis, you can select any time period based on partitions and query by specified event names; the core focus is counting the number of unique users (UVs) per day for these events. Hologres demonstrates strong computing power here, especially since its 2.1 release, which optimized the count-distinct/uniq operator to handle UV calculations over large-scale data (such as billions of records) efficiently. Such queries usually complete in milliseconds, and even at very large data volumes results return within seconds, reflecting the significant optimization of Hologres' query efficiency.
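The underlying aggregation is an ordinary count-distinct grouped by day; a tiny Python equivalent (the sample events are invented):

```python
from collections import defaultdict

def daily_uv(events, event_name):
    """Per-day unique users for one event, i.e. COUNT(DISTINCT user_id)
    grouped by day, which is the aggregation the uniq/count-distinct
    operator evaluates at scale."""
    users_by_day = defaultdict(set)
    for e in events:
        if e["event"] == event_name:
            users_by_day[e["day"]].add(e["user_id"])
    return {day: len(users) for day, users in sorted(users_by_day.items())}

events = [
    {"day": "2024-05-01", "user_id": "u1", "event": "click"},
    {"day": "2024-05-01", "user_id": "u1", "event": "click"},  # repeat visit
    {"day": "2024-05-01", "user_id": "u2", "event": "click"},
    {"day": "2024-05-02", "user_id": "u2", "event": "click"},
]
uv = daily_uv(events, "click")
```

The repeat visit by `u1` is deduplicated into a single UV, which is exactly what makes this aggregation expensive over billions of rows and worth operator-level optimization.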
- Retention analytics
Through retention analysis, you can observe users' login behavior across different time periods (such as day 1, day 2, and day N) to understand user return-visit rates and product stickiness. This analysis is essential for evaluating product performance and guiding optimization decisions. To this end, Hologres provides powerful retention analysis functions that make SQL development easy and efficient while ensuring strong query performance, making it easier for technical teams to run retention analysis and make more accurate product optimization decisions.
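The retention computation itself reduces to cohort membership plus set lookups; a minimal sketch using integer day offsets instead of real dates (sample data invented, and a simplification of what the built-in functions offer):

```python
def retention(login_days_by_user, cohort_day, offsets=(1, 7)):
    """Day-N retention for the cohort of users who first appear on
    cohort_day: of those users, what share also logged in exactly
    N days later?"""
    cohort = {u for u, days in login_days_by_user.items()
              if min(days) == cohort_day}
    if not cohort:
        return {}
    return {
        n: sum(1 for u in cohort if cohort_day + n in login_days_by_user[u])
           / len(cohort)
        for n in offsets
    }

# u1..u3 first appear on day 0; u4 belongs to a later cohort.
logins = {"u1": {0, 1, 7}, "u2": {0, 1}, "u3": {0}, "u4": {2, 3}}
rates = retention(logins, cohort_day=0)
```

Note that `u4` is excluded from the day-0 cohort entirely; mixing cohorts is the classic mistake this structure guards against.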
- Funnel analysis
Funnel analysis is an important case in data analysis for observing conversion rates and behavior patterns across different stages. For example, in an e-commerce scenario, you can analyze the conversion rate from click to order and then to final payment; in the gaming industry, you can analyze the share of users who download a game and then log in to it. Hologres provides comprehensive funnel analysis functions, supporting both global summaries and day-by-day queries, so users can view the funnel over the full data set or broken down by day in a single table. In addition, the Hologres execution engine has been optimized so that the funnel-related functions outperform the previous version, and performance may improve by more than 40% when migrating from the open-source alternatives. These features and performance optimizations make Hologres an ideal tool for efficient funnel analysis.
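The core of windowed funnel analysis, finding the deepest ordered prefix of steps a user completes within a time window, can be sketched as follows (a simplified model in the spirit of window-funnel functions, not the exact semantics of any built-in):

```python
def funnel_depth(events, steps, window):
    """Deepest ordered prefix of `steps` completed by one user, where the
    whole sequence must fit within `window` time units of the first step.
    `events` is a list of (timestamp, event_name) pairs."""
    best = 0
    events = sorted(events)
    for i, (t0, e0) in enumerate(events):
        if e0 != steps[0]:
            continue  # a funnel attempt can only start at the first step
        depth, t_limit = 1, t0 + window
        for t, e in events[i + 1:]:
            if t > t_limit:
                break
            if depth < len(steps) and e == steps[depth]:
                depth += 1
        best = max(best, depth)
    return best

# click -> order happen in order, but pay falls outside the 100-unit window
user_events = [(0, "click"), (50, "order"), (200, "pay")]
depth = funnel_depth(user_events, ["click", "order", "pay"], window=100)
```

Aggregating `depth` over all users, then counting how many reached depth 1, 2, 3, ..., yields the familiar funnel bar chart.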
- Path analysis
The current version of Hologres supports powerful path analysis, which is especially important in traffic scenarios. Path analysis can record the user's access path in detail, including every step of the user's operation after entering the product, which helps to gain insight into user behavior patterns. Through path analysis, you can clearly see the inflow and outflow of each operation, providing refined operational data support for your business. This data not only helps companies understand access to their products, but also guides product iteration and optimization.
- Attribution scenario model
In the advertising and gaming industry, attribution models are an important reference for evaluating ad performance and optimizing delivery strategies. Attribution models determine the effectiveness of an ad network by analyzing the relationship between ad touchpoints, such as clicks, and conversion events (e.g., orders). The diagram below details the key steps in an attribution model, including identifying touchpoint and conversion events, setting a valid conversion period (e.g., 15 or 30 days), associating touchpoints and conversion events based on user IDs, and calculating the weight distribution on the conversion path.
In attribution calculations, there are challenges of large amounts of data and high query overhead, especially when two large tables (such as a touch table and a conversion table) need to be correlated, resulting in a large number of Cartesian product operations and consuming a lot of CPU and memory resources. As a result, the efficiency of attribution calculations is highly dependent on the database system's ability to JOIN.
Hologres has significant advantages in attribution scenarios, such as LOCAL JOIN and Runtime Filter Join, which significantly improve query and JOIN efficiency so that computation remains effective even at large data volumes.
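The association logic itself is easy to state; here is a toy last-click attribution sketch in Python (the channels, IDs, and window are invented, and a real system would push this join down into the warehouse rather than loop in application code):

```python
def last_click_attribution(touches, conversions, window):
    """Credit each conversion to the same user's most recent ad touch that
    occurred within `window` time units before it (last-click model).
    touches: (user_id, ts, channel); conversions: (user_id, ts, order_id)."""
    credited = {}
    for user, conv_ts, order_id in conversions:
        candidates = [
            (ts, channel) for u, ts, channel in touches
            if u == user and 0 <= conv_ts - ts <= window
        ]
        if candidates:
            credited[order_id] = max(candidates)[1]  # latest touch wins
    return credited

touches = [("u1", 10, "search_ad"), ("u1", 40, "feed_ad"), ("u2", 5, "feed_ad")]
conversions = [("u1", 50, "order_1"), ("u2", 500, "order_2")]
result = last_click_attribution(touches, conversions, window=100)
```

The inner candidate scan is exactly the Cartesian-product hazard described above: every conversion potentially compares against every touch for that user, which is why JOIN strategies such as LOCAL JOIN and runtime filters dominate the cost at scale.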
- Single-table attribution ideas
In the practice of single-table attribution, click data and conversion data are both stored in the Hologres data warehouse. This strategy realizes data correlation and attribution-path identification through the data processing capabilities of the warehouse itself. To achieve efficient correlation and attribution calculations, it is important to take full advantage of Hologres' indexing and LOCAL JOIN capabilities, which quickly match relevant data; window calculations and sorting are then performed on the matches to determine the attribution path.
- Multi-table attribution ideas
On top of Hologres, you can analyze by joining two tables, and you can also perform multi-table joins and multi-table attribution. Given the large data volumes in advertising, the powerful real-time processing engine Flink is combined with Hologres for real-time attribution.
In addition to supporting OLAP queries, Hologres also has efficient online analysis capabilities, especially when handling high-QPS point lookups.
- Long-period UV calculations
In long-period UV calculations, there are two common processing modes. The first involves large-scale datasets: when the data volume reaches tens of millions or billions of rows, it is often necessary to join two tables, i.e., a fact table with a dimension table, to calculate a specific UV value. In this scenario, the large data volume and high concurrency requirements (QPS greater than 10) place high demands on the system's performance and processing power.
In the second mode, for scenarios with a small amount of daily data, a single-table query is used to calculate UVs based on the event table. Although the QPS may also be high in this scenario (e.g., more than 10 or 20), due to the relatively small amount of data, the system can optimize the query performance to meet the computing demand.
The real-time RoaringBitmap implementation steps are shown in the following figure:
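The figure itself is not reproduced here, but the core idea can be sketched with plain Python sets standing in for roaring bitmaps (a production pipeline would use a compressed bitmap type and the warehouse's bitmap aggregation functions instead):

```python
# Step 1: map string user ids to dense integers (the mapping table),
# since bitmaps index by integer position.
uid_map = {}
def uid(user):
    return uid_map.setdefault(user, len(uid_map))

# Step 2: maintain one bitmap per day holding that day's active user ints.
day_bitmaps = {
    "d1": {uid("u1"), uid("u2")},
    "d2": {uid("u2"), uid("u3")},
    "d3": {uid("u1")},
}

# Step 3: UV over any period is the cardinality of the union of the daily
# bitmaps; no rescan of the raw detail rows is needed.
def period_uv(days):
    acc = set()
    for d in days:
        acc |= day_bitmaps[d]
    return len(acc)
```

Because the union is over small pre-aggregated structures rather than raw events, an arbitrary date range (last 7 days, last 30 days, any ad-hoc window) costs roughly the same regardless of how many detail rows each day contains.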
The challenge of long-term retention in the gaming industry: you need to calculate long-term retention, i.e., login retention for every day from the date a character is created, over periods as long as a year. This computation involves bitmap intersections between any two days of a long period, which is computationally complex, and the traditional CASE WHEN approach leads to lengthy, complex SQL code that hurts both development and computing efficiency.
Hologres' JSON function optimization: Hologres' JSON functions are used to optimize this computation, by first converting the day offsets (ndays) and metric values into JSON-formatted data, and then using Hologres' JSONB handling functions on the outer layer to expand and compute.
It is recommended that developers in the game industry pay attention to and try this optimization method to improve data processing efficiency and development efficiency.
When processing wide tables, data volume inflation caused by joins is a common problem that increases computational difficulty and seriously hurts query efficiency. To address this, we can embed fine-grained data, such as product or brand information, into the wide table as arrays, avoiding unnecessary row inflation. This approach preserves data integrity while significantly reducing redundancy. In addition, with the complex data types (such as arrays) provided by Hologres and its efficient array bitmap indexing, we can perform fast intersection, union, and difference operations on arrays, greatly improving query performance; the estimated improvement can exceed 80%. This optimization strategy provides a practical solution to the data inflation and query bottlenecks encountered with wide tables, demonstrating the strength and flexibility of Hologres in big data processing.
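A toy sketch of the array-embedding idea (the order and brand data are invented): instead of joining the order table to a detail table and inflating one order into many rows, each wide row carries its brands as an array, and filtering becomes a set intersection:

```python
# Each wide-table row embeds its item brands as an array-valued column,
# so no order-to-item join (and no row multiplication) is needed.
orders = [
    {"order_id": 1, "user_id": "u1", "brands": {"nike", "apple"}},
    {"order_id": 2, "user_id": "u2", "brands": {"apple"}},
    {"order_id": 3, "user_id": "u1", "brands": {"sony"}},
]

def orders_touching(brands):
    """Orders whose embedded brand array intersects the given brand set;
    this membership test is what an array/bitmap index would accelerate."""
    target = set(brands)
    return [o["order_id"] for o in orders if o["brands"] & target]

hits = orders_touching({"apple"})
```

Had the brands lived in a separate item table, order 1 would occupy two joined rows and every downstream aggregate would need deduplication; the embedded array keeps one row per order.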
If you are interested in Hologres, you can find it on the Alibaba Cloud official website and try it for free.
That's all for this sharing, thank you.