
Presto in Zhihu's cache acceleration practice

Author: Flash Gene
This article covers the testing and selection of PrestoDB versus Trino, query acceleration of PrestoDB with Alluxio, and the deployment adjustments and cache acceleration practice of Presto at Zhihu.

1 Background

Presto, as a typical OLAP engine with separated storage and compute, has seen notable success in the big data field. Although the storage-compute-separated deployment model has clear advantages in resource scalability, it also brings a series of technical challenges, among which the speed and cost of data access are particularly prominent.

Challenges of storage-compute separation

In such an architecture, Presto provides rich and flexible connectors that make it easy to query heterogeneous data sources. However, without a dedicated storage system, low-latency data queries are often hard to guarantee. In addition, frequent communication with external storage amplifies network and storage instability, which can affect the stability of data access.

History

Zhihu's Presto has gone through several stages: PrestoDB -> PrestoSQL -> Trino. We now need to re-examine and adjust the current Presto architecture to meet Zhihu's growing and increasingly complex analysis needs, mainly by improving query performance in reporting scenarios, including but not limited to ad hoc queries and exploratory analysis.

2 Architecture

2.1 Architecture positioning

Presto is an open-source, distributed OLAP SQL query engine, originally developed at Facebook starting in 2012 and later open-sourced. Thanks to its rich connectors, it supports not only non-relational data sources such as the Hadoop Distributed File System (HDFS/Hive), S3, and Cassandra, but also relational data sources such as MySQL, PostgreSQL, and Oracle. It provides standard ANSI SQL semantics across these heterogeneous data sources, including complex queries, aggregations, joins, left/right outer joins, subqueries, and window functions.

The architecture of Presto resembles a classic massively parallel processing (MPP) database management system: a coordinator node plus multiple workers. The coordinator receives query requests from clients and parses, optimizes, and schedules them. Tasks are then distributed across the worker nodes, with all processing taking place in memory and data flowing through the cluster in pipelined fashion, avoiding unnecessary I/O overhead.

In the big data OLAP world, the main criteria for evaluating an engine are data volume, flexibility, and performance, and an engine can rarely excel at all of them at once. Judging from the architecture described above, Presto is best positioned for scenarios with small-to-medium data volumes, high flexibility requirements, and moderate performance needs, where it can deliver its full value.

Therefore, in practical business scenarios, Presto is especially suitable for fail-fast workloads such as interactive analysis, report queries, and data exploration. It is not suitable for scenarios requiring high fault tolerance, such as large-scale ETL, data synchronization, data backtracking, and machine-learning data modeling. And because the primary task of Presto at Zhihu is to accelerate Hive queries, we also do not recommend it for high-concurrency query scenarios.

In the end, we positioned Presto as a "read-only OLAP engine" in Zhihu Big Data, and deployed it under the computing domain architecture of big data. For other read/write scenarios, Hive or Spark is recommended, or other engines can be selected according to business needs.

Note: The figure below only shows the computing domain and omits the other parts.

[Figure: Presto positioned as a read-only OLAP engine within the big data computing domain]

2.2 Deployment Method

Presto's deployment on Zhihu has gone through two iterations:

  • Physically isolated resources.
  • Unified gateway routing over shared resources.

2.2.1 Physical isolation of resources

In the early PrestoDB and PrestoSQL days, the system was deployed mainly with physically isolated resources, because we lacked both efficient infrastructure O&M tooling and a gateway-based load balancing mechanism for routing across resources.

Although this approach was convenient for business users, it led to a series of problems: higher maintenance complexity, uneven resource distribution, higher change costs, and wasted resources.

2.2.2 Unified gateway routing mode for resources

To optimize resource management and simplify user access, we have introduced presto-gateway at the cluster entrance.

With this project, we consolidated the independent domain names of the original N clusters into a single unified domain name, which not only facilitated subsequent AB testing, grayscale upgrades, and user-specific routing, but also significantly reduced resource waste.

In terms of basic O&M strategy, we integrated Terraform to transform traditional manual O&M work into a more logical software development process.

With Terraform's idempotent state management design, all operations can be done with a single command after code review. While this shift has slightly increased initial development costs, in the long run, it has saved us a significant amount of ongoing maintenance costs and reduced maintenance complexity by 99%.

It is important to emphasize that the "single command" mentioned here is not just some kind of simplified one-click operation, but is developed following Infrastructure as Code (IaC), which means that all deployments and subsequent changes can be done efficiently through this command.


3 Objectives

Without increasing the cost of the machine:

  • Query time shows a steady downward trend.
  • Small and medium report query scenarios (P50) improve by at least 50% over the previous baseline.

4 Cache Mode Selection

Our hard constraint is that machine cost must not increase. Under this constraint, there are two candidate deployment modes:

  • A standalone Alluxio cluster, which can provide unified acceleration for OLAP engines, not limited to Presto but also covering Hive and Spark, ultimately offering a unified data orchestration service.
  • RaptorX, PrestoDB's built-in cache acceleration feature, available simply by turning on the configuration.

4.1 Standalone Alluxio Cluster Mode

Alluxio: What it is

Alluxio is an open-source, distributed, hyperscale data orchestration system designed to accelerate data access for distributed computing workloads. It provides an efficient data access layer between compute frameworks (such as Presto and Apache Spark) and storage systems (such as HDFS, S3, and Azure Blob), enabling fast data sharing between them.

4.2 RaptorX

PrestoDB RaptorX: Cache Acceleration

RaptorX's predecessor, Raptor, is a Presto connector (presto-raptor) designed to support some key interactive query scenarios at Meta (formerly Facebook). PrestoDB's storage-compute-separated architecture is very flexible but struggles to guarantee low latency, especially when network and storage latency problems are magnified. To address this, Raptor was designed as a shared-nothing storage engine for PrestoDB.

Note that RaptorX and Raptor are two different projects. Raptor can be understood as a separate storage cluster, while RaptorX is a multi-level caching feature that provides automatic caching for metadata, file lists, raw data, and so on.

By implementing its own storage, Presto is also taking a step forward on the path of a low-latency, high-throughput OLAP engine. At the same time, this architectural change also brings the following problems:

  • Uncontrollable resource utilization: the size of a Raptor cluster depends on how much data it must store, and because data is coupled to compute, storage growth drives cluster sizing, creating new challenges for resource usage.
  • High engineering and operational costs: while Raptor's own storage benefits data-warehouse-style workloads, it means maintaining an additional storage system dedicated to Presto.
  • Potential security issues: As security and privacy requirements increase, it becomes even more important to achieve a unified security and privacy policy. The use of a separate storage engine makes it extremely difficult and fragile to enforce such policies.

So, starting in 2019, Meta engineers began rethinking Raptor's future: is it possible to get the benefits of local storage without the cost increase of a tightly coupled storage-compute architecture? The direction they settled on was to add a new local caching layer on top of the native data warehouse, in a project named RaptorX (details: https://prestodb.io/blog/2021/02/04/raptorx).

Technically, the RaptorX project has nothing to do with Raptor. The main difference is the use of local storage as a cache rather than a separate storage system, which provides the following benefits:

  • Presto eliminates the need to manage the data lifecycle;
  • Data loss due to a single worker failure has little impact on query performance;
  • Caching, as a feature of the file system layer, is part of the presto-hive connector and operates in the same way as a regular Presto cluster, reducing operational costs.

PrestoDB RaptorX: Features

  • Raw data caching: caches the raw data required by queries, with two cache modes, FILE_MERGE and ALLUXIO;
    • FILE_MERGE uses local storage directly.
    • ALLUXIO uses an embedded Alluxio, commonly called Alluxio Local Cache.
  • Metadata caching: Caching information such as Hive tables/partitions in the Coordinator;
  • File List Cache: Caches a list of files from the remote storage partition directory;
  • Fragment (shard) result caching: caches partial computation results, such as COUNT DISTINCT, on the worker node's local storage;
  • Footer caching: caches ORC and Parquet footer information in the worker's memory, since footers are read frequently;
  • Affinity scheduling: creates affinity between queries and caches by routing splits to the nodes that hold the cached data, based on hash-modulo or consistent hashing, thereby improving the cache hit ratio.
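The hash-modulo flavor of affinity scheduling can be sketched in a few lines. This is an illustrative Python sketch, not PrestoDB's actual scheduler code; the worker names and path are made up:

```python
import hashlib

def pick_worker(split_path: str, workers: list) -> str:
    """Hash-modulo affinity: the same file split always maps to the same
    worker, so that worker's local cache can be reused across queries."""
    digest = hashlib.md5(split_path.encode("utf-8")).hexdigest()
    return workers[int(digest, 16) % len(workers)]

workers = ["worker-0", "worker-1", "worker-2"]
path = "hdfs://ns/warehouse/orders/part-00000.parquet"
# The mapping is deterministic, so repeated queries over the same split
# land on the worker that already cached it.
assert pick_worker(path, workers) == pick_worker(path, workers)
```

The trade-off versus consistent hashing is that plain modulo remaps most splits whenever the worker set changes, which is why SOFT_AFFINITY treats the mapping as a hint rather than a hard constraint.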

4.3 Choosing a Conclusion

While a standalone Alluxio cluster has several advantages, integrating one into our existing environment would pose costly operational challenges around networking, disks, and replicas, given that Presto currently runs in Kubernetes (K8s). To avoid over-investing resources in these areas at this initial stage, we chose the more cost-effective option: PrestoDB RaptorX.

In the short term, simply adjusting configuration is undoubtedly more cost-effective than independently maintaining a complete cluster. In the long run, however, a standalone Alluxio cluster would bring significant advantages in the big data OLAP space.

5 OLAP performance test

5.1 Status

From earlier discussions of the RaptorX project and related case studies in the community, it is clear that Alluxio (standalone or embedded) can be used to exploit idle server resources and improve query speed.

After adjusting Presto's cluster deployment strategy, we found several underutilized resources. These gave us a good opportunity to address some legacy issues from the early days of the service.

Thanks to the optimization of the cluster deployment strategy, we now have the ability to quickly deploy and switch between PrestoDB and Trino, and can also deploy heterogeneously.

Based on this background, we decided to conduct an in-depth comparative test again to re-evaluate the performance of PrestoDB and Trino in OLAP performance in order to provide better data query services for our users.

5.2 TPC-DS Testing

5.2.1 Versions

The versions under test were PrestoDB 0.280 and Trino 416 (the latest releases as of May 2023).

5.2.2 Cluster size

  • 96C/192G cloud hosts;
  • 500 GB TPC-DS dataset, generated with spark-sql-perf and tpcds-kit;
  • Generated data uses snappy compression, in both Parquet and ORC formats;
  • Trino ran on JDK 17, PrestoDB on JDK 11;
  • Two cluster sizes were deployed independently, 1 coordinator + 3 workers and 1 coordinator + 6 workers, with 64 GB worker heap and other parameters kept as close to production as possible;
  • Tests were driven by the native PrestoDB and Trino Java clients;
  • Test cases are the tpcds queries provided in the source code.

5.2.3 TPC-DS Test Results

We extracted 20 SQL statements from the TPC-DS suite, aligned the configurations, and ran multiple rounds of tests.

[Figure: TPC-DS comparison results]

5.3 Business Testing

5.3.1 Test scale of production business

In the production environment, thanks to the convenience of the gateway and multi-cluster deployment, we took the resources of one sub-cluster, deployed PrestoDB on it, and ran comparative tests during off-peak hours.

  • K8s containerized deployment;
  • 1 coordinator + 64 workers, with 64 GB worker heap;
  • Other parameters aligned with Trino.

5.3.2 Production AB Test Results

The test cases were 20 SQL statements randomly sampled from the business, covering different scenarios including reporting, analysis, and mining, and thus closely reflecting real workloads.

[Figure: production AB test results]

5.4 PrestoDB Grayscale

We then ran multiple rounds of AB tests in the production environment, allocating 20%, 40%, and 50% of resources to the tests.

The tests showed an average 1.5x to 2x performance improvement in P50 scenarios, and only a small improvement in P90 scenarios.

[Figure: grayscale AB test results]

5.5 PrestoDB Production Optimization

5.5.1 Why Trino performed better on TPC-DS

Trino's community development has absorbed feedback from many users, and many details such as ease of use have landed earlier. In this test we found the following areas where Trino is ahead of PrestoDB.

5.5.1.1 Dynamic filtering

Trino optimizes some TPC-DS scenarios with dynamic filtering, and when a query hits one of these scenarios the effect is especially obvious.

PrestoDB also has this feature, but it is understandable that Trino has invested more development effort and leads here. However, this scenario places heavy demands on data warehouse design, and in our actual business tests it was rarely hit.

For details, see: https://trino.io/docs/current/admin/dynamic-filtering.html

5.5.1.2 Benefiting from Spark statistics

Trino can take advantage of the statistics produced by Spark ETL jobs.

These statistics are written directly into the Hive table's DDL information by Spark ETL after the data is produced, as shown in the following figure:

[Figure: Spark-written statistics in the Hive table DDL]

Because Trino can exploit these statistics, it runs faster than PrestoDB; in our tests, after removing the statistics, the query times of the two engines returned to parity.

For details, see: https://github.com/trinodb/trino/pull/16120

5.5.1.3 Refer to Trino's optimization parameters

Comparison shows that many of Trino's default parameters are more mature than PrestoDB's; features that are still experimental parameters in PrestoDB have been iterated for several rounds in Trino and officially released. So in our tests, many PrestoDB parameters were tuned with reference to Trino.

The main parameters are as follows:

  • Pushdown: hive.pushdown_filter_enabled=false, hive.parquet_pushdown_filter_enabled=true
  • Pushdown: hive.enable-parquet-dereference-pushdown=true
  • Pushdown: hive.parquet.pushdown-filter-enabled=true
  • Dynamic filtering: experimental.enable-dynamic-filtering=true
  • Query: query.max-stage-count=150
  • Partitions: hive.ignore-unreadable-partition=true

5.5.2 PrestoDB RaptorX Final Adjustments

After several rounds of testing the PrestoDB RaptorX project, we finally enabled the following features:

  • Enable soft affinity: hive.node-selection-strategy=SOFT_AFFINITY;
  • Enable the raw data cache in ALLUXIO mode, disable version validation, disable cache.default-cache-quota, and set only cache.alluxio.max-cache-size;
  • Enable Hive metadata caching, using hive.metastore-cache-ttl to control cache refresh;
  • Enable fragment (shard) result caching;
  • Because the raw data cache uses ALLUXIO, the file list cache and footer cache are turned off.
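Put together, the hive catalog configuration implied by these choices might look roughly like the sketch below. The SOFT_AFFINITY, cache.default-cache-quota, cache.alluxio.max-cache-size, and hive.metastore-cache-ttl names come from the choices above; cache.enabled, cache.type, and cache.base-directory are our recollection of the RaptorX property names, and all values are illustrative, so verify them against the documentation for your PrestoDB version:

```properties
# catalog/hive.properties (sketch; values are illustrative)

# Soft affinity scheduling, so splits are routed to workers holding their cache
hive.node-selection-strategy=SOFT_AFFINITY

# Raw data cache in embedded-Alluxio mode
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///mnt/presto-cache
cache.alluxio.max-cache-size=800GB

# Metadata cache, refreshed via TTL
hive.metastore-cache-ttl=2h
```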

5.5.3 Select Conclusion

Trino is finely polished in many details, its default parameters are better than PrestoDB's, and its basic integration code is very user-friendly, for example in company authentication, resource queue management, and client compatibility. However, Trino does not currently support the Alluxio Local Cache feature (i.e., embedded caching), and given our need to go live quickly, changing configuration is clearly more cost-effective than building a new Alluxio cluster.

For these reasons, we decided to use PrestoDB instead of Trino. Thanks to the convenience of our basic O&M, the switch only required replacing the relevant image during deployment.

To be clear, this doesn't mean that Trino can't be used with Alluxio, just that Alluxio's Local Cache feature is currently limited to running in PrestoDB.

Note: Trino support for Alluxio Local Cache is already in progress.

6 Challenges and Problems

With the convenience of PrestoDB RaptorX, we completed the architecture iteration and upgrade of the Presto cluster in a very short time, and encountered the following representative problems along the way:

6.1 Caching Custom Rules

6.1.1 Background

While enabling RaptorX in PrestoDB can significantly improve query speed, the raw data cache configuration still leaves room for flexibility. Currently, the main cache configuration is limited to cache type, cache size, and Shadow Cache hit detection, and does not give users more fine-grained control.

The Alluxio community provides a CacheFilter interface that allows users to customize cache filtering rules to achieve custom control over the cache.

For example, in specific business scenarios such as report generation, ad hoc queries, data mining, and multi-dimensional analysis, poor cache management can lead to competition for cache resources and frequent cache churn. To solve this, we developed a set of custom filtering rules based on path-prefix matching, built on the CacheFilter interface.
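The shape of such a rule is simple. The real CacheFilter is a Java interface on the Alluxio side; the Python sketch below only illustrates the path-prefix idea, and the paths and database names are made up:

```python
# Whitelist of path prefixes whose data is worth caching
# (hypothetical paths for illustration).
CACHE_PREFIXES = [
    "hdfs://ns/warehouse/report_db",
    "hdfs://ns/warehouse/dim_db/dim_user",
]

def should_cache(path: str) -> bool:
    """Cache a split only if its path starts with a whitelisted prefix."""
    return any(path.startswith(p) for p in CACHE_PREFIXES)

assert should_cache("hdfs://ns/warehouse/report_db/t1/part-0.parquet")
assert not should_cache("hdfs://ns/warehouse/tmp_db/t2/part-0.parquet")
```

Because Hive table paths usually embed the database and table name, prefix rules translate directly into "cache these databases/tables" policies.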

Even with custom cache control, the partition range of queries remains variable and cache space is always limited; we cannot preload all data that might be queried. How, then, do we develop an effective caching strategy in the face of this challenge?

6.1.2 Cache Design Goals

In this cluster, a total of 43 TB of space is available for caching.

While using as much of the cache space as possible, the cache hit ratio should reach at least 60%.

6.1.3 Design Ideas

In order to ensure maximum revenue from caching with limited resources, we opted for a strategy based on ROI (return on investment) to filter database tables. This strategy is designed to ensure that cost-effective databases and tables are cached first, and dynamically adjust the caching strategy based on the actual cache hits to better meet the actual query needs of users.

We analyzed platform data and related statistics in depth and identified the following high-frequency data query ranges:

[Figure: high-frequency data query range statistics]

Based on these analyses, we broke the caching policy granularity into data updated within 1, 30, 60, and 90 days. Further exploration showed that users most frequently query data updated within the last 30 and 90 days. Given the practical limits of cache space, we tentatively decided to prioritize caching data updated within 30 days.

The process is roughly as follows:

  1. Obtain the access popularity of databases and tables from audit logs.
  2. Combine it with each table's average daily data increment to compute an ROI value.
  3. Accumulate over windows according to the cache granularity, such as 30 or 90 days.
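The steps above can be sketched as a greedy selection by ROI. This is an illustrative sketch of the idea, not Zhihu's actual scoring code; the ROI definition (queries served per GB cached) and all numbers are our own assumptions:

```python
from dataclasses import dataclass

@dataclass
class TableStats:
    name: str
    daily_queries: int          # popularity from audit logs
    daily_increment_gb: float   # average daily data growth

def roi(t: TableStats, window_days: int) -> float:
    """Queries served per GB cached over the window: higher is better
    (an assumed ROI definition for illustration)."""
    cached_gb = t.daily_increment_gb * window_days
    return (t.daily_queries * window_days) / cached_gb if cached_gb else 0.0

def pick_tables(tables, window_days, budget_gb):
    """Greedily take tables by descending ROI until the cache budget is spent."""
    chosen, used = [], 0.0
    for t in sorted(tables, key=lambda t: roi(t, window_days), reverse=True):
        size = t.daily_increment_gb * window_days
        if used + size <= budget_gb:
            chosen.append(t.name)
            used += size
    return chosen
```

For example, a hot report table queried 1000 times a day with 1 GB daily growth easily beats a cold table queried 10 times a day with 10 GB growth, so it is cached first.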

Using this method, we select the cache ranges for databases and tables by ROI ranking and feed them into the cache rule configuration. In practice, however, we realized that the selection cannot depend entirely on the numbers and must be combined with practical experience. To ensure continuous cache optimization, we also intend to introduce real-time monitoring to verify caching effectiveness.

6.1.4 Validation of Cache

After defining custom filter rules with the above method, you can use the default monitoring provided by the Alluxio community to adjust the rules and dynamically select the cache range.

For reference, please refer to the following:

[Figure: Alluxio cache monitoring reference]

6.2 Cache hit ratio issues

6.2.1 Background

After turning on caching, we hit a new problem. Because the presto-gateway routing strategy is random-only, queries may be routed to clusters without cached data, or, once a cluster has cached the data, subsequent related queries may never reach that cluster. As a result, the actual cache hit ratio was low, cached data churned in and out heavily, and the acceleration effect was unsatisfactory.

A SQL statement that benefits from cache acceleration had to be run about 25 to 30 times before every sub-cluster had it cached and it returned in seconds.
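That order of magnitude is consistent with the classic coupon-collector estimate: under purely random routing over n sub-clusters, warming all of them takes about n times the n-th harmonic number queries on average. The cluster count below is an assumed figure for illustration, not a number from the text:

```python
def expected_queries_to_warm_all(n: int) -> float:
    """Coupon-collector expectation: with purely random routing over n
    sub-clusters, warming every cluster's cache takes
    n * (1 + 1/2 + ... + 1/n) queries on average."""
    return n * sum(1.0 / i for i in range(1, n + 1))

# Assuming 10 sub-clusters (an illustrative figure), a query must recur
# about 29 times before every sub-cluster has cached its data, which is
# in the same ballpark as the observed 25 ~ 30.
print(round(expected_queries_to_warm_all(10), 1))  # -> 29.3
```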

At the same time, the statistical method behind Alluxio Local Cache's monitoring metrics easily causes confusion and distorts the judgment of the cache hit ratio: one may mistakenly believe the hit ratio is high while actual queries remain very slow.

6.2.2 Identify the problem

  • Monitoring metrics: in the current Alluxio Local Cache mode, the hit ratio is measured on the worker side as a ratio of cumulative values, which obscures real cache hit behavior, lacks actual counts, and has no time-window granularity.
  • Routing policy: presto-gateway currently supports only one routing policy, random routing.
  • Cache warm-up: when PrestoDB receives a query with no cached data, it is slower than Trino; once the cache is hit, most small and medium queries return in seconds.

Our preliminary judgment: the main problem lies in the presto-gateway + multi-cluster mode; the secondary problem is that the cache lags in perceiving data and has no warm-up. In addition, although the hit-ratio metrics issue does not directly affect speed, it misleads maintenance and prevents accurate judgment of the current cache situation.

The cache warm-up problem is expensive to develop, so it was not included in this round of optimization. For now, only the monitoring metrics and routing policy were adjusted.

6.2.3 Monitoring Indicator Issues

After a period of maintenance, we concluded that the ideal monitoring metrics for PrestoDB's multi-level cache are:

  • SQL-level cache hit ratio (for example, whether a SQL statement hit the cache at all, or what proportion of the operators in the statement used cache acceleration);
  • raw data cache hit ratio;
  • sharding result cache hit ratio;
  • metadata cache hit ratio;
  • The amount of cache data swapped in and out.

Referring to the presto-cache module, we added instrumentation points to the monitoring metrics to report hit and miss counts, allowing statistics to be computed over time windows.
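The difference between cumulative and windowed statistics is easy to see in a sketch. This illustrative Python class (not the actual presto-cache instrumentation) keeps only the most recent lookups:

```python
from collections import deque

class WindowedHitRatio:
    """Hit ratio over a sliding window of recent cache lookups, rather than
    the lifetime-cumulative ratio that hides recent degradation."""

    def __init__(self, window: int = 1000):
        self.events = deque(maxlen=window)  # True = hit, False = miss

    def record(self, hit: bool) -> None:
        self.events.append(hit)

    def ratio(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

m = WindowedHitRatio(window=4)
for hit in [True, True, True, True, False, False, False, False]:
    m.record(hit)
# A lifetime-cumulative counter would still report 50%,
# but the recent window correctly shows 0%.
assert m.ratio() == 0.0
```

This is exactly the distortion described below: a cumulative ratio keeps reporting old successes long after the cache has stopped hitting.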

After the change, we found that the real cache hit ratio is a value that fluctuates within a time window, not a long-term trend value. We also found that the actual hit ratio was lower than previously reported.

In the early days, using the metrics in Alluxio Local Cache, the hit ratio averaged 60% to 70% or even higher.

[Figure: hit ratio as reported by Alluxio Local Cache metrics]

However, after re-measuring, the actual hit ratio was only about 30%.

[Figure: actual hit ratio after re-measurement]

6.2.4 Routing Policy Issues

Because the current gateway routing strategy is rudimentary (random routing only), we needed to optimize it, with the ultimate goal of sending queries over similar databases and tables to the same cluster.

This raises the following new questions:

  • Routing policy: by what rule should a query be pinned to a particular cluster?
  • Resource balancing: queries vary greatly in size, so how do we balance load and avoid hot clusters? And what if a rule routes a query to a cluster whose resources are already full?

Breaking this down, we concluded that we should hash on some condition and route accordingly, so that the same query always lands on a particular cluster. This is then combined with the load state of the clusters: if the target cluster is under resource pressure, the query moves along a fixed routing path that eventually loops back to the initial route. That way, if every cluster has been checked once and none can accept the query, it falls back to the original hash-routed cluster, with no impact on the other clusters.

Adhering to the principle of low-cost and fast verification, we finally chose the following:

  • Routing policy: sort all table names referenced in the query, hash the result, and take the modulo to select a cluster.
  • Resource balancing: switch clusters on overload (load-based failover), tentatively capped at one full round of polling.
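The two rules combine into a short routing function. This is an illustrative Python sketch of the policy described above, not the presto-gateway implementation; cluster and table names are made up:

```python
import hashlib

def route(table_names, clusters, overloaded):
    """Sort the tables referenced by the query and hash them to pick a home
    cluster; if that cluster is overloaded, poll forward at most one full
    round, then fall back to the home cluster."""
    key = ",".join(sorted(table_names))
    home = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % len(clusters)
    for step in range(len(clusters)):
        candidate = (home + step) % len(clusters)
        if clusters[candidate] not in overloaded:
            return clusters[candidate]
    return clusters[home]  # every cluster is busy: fall back to the hash target

clusters = ["presto-0", "presto-1", "presto-2"]
# Table order in the SQL text does not change the route,
# because the names are sorted before hashing.
assert route(["db.a", "db.b"], clusters, set()) == route(["db.b", "db.a"], clusters, set())
```

Sorting before hashing is the key detail: it makes the route a function of the query's table set rather than of the SQL text, so all queries over the same tables share one cluster and its cache.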

The most visible result after launch was that the fragment (shard) result cache hit ratio rose from 5% to 30%, proving that a large number of queries were being routed to the sub-clusters holding their cached data.

[Figure: shard result cache hit ratio after the routing change]

6.3 ORC/Parquet Footer Caching Trade-offs

6.3.1 Background

After going live for a while, we frequently encountered the following errors, which caused queries to fail:

com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://hdfs01:8020/***/***/***/b7f59389-6237-44a1-995d-6ffbbd72bf51.parquet (offset=0, length=8859277): Unexpected error in parquet metadata reading after cache miss
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:355)
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:555)
	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:431)
Query 20230720_054938_01149_ccwvt failed: Error reading from hdfs://hdfs01:8020/***/date=2023-03-15/hour=14/000316_0 at position 10929617.           

6.3.2 ORC/Parquet Footer Cache Adjustment

We found that when Alluxio Local Cache is enabled, footer metadata is also cached as part of the raw data.

Therefore, those databases and tables that are updated frequently are particularly prone to this problem.

We eventually turned this setting off, and the issue was resolved.

6.4 PrestoDB lags behind some of Trino's optimization points

In migrating from Trino to PrestoDB, we found several integration points that are particularly important when interfacing with an OLAP engine:

Note: only the two most critical integration points are listed here.

  1. Account Security Authentication Integration.
  2. Audit log integration.

In our use of Trino, the development cost of integrating these two features was relatively low. The Trino community anticipated that many companies use less stringent authentication on internal networks (users log in with just a username and password, without additional credentials), so Trino ships with a built-in internal-network authentication mode that makes simple authentication integration extremely easy. PrestoDB offers nothing similar, which forced us to port a lot of code from Trino to meet this need.

For audit logs, we use the database and table information and queue information automatically parsed by the engine for in-depth analysis such as lineage analysis, database and table behavior evaluation, and saturation analysis. PrestoDB has gaps here as well, so we again referred to the relevant Trino features and ported a good deal of code.

In general, although PrestoDB has unique advantages in some respects, in real production scenarios Trino clearly shows more user-minded investment in basic, easy-to-use integration features.

6.5 Varchar type Chinese matching

Testing found that in PrestoDB, varchar fields cannot be compared directly against Chinese string literals; one must wrap the comparison with functions such as trim or like.

[Figure: example of the failed Chinese varchar comparison]

This is a serious bug in PrestoDB, and we can refer to Trino for fixing.

For details, see: https://github.com/trinodb/trino/issues/3517

6.6 UDF functions are missing

Compared with Trino, PrestoDB lacks the following UDF functions, which can be ported by referring to Trino's implementations:

  • parse_data_size;
  • concat_ws;
  • format;
  • format_number.

6.7 Stability and Configuration Adjustments

PrestoDB RaptorX's multi-level caching uses more memory than Trino.

For example, the memory mode makes heavy use of com.google.common.cache.LoadingCache, so more memory must be configured, otherwise OOMs occur easily.

[Figure: LoadingCache memory usage]

6.8 PrestoDB Hadoop 3 Support

Zhihu's Hadoop has been upgraded to 3.x, and erasure coding (EC) has been enabled at scale.

Because Trino iterates quickly, it upgraded its Hadoop dependency early and supports EC.

PrestoDB lags behind here, so the following error occurs in use:

com.facebook.presto.spi.PrestoException: Error opening Hive split hdfs://hdfs01:8020/***/***/***/000019_0 (offset=100663296,length=33554432): Unexpected error in parquet metadata reading after cache miss
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createParquetPageSource(ParquetPageSourceFactory.java:355)
	at com.facebook.presto.hive.parquet.ParquetPageSourceFactory.createPageSource(ParquetPageSourceFactory.java:555)
	at com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:431)           

The EC status of the file:

[Figure: EC status of the file]

Because PrestoDB still uses Hadoop 2 packages, it cannot read EC files.

[Figure: PrestoDB's Hadoop 2 dependency]

We fixed this by referring to the 3.2.x branch of https://github.com/prestodb/presto-hadoop-apache2 and the code of Trino's https://github.com/trinodb/trino-hadoop-apache, modifying the code accordingly; after repackaging and redeployment, the queries completed successfully.

Note: this method applies to all packages in the project that depend on Hadoop, and they can likewise be replaced as needed.

7 Summary

Compared with the previous deployment, performance for small and medium report-style queries has improved by at least 50%. Most daily queries at the P50 are small report-style SQL, and the majority now complete within 5 seconds.

From the data platform's monitoring, overall query time is trending down.

The P50 latency shows the same downward trend.

With Alluxio Local Cache, we have not only improved query performance, but also reduced bandwidth and I/O pressure on remote storage. By intelligently caching hotspot data to local storage, the need for remote data reads is greatly reduced, further reducing operational costs.

This practice also shows that sensible use and configuration of hardware resources play an indispensable role in big data query optimization: by correctly identifying and exploiting the potential of existing resources, significant performance gains can be achieved without additional hardware investment.

Even more gratifying, the user-experience improvement in real business applications is obvious. For scenarios such as daily report queries and data analysis, shorter response times let users explore data and make decisions more efficiently, improving overall productivity.

We also hope to provide some inspiration and reference for other enterprises in big data query optimization. In the future, we will continue to conduct in-depth exploration and research, and strive to further optimize query performance in more scenarios and conditions, so as to provide more stable and efficient support for our business.

Finally, and most importantly, all of these speedups were achieved on ordinary SSDs. With NVMe prices steadily falling, moving the cache entirely to NVMe is within easy reach; at that point, Alluxio Local Cache will perform even more impressively, and Presto will deliver even more value.

8 Future Prospects

8.1 Continuously optimize the cache hit ratio

In the current environment, a deep understanding of query behavior, plus tuning and pre-warming of caching rules, is the prerequisite for further acceleration at this stage.

In the future, we will introduce richer cache analysis, along with more convenient mechanisms for updating and applying cache rules, to optimize continuously.
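To make "cache analysis" concrete, one simple approach is to replay a block-access trace from query logs against a simulated cache and measure the hit ratio for different capacities. This is a toy LRU sketch, not Alluxio's actual eviction or admission policy:

```python
from collections import OrderedDict

def simulate_lru_hit_ratio(trace, capacity):
    """Replay a block-access trace through a toy LRU cache and return
    the hit ratio. Illustrative only -- Alluxio's real policies are
    more sophisticated (admission control, shadow cache, etc.)."""
    cache = OrderedDict()
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # refresh recency on a hit
        else:
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(trace) if trace else 0.0

# Hot blocks dominate, so even a small cache yields a decent hit ratio
trace = ["b1", "b2", "b1", "b3", "b1", "b2", "b4", "b1"]
print(simulate_lru_hit_ratio(trace, capacity=3))  # → 0.5
```

Sweeping `capacity` over such a trace is one way to decide how much local SSD to dedicate per worker before changing production settings.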

8.2 Effective Governance

A key challenge Presto faces in a big data environment is small files. Planned small-file governance, effective merging, and well-designed ETL help not only Presto but the entire big data platform.

Alongside this, we need to keep building query governance for Presto itself to ensure optimal use of system resources: real-time query tracking, intelligent resource allocation, and queue management all help protect the business.

8.3 Vectorization

Vectorization has been revolutionary for OLAP engines, ushering in a new era of query processing speed.

In the future, we will consider introducing Velox to accelerate Presto queries.
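The core idea, sketched below in Python for illustration (this is the batching concept only, not Velox code): process values a batch at a time instead of one row per call, which amortizes per-value interpretation overhead and lets a native engine map the inner loop onto tight SIMD kernels over columnar buffers.

```python
def sum_rows(rows):
    # Row-at-a-time model: one interpreted step per value
    total = 0
    for r in rows:
        total += r
    return total

def sum_batches(rows, batch_size=1024):
    # Vectorized model: operate on whole batches; a native engine
    # like Velox runs each batch through a compiled SIMD kernel
    total = 0
    for i in range(0, len(rows), batch_size):
        total += sum(rows[i:i + batch_size])  # one "kernel" per batch
    return total
```

Both produce identical results; the win in a real engine comes from the batch loop being native code over contiguous columnar memory rather than an interpreted per-row path.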

9 See also

Trino Chinese-character bug https://github.com/trinodb/trino/issues/3517

RaptorX https://prestodb.io/blog/2021/02/04/raptorx

RaptorX https://github.com/prestodb/presto/issues/13205

Soft affinity scheduling (Hive) https://github.com/prestodb/presto/pull/14095

Soft affinity scheduling https://github.com/prestodb/presto/pull/13966

Presto fragment result cache implementation https://github.com/prestodb/presto/pull/15155

Dynamic filtering / dynamic partition pruning https://trino.io/blog/2020/06/14/dynamic-partition-pruning.html

Alluxio Local Cache https://github.com/Alluxio/alluxio/pull/10748

Alluxio official configuration reference https://docs.alluxio.io/os/user/stable/en/reference/Properties-List.html

Alluxio Shadow Cache https://zhuanlan.zhihu.com/p/441933685

Alluxio monitoring guide https://zhuanlan.zhihu.com/p/544775228

Alluxio: tapping idle performance https://zhuanlan.zhihu.com/p/628091188

Alluxio consistent hashing https://www.alluxio.com.cn/consistent-hashing/

Reducing Presto latency with Alluxio data caching https://zhuanlan.zhihu.com/p/526847158

Presto Fragment Result Cache analysis https://zhuanlan.zhihu.com/p/527057990

Query acceleration practice for OLAP scenarios https://mp.weixin.qq.com/s/cWIdal_Szh8fgZTN6gBpig

Presto's caching mechanism https://zhuanlan.zhihu.com/p/196398077

Uber https://zhuanlan.zhihu.com/p/442876203

How Kingsoft Cloud pairs Presto with Alluxio https://zhuanlan.zhihu.com/p/53

Author: Chen Dacang

Source: https://zhuanlan.zhihu.com/p/661179006