
The "digital warehouse" is no longer the original "digital warehouse".

In this article, we would like to discuss the following questions:
  1. Under the existing data engineering system, what exactly does data warehouse mean?
  2. Where does OLAP sit within data engineering?
  3. Why is there a data lake?

Setting aside the various modeling methodologies and the many ways data warehouses are implemented, a DW provides two main capabilities:

  1. Centralized and unified storage of data
  2. Warehouse modeling: using data to describe the business for analysis and decision-making
The "digital warehouse" is no longer the original "digital warehouse".

Diverse data sources

With the spread and development of the mobile Internet, the main subject of data analysis has shifted from transaction data to behavioral data; in other words, most of the data analyzed today is user behavior data. Once mobile devices reached a certain level of penetration, the number of apps declined and everyone began competing for the existing market, so understanding user behavior habits became even more important, as typified by the tool chain of Growth Hacking. The infrastructure is also different from the past: in the early years, when people talked about DW they mostly meant OLAP databases such as Greenplum (GP) and Vertica, whereas today DW infrastructure refers to a whole collection of infrastructure and engineering solutions, which includes:

  • Behavioral data collection and reporting
  • Streaming ETL
  • DataSync (RDBMS -> Hive); a minimal sketch follows this list
  • Data governance
    • Data lineage
    • Data quality
    • Metadata center, data map
    • Metric/indicator system
    • Modeling tools, and so on
  • Model building
  • Foundational components of the Hadoop ecosystem
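To make the DataSync item above concrete, here is a minimal sketch of what an RDBMS -> Hive sync job might look like with PySpark. All connection details, table names, and the ODS target below are hypothetical placeholders, not a prescription of any particular tool.

```python
# Minimal sketch of a DataSync-style job (RDBMS -> Hive); all names, URLs,
# and credentials below are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("datasync-orders")
         .enableHiveSupport()          # needed to write into the Hive metastore
         .getOrCreate())

# Source: an operational MySQL database, read over JDBC
# (assumes the MySQL connector jar is on the Spark classpath).
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://mysql.internal:3306/shop")
          .option("dbtable", "orders")
          .option("user", "etl")
          .option("password", "***")
          .load())

# Target: an ODS-layer Hive table, partitioned by ingestion date.
(orders
 .withColumn("dt", F.current_date().cast("string"))
 .write.mode("overwrite")
 .partitionBy("dt")
 .saveAsTable("ods.orders"))
```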

Under the current data engineering system, the underlying infrastructure is not a database; it mainly consists of a distributed file system, batch computing engines, and resource scheduling. From the storage perspective, HDFS is a schema-on-read system: on the one hand it is friendly to writing raw data of any shape, but its obvious shortcoming is weak analytical capability, which is why the various SQL-on-Hadoop computing engines came into being. In any case, what a computing engine solves is how to accelerate batch computation. With one copy of data served by multiple engines, the cost of moving data between them is extremely high, and this is still the situation on most data platforms today.
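A small sketch of what schema-on-read means in practice: raw events are dumped onto HDFS as-is, and a schema only appears when a SQL-on-Hadoop engine (Spark SQL here) reads them back for analysis. The path and field names are assumptions for illustration.

```python
# Schema-on-read illustration: raw events land on HDFS as-is; a schema is only
# imposed when a SQL engine reads them back. Path and fields are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Writing is the "friendly" part: collectors dump whatever they produce to HDFS.
raw_path = "hdfs:///data/raw/events/2021-06-01/"   # hypothetical path

# Reading is where the schema appears: Spark infers one from the JSON
# and analysis happens on top of it.
events = spark.read.json(raw_path)
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT event_type, COUNT(*) AS pv   -- event_type is an assumed field
    FROM events
    GROUP BY event_type
""").show()
```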

Now we can talk about the OLAP engine: a database built on a multidimensional data model and optimized for analytical workloads, which means an OLAP server is essentially a DW. Returning to the lack of analytical capability mentioned above, the essential difference between an OLAP engine and the various computing engines (Presto, Hive, Spark SQL, MapReduce) is that it has its own storage; it is a real database, so its query response time is better than that of a computing engine. Under the existing data engineering system, the OLAP engine acts more like a "storage acceleration layer" that makes up for what the underlying infrastructure lacks: real-time writes and fast query and analysis. However, as mentioned earlier, one function of a DW is centralized storage of data. Once an OLAP engine is introduced, which data lives in OLAP and which stays in HDFS? It becomes a question of cost and trade-offs, and it can be analyzed from two perspectives:

  1. From the perspective of warehouse modeling and analysis, most Internet/mobile Internet companies use dimensional modeling, and most of them degenerate the dimensions; the final delivered model, if it is to be convenient to analyze, is almost always a wide table that connects the business processes at a single data granularity. Therefore DW tables (DWD, DWS) and the layers above them are suitable for importing into the OLAP engine (see the sketch after this list); the specifics, of course, depend on the business and on whether different business units need cross-unit correlation analysis.
  2. From the perspective of the capabilities and characteristics of OLAP engines, MOLAP, ROLAP, and HOLAP each focus on different scenarios. Taking ClickHouse and Druid as examples, their support for joins is limited (Druid does not support them at all), whereas Doris does support them. If you want to use a star or constellation schema for multi-fact correlation analysis, and at the same time connect BI products on top of the OLAP engine while keeping query performance acceptable, it is hard.
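As a rough illustration of the wide-table point in item 1, the sketch below degenerates the dimensions into the fact table at ETL time, so the resulting table can be imported into an OLAP engine and queried without joins. All table and column names are hypothetical.

```python
# Sketch: degenerate the dimensions into a wide table at ETL time so the
# downstream OLAP engine only ever scans one table and never has to join.
# All table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("build-dws-wide")
         .enableHiveSupport()
         .getOrCreate())

fact = spark.table("dwd.order_events")   # fact table: one row per order event
dim_user = spark.table("dim.users")      # dimension: user attributes
dim_item = spark.table("dim.items")      # dimension: item attributes

wide = (fact
        .join(dim_user, "user_id", "left")   # the joins happen here, in batch,
        .join(dim_item, "item_id", "left"))  # not inside the OLAP engine

# Persist the wide table; a separate load job can then import it into
# ClickHouse/Druid/Doris, where queries stay join-free.
wide.write.mode("overwrite").saveAsTable("dws.order_events_wide")
```

The trade-off, of course, is that the joins move into batch ETL, at the cost of extra storage and of rebuilding the wide table whenever a dimension changes.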

What problem does DataLake solve?

DataLake is not a new concept; the original problem it addressed was storing semi-structured and unstructured data in a unified way, but engineers who grew up in the Hadoop ecosystem are indifferent to this: why isn't HDFS already a natural "lake"? You can stuff any data into it. So, within the Hadoop ecosystem, DataLake in the narrow sense refers more to the various table formats (Hudi, Iceberg, etc.), which build on columnar storage formats underneath and on the atomic operations provided by HDFS, giving the plain file system database-like capabilities. At the same time, the old Lambda architecture no longer needs to maintain two separate data processing pipelines, which makes the architecture leaner.
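As a hedged sketch of what a table format adds on top of plain HDFS files, the example below uses Hudi's upsert path: record-level updates with atomic commits instead of rewriting whole partitions. It assumes the Hudi Spark bundle is on the classpath; the path, record key, and precombine field are made-up examples.

```python
# Sketch of what a table format (Hudi here) adds on top of plain HDFS files:
# record-level upserts with atomic commits instead of rewriting whole
# partitions. Assumes the Hudi Spark bundle is on the classpath; the path,
# record key, and precombine field are made-up examples.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

# A hypothetical changelog/CDC feed that contains inserts and updates.
updates = spark.read.json("hdfs:///data/raw/orders_changelog/")

(updates.write.format("hudi")
 .option("hoodie.table.name", "orders")
 .option("hoodie.datasource.write.recordkey.field", "order_id")     # primary key
 .option("hoodie.datasource.write.precombine.field", "updated_at")  # latest wins
 .option("hoodie.datasource.write.operation", "upsert")
 .mode("append")
 .save("hdfs:///warehouse/lake/orders"))

# Readers see a consistent table view, so the same dataset can serve both the
# batch path and the streaming path instead of maintaining two Lambda pipelines.
```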

With these new capabilities, the lake strengthens the real-time nature of data on the one hand and, on the other, does away with the so-called "real-time data warehouse". "Real-time data warehouse" is a strange term in itself; think about why. Many enterprises simply store traffic/behavioral data separately in the so-called real-time warehouse (Kafka -> ClickHouse). The data in the warehouse is then no longer centralized, which makes analysis harder; if you insist on doing it this way, you need a federated query engine to make up for it. See the problem? Data warehouse engineers who deliver business value do not understand the underlying infrastructure: what capabilities it provides, how to make good use of them, how to model and design, and so on. And even though the architecture has become leaner, the dependencies of the data pipelines are complex enough that "I just want to get a number" often ends with not being able to get it, which is painful.
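To illustrate the "federated query engine to make up for it" remark, here is a hypothetical sketch using the Trino Python client to join a warehouse table in Hive with traffic data in ClickHouse. The host, catalogs, and table names are assumptions, and the connectors would have to be configured on the Trino side.

```python
# Hypothetical federated query: join a warehouse table in Hive with traffic
# data in ClickHouse through Trino. Host, catalogs, schemas, and table names
# are assumptions; the connectors must be configured on the Trino side.
import trino

conn = trino.dbapi.connect(host="trino.internal", port=8080, user="analyst")
cur = conn.cursor()
cur.execute("""
    SELECT o.dt, COUNT(DISTINCT t.user_id) AS active_buyers
    FROM hive.dws.order_events_wide AS o        -- batch/warehouse side
    JOIN clickhouse.default.page_views AS t     -- real-time/traffic side
      ON o.user_id = t.user_id AND o.dt = t.dt
    GROUP BY o.dt
""")
print(cur.fetchall())
```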

How are the users' problems solved?

Even now, the big data departments of even the largest companies are still plagued by the following problems:

  • It is difficult to clarify the relationship between technology costs and business benefits, and sustainable control of TCO (Total Cost of Ownership) is needed
  • Data delivery capability has always been poor

Which business benefits come from improved data capabilities? For the engineers of a data department at scale, being able to produce accurate data on time is already an achievement; there is little room to talk about business value. Business growth drives up storage and computing costs, and improved data capabilities should in turn push the business in a better direction; that ought to be a virtuous cycle. Yet the data is still hard to use.

Summary: there have been successful cases of data construction around the Hadoop ecosystem, but there are also historical reasons for them; from a personal point of view, they look more like decisions made under a particular technical trend. With many Hadoop open source projects retiring, and with the cloud being accepted by more and more large enterprises (state-owned enterprises, banks, insurers), lower-cost products that add more business value will gradually come to the fore. Don't believe it? Look at Snowflake and Databricks. But this will surely be a long road, and for the many OLAP vendors, how to integrate with the open source ecosystem in the short term and how to reduce migration costs remain questions to consider.

Author: Zhen Pang

Source: https://zhuanlan.zhihu.com/p/379978263