
DataCake: A Detailed Look at a Cross-Cloud Big Data Platform

author:DataFunTalk

In just two years, SHAREit Technology's product SHAREit (known domestically as SHAREit Express) grew to more than 1.2 billion users worldwide, and by 2019 it had surpassed 1.8 billion. To date, the company's product matrix has accumulated nearly 2.4 billion installed users globally. This massive data scale, together with the rapid growth of the business, places complex and demanding requirements on the big data platform.

This article introduces DataCake, a cross-cloud, self-service big data platform developed in-house by the big data team at SHAREit Technology. It covers the following three parts:

1. Background & challenges of big data platforms

2. DataCake solution

3. DataCake future planning

Sharing Guest | Shaoquan Zhang, Technical Director of the Big Data Department, SHAREit Technology

Editor | Leo, Sunline Technology

Production Community | DataFun

01

Background & Challenges

1. Background


DataCake was created to meet the data needs of enterprises, for whom the importance of data is now self-evident. This can be summarized in three aspects:

(1) Data volume: Data has become the core asset of enterprises, and the volume of data produced globally continues to grow exponentially.

(2) Application scenarios: Data is being applied more broadly and deeply; data-driven practices now run through the entire product pipeline, and data science has become the fourth paradigm of scientific research.

(3) Data potential: The data analytics market holds enormous potential, with data warehousing and data science continuing to attract capital; research suggests that data-driven businesses can achieve an additional 30% growth per year.

2. Challenges


Realizing the value of data involves many challenges. Based on the DataCake team's communication with internal stakeholders and external customers, these challenges can be grouped into three categories, each seen from a different perspective:

(1) Business leader

Digitizing the business is easy, but that only gets the data recorded; monetizing the data is the key to turning it into real value. Meanwhile, the operating costs of big data pipelines remain high, and a large number of historical tasks and business jobs must be maintained.

(2) Data analyst and data scientist

In most companies, the big data department is a centralized middle-office team, so data requirements and development require cross-team communication between business and engineering; the process is complex and the development and scheduling cycle is long. The big data ecosystem also has many components, and analysts with a weaker technical background face a high learning cost.

(3) Technical leader

As business teams iterate rapidly through trial and error, table and ETL tasks multiply quickly, and task ownership, lineage dependencies, and data permissions become muddled. There are also many big data and cloud computing products, so the technical architecture is complex and opaque to users.

The above challenges boil down to two major problems:

(1) Data fails to deliver value: substantial costs have been invested, but the business value of data cannot be clearly seen.

(2) Data governance is hard to implement: business requirements are complex, historical jobs are numerous, big data components are scattered, and building a coherent data system is difficult.

Three statistics illustrate the harm caused by these two problems:

(1) 66% of data is unutilized;

(2) 84% of managers do not believe in the value of data;

(3) 70% of enterprises do not have an efficient data architecture.

--

02

DataCake Solution

1. Data mesh thinking

To solve these problems, DataCake adopted Data Mesh, an architectural approach to data-driven organization. The idea is to drive organizational change through software architecture; one of its core principles is to transform the centralized data team into domain-driven teams and make the business responsible for its own data.


Whereas a traditional organization has a single centralized data team serving multiple business units, Data Mesh is a distributed, domain-driven way of collaborating on data. The main change is that each department owns and is responsible for its own data. Data Mesh accomplishes this in three ways:

(1) Self-serve platform: With a self-service data platform, business teams can easily carry out the development work for their own data requirements.

(2) Data as a Product: Productizing data promotes cross-team data collaboration and improves the efficiency of data utilization.

(3) Federated governance: Alongside distributed data development and application, a centralized governance mechanism is still needed to ensure data security and quality.

DataCake is a big data platform built on the Data Mesh idea, and putting this idea into practice has in turn driven organizational change: department leads can combine their actual needs with their domain knowledge to experiment and iterate in an agile way.


2. DataCake's four main functional areas

Specifically, DataCake provides functionality in four main areas:


(1) Self-service big data application platform

It gives business users a low-barrier way to work with data: building task pipelines in a low-code manner and developing data warehouses; unified data analysis; data visualization, custom report development, and more.

(2) Intelligent data governance and security management

It provides multi-dimensional data cost billing and uses intelligent engines to assist with data governance and data permission management.
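To make multi-dimensional cost billing more concrete, the sketch below attributes compute cost along different dimensions (team, task, engine). The usage records, prices, and field names are invented for illustration and do not reflect DataCake's actual billing model.

```python
from collections import defaultdict

# Hypothetical usage records; fields and prices are illustrative assumptions.
usage = [
    {"team": "growth", "task": "dws_daily_user", "engine": "spark", "cpu_hours": 120.0},
    {"team": "growth", "task": "ads_report",     "engine": "trino", "cpu_hours": 8.0},
    {"team": "ads",    "task": "ods_ingest",     "engine": "spark", "cpu_hours": 40.0},
]
PRICE_PER_CPU_HOUR = {"spark": 0.05, "trino": 0.08}  # assumed prices in USD

def bill_by(dimension: str) -> dict:
    """Aggregate cost along one dimension (team, task, or engine)."""
    totals = defaultdict(float)
    for rec in usage:
        totals[rec[dimension]] += rec["cpu_hours"] * PRICE_PER_CPU_HOUR[rec["engine"]]
    return dict(totals)

print(bill_by("team"))    # cost attributed to each team
print(bill_by("engine"))  # cost attributed to each engine
```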

(3) Unified data management platform

It unifies the management of metadata and builds a data asset catalog so that data can be collected, discovered, and used, breaking down data silos between departments. It also provides data quality monitoring to ensure that data is valid and available.

(4) Integrated lakehouse architecture

Data generated by the business is fed directly into the lake, and detail-level data can be analyzed directly, reducing the cost of building pipelines. Meanwhile, data with less demanding freshness requirements can be further built out into warehouse layers.
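As a rough illustration of this lake-first pattern (raw data landing directly in the lake and being analyzable at detail level), the sketch below uses PySpark with Delta Lake. The table names, paths, event fields, and the choice of Delta Lake are assumptions for illustration, not DataCake's actual stack, and the delta-spark package must be available for the snippet to run.

```python
from pyspark.sql import SparkSession

# Assumed setup: Delta Lake is used purely as an example of a lakehouse table format.
spark = (
    SparkSession.builder
    .appName("lake-first-ingestion-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Business events are appended to a lake table as-is, without a staging pipeline.
raw_events = spark.read.json("s3a://example-bucket/raw/events/2023-07-01/")
raw_events.write.format("delta").mode("append").saveAsTable("lake.raw_events")

# Detail-level data is immediately queryable for analysis
# (event_type is an assumed field in the raw events).
spark.sql("""
    SELECT event_type, COUNT(*) AS cnt
    FROM lake.raw_events
    GROUP BY event_type
""").show()
```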

3. DataCake technology architecture at a glance


DataCake is a cloud-native, multi-cloud, all-in-one data warehouse platform.

(1) At the infrastructure level (IaaS)

DataCake builds on the IaaS offered by existing cloud vendors to provide a unified application service layer across different cloud providers and on-premises data centers. This takes full advantage of each provider's strengths while avoiding over-reliance on any single vendor and the resulting vendor lock-in.

(2) At the platform level (PaaS)

DataCake provides a cross-scenario serverless computing platform that supports a wide range of compute engines, covering ad hoc queries, batch processing, real-time stream computing, and cloud vendors' native interfaces. It also provides efficient cluster management, with easy horizontal and vertical scaling.

(3) At the service level (SaaS)

For end users, DataCake supports many compute engines and data application platforms, including Hue and Tableau, so that different scenarios and applications are covered and the best compute engine can be selected intelligently for each scenario. For system administrators, DataCake provides management portals supporting cloud resource management, cluster deployment, resource optimization, and cross-cloud, cross-source data management and permission management.

4. Solution realization

(1) Minimalist data analysis


DataCake gives data users a single page for accessing any data source, facilitating the use, collaboration, and sharing of data: databases, warehouses, lakes, and cloud storage are all reachable from one entry point.

In addition, DataCake matches the best engine to the characteristics of the SQL script and the data source type, so analysts do not need to choose a compute engine themselves. It also supports the Data as a Product concept: data can be shared in the form of APIs, and SQL code and templates can be shared as well.
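To make the idea of automatic engine selection concrete, here is a minimal Python sketch of the kind of heuristic such a router might apply. The thresholds, engine names, and inputs are assumptions for illustration, not DataCake's actual routing logic.

```python
def choose_engine(sql: str, scanned_bytes: int, source_type: str) -> str:
    """Pick a compute engine from simple features of the query (illustrative only)."""
    sql_lower = sql.lower()

    # Streaming sources go to a stream processor.
    if source_type in {"kafka", "kinesis"}:
        return "flink"

    # Small, read-only, interactive scans suit an ad hoc MPP engine.
    if scanned_bytes < 10 * 1024**3 and "insert" not in sql_lower:
        return "trino"

    # Large scans, writes, or heavy transformations suit batch Spark.
    return "spark"


print(choose_engine("SELECT * FROM dws.orders WHERE dt = '2023-07-01'",
                    scanned_bytes=2 * 1024**3, source_type="hive"))    # trino
print(choose_engine("INSERT OVERWRITE TABLE ads.daily SELECT ...",
                    scanned_bytes=500 * 1024**3, source_type="hive"))  # spark
```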

(2) Low-threshold data development


DataCake turns commonly used development processes into templates, encapsulating dozens of them and covering the full flow from data ingestion through warehouse transformation to data distribution. With template-based development, business users can build an entire data pipeline without developer involvement. DataCake also supports visual analysis of data lineage and ETL links.
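As a hedged sketch of how a templated pipeline and its lineage might be represented, the snippet below declares three templated tasks and derives an execution order from the tables they read and write. The template names and fields are hypothetical, not DataCake's real template schema.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical pipeline declaration: template names, tables, and fields are illustrative.
pipeline = [
    {"task": "ingest_orders",    "template": "jdbc_to_lake",
     "inputs": [],               "output": "ods.orders"},
    {"task": "build_dwd_orders", "template": "sql_transform",
     "inputs": ["ods.orders"],   "output": "dwd.orders"},
    {"task": "export_report",    "template": "lake_to_warehouse",
     "inputs": ["dwd.orders"],   "output": "ads.daily_orders"},
]

# Task-level lineage: a task depends on whichever task produces its input tables.
producer = {t["output"]: t["task"] for t in pipeline}
deps = {t["task"]: {producer[i] for i in t["inputs"] if i in producer}
        for t in pipeline}

# The same dependency graph can drive both scheduling order and lineage display.
print(list(TopologicalSorter(deps).static_order()))
# ['ingest_orders', 'build_dwd_orders', 'export_report']
```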

(3) Unified data management


DataCake consolidates data management requirements onto one platform: unified management, discovery, and monitoring of data from different sources such as data lakes, data warehouses, and databases, eliminating data silos, promoting data collaboration, and ensuring data quality and security.

On the one hand, DataCake supports multi-source data registration, which can include relevant business information and data lineage, and it provides an entry point for data search and query to meet data exploration needs.

On the other hand, it provides federated governance capabilities: fine-grained permission management, detailed audit information, and comprehensive data quality monitoring. In this way, the data needs of different business teams can be met while data security is guaranteed.
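A minimal sketch of what fine-grained permission checking could look like in such a federated model; the grant structure and field names are assumptions for illustration, not DataCake's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    principal: str       # user or team
    table: str           # e.g. "dwd.orders"
    columns: frozenset   # empty set means all columns are allowed
    action: str          # "select", "write", ...

# Hypothetical grants; a real system would load these from a policy store.
GRANTS = [
    Grant("analyst_team", "dwd.orders", frozenset({"order_id", "amount"}), "select"),
    Grant("etl_service",  "dwd.orders", frozenset(), "write"),
]

def is_allowed(principal: str, table: str, columns: set, action: str) -> bool:
    """Allow the request only if some grant covers the table, action, and columns."""
    for g in GRANTS:
        if g.principal == principal and g.table == table and g.action == action:
            if not g.columns or columns <= g.columns:
                return True
    return False

print(is_allowed("analyst_team", "dwd.orders", {"amount"}, "select"))      # True
print(is_allowed("analyst_team", "dwd.orders", {"user_phone"}, "select"))  # False
```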

(4) Intelligent data governance


DataCake is a practitioner of data governance on the public cloud. Across three levels (observability, governance, and automation), it gives users an at-a-glance view of data assets and a one-click governance experience, turning project-style data governance into a daily workflow.

(1) At the observability level, DataCake provides fine-grained information at the system, data, and business levels.

(2) At the governance level, it scores compute tasks and compute resources and detects operational issues.

(3) At the automation level, DataCake turns the governance workflows of experts into product features and applies AI/ML to make the platform smarter, so that data governance becomes standardized, automated, and intelligent.
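As an illustration of task-level scoring, the sketch below combines cost, stability, and output usage into a single health score. The metrics, weights, and thresholds are invented for illustration and are not DataCake's actual scoring model.

```python
def task_health_score(daily_cost_usd: float,
                      failure_rate: float,
                      days_since_output_read: int) -> float:
    """Return a 0-100 governance score; lower scores are better candidates
    for cleanup or optimization. Weights are illustrative assumptions."""
    score = 100.0
    score -= min(daily_cost_usd / 10.0, 40.0)   # penalize expensive tasks (capped)
    score -= failure_rate * 30.0                 # penalize unstable tasks
    if days_since_output_read > 30:              # penalize outputs nobody reads
        score -= 30.0
    return max(score, 0.0)

# A cheap, stable, well-used task scores high; an expensive, unused one scores low.
print(task_health_score(daily_cost_usd=5, failure_rate=0.01, days_since_output_read=1))
print(task_health_score(daily_cost_usd=400, failure_rate=0.2, days_since_output_read=90))
```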

(5) Cross-cloud serverless


Because DataCake is a PaaS built on multiple cloud platforms, it provides a multi-cloud platform for cluster management, optimization, and compute deployment, choosing the type of virtual environment and the cluster size according to the characteristics of the business. In addition, based on business scenarios, clusters, and application load, DataCake can adaptively and elastically scale resources, fully exploiting the elasticity of cloud resources and delivering considerable cost savings. Finally, DataCake can efficiently adapt to different instance types, such as Spot and ARM, to reduce compute costs and improve performance.
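A rough sketch of the kind of scaling decision described here, preferring cheaper interruptible capacity and falling back to on-demand nodes. The sizing rule and the Spot-first preference are assumptions, not DataCake's real autoscaling policy.

```python
def plan_scale_out(pending_tasks: int, running_nodes: int,
                   tasks_per_node: int = 4,
                   spot_available: bool = True) -> dict:
    """Decide how many nodes to add and which capacity type to prefer (illustrative)."""
    needed_nodes = -(-pending_tasks // tasks_per_node)  # ceiling division
    add = max(needed_nodes - running_nodes, 0)
    if add == 0:
        return {"add": 0, "capacity": None}

    # Prefer interruptible Spot (or ARM-based) capacity when available,
    # otherwise fall back to on-demand instances.
    capacity = "spot" if spot_available else "on_demand"
    return {"add": add, "capacity": capacity}

print(plan_scale_out(pending_tasks=30, running_nodes=3))
# {'add': 5, 'capacity': 'spot'}
print(plan_scale_out(pending_tasks=30, running_nodes=3, spot_available=False))
# {'add': 5, 'capacity': 'on_demand'}
```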

--

03

DataCake plans for the future


1. Product level

A fully managed SaaS offering will soon be launched on multiple cloud providers, so stay tuned.

2. Technical level

We will continue to build an open-source, intelligent, one-stop big data platform along the three dimensions of efficiency, intelligence, and openness, so that business data can deliver greater value.

That's it for today's sharing, thank you.

