About the Author
Kongge Baishi, a senior R&D manager at Ctrip, focuses on performance and efficiency improvement and architecture optimization.
This article gives an overview of the design of the middle platform for Ctrip's air ticket basic data processing. The first part describes the background and challenges; the second part introduces the design principles and goals; the third part details the key technical practices of the middle platform architecture, covering data consistency, data timeliness, system robustness, and system automation; and the final parts present the overall technical architecture and its results in production. We hope it proves helpful and inspiring.
- I. Background and Challenges
- II. Principles and Objectives
- III. Key Technical Practices
- 3.1 Data Consistency
- 3.2 Data Timeliness
- 3.3 System Robustness
- 3.4 Consumption Process Optimization
- 3.5 Unified Data Governance
- IV. Technical Architecture Overview
- V. Results
- VI. Future Plans
I. Background and Challenges
With the rapid development of Ctrip's international air ticket business and the deepening of its globalization strategy, both the variety of data we rely on and its complexity have increased significantly. This growth not only challenges data management, but also raises the bar for data accuracy and real-time performance.
Next, let's look at the challenges from both producer and consumer perspectives.
Challenges from the producer's perspective: for producers of basic data, each data type carries its own business logic, covering data acquisition, processing, storage, and matching. Update cycles are long, business complexity is high, the number of applications related to basic data has grown into the dozens, and maintenance costs keep rising. At the same time, there are problems such as low data-access efficiency, a complex cloud migration process, difficulty rolling data back, and inability to cope with large-scale machine restarts or data refreshes.
Challenges from the consumer's perspective: every application depends on the basic data to a greater or lesser extent, so an error in any one kind of data can cause wide-ranging online problems, and such problems are often discovered only after a delay, which amplifies their impact in production. In addition, there are problems such as complex service development and onboarding, unstable test environments, slow service startup, frequent garbage collection, a cumbersome cloud migration process, and inconsistent data.
II. Principles and Objectives
Facing these problems, building a middle platform system is one of the solutions. We hoped to achieve efficient integration of data resources, eliminate data silos, improve data processing efficiency, and guarantee data quality. In the early stages of system design, we established a set of core goals from the perspectives of data producers and consumers:
From the data producer's perspective:
- Data consistency: ensure data is consistent at every stage and avoid divergence.
- Data timeliness: ensure real-time data updates to meet the business's demand for data immediacy.
- System robustness: build a stable, reliable architecture that copes with varied operating environments and load conditions.
  - Data traceability: track data through its whole lifecycle, making problem location and historical analysis easy.
  - Data rollback: provide data versioning so the system can quickly fall back to a stable state when a problem occurs.
  - Monitoring completeness: establish comprehensive monitoring of data flow and system status in real time.
- Cost reduction: cut machine and storage costs by optimizing resource allocation, and simplify system maintenance to reduce the investment of manpower and time.

From the data consumer's perspective:
- Optimized consumption process: simplify data consumption and improve its convenience and efficiency.
  - Simplified access: provide intuitive, easy-to-use access methods that lower the threshold for using data.
  - Unified data model: establish a unified data model to ensure data consistency and comprehensibility.
- Environment independence: solve data synchronization across different operating environments.
  - Complete test environment: provide a complete test environment to ensure data accuracy and stability.
  - Convenient cloud services: optimize cloud service access to improve the flexibility and scalability of data services.
- Improved service performance: raise response speed and throughput through technical optimization.
  - Reduced startup time: shorten service startup to improve the system's rapid-response capability.
  - Fewer GC pauses: optimize memory management to reduce garbage collection (GC) triggered by data updates, improving performance.
III. Key Technical Practices
During the construction of the basic data middle platform we encountered a series of problems; this chapter introduces the key technical practices that emerged from solving them.
3.1 Data Consistency
3.1.1 Versioning
Machines in the same cluster refresh their caches on different schedules, so their data can diverge. To solve this problem, we adopted a data versioning policy: every time the data changes, a new version is created. A version can be a complete copy of all data, changed and unchanged ("full data"), or an update containing only the changes ("incremental data"). In either case, the data is recorded in a BLOB (Binary Large Object) file, and the BLOB file serves as the data transfer medium.
With data versioning in place, we gained the following benefits (a version-descriptor sketch follows the list):
- Version tracking: every data update is traceable, and if a new version goes wrong we can quickly revert to the previous one.
- Data consistency: all data consumers access the same version of the data, reducing problems caused by version skew.
- Fault tolerance and recovery: problem data can be identified quickly, and multiple notification mechanisms prompt the product or development team to intervene in time.
- Performance monitoring: a monitoring system built on data versions evaluates the efficiency of data transmission and processing.
- Data security: encryption and access control protect data during transmission and storage and prevent unauthorized access.
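To make the mechanism concrete, here is a minimal sketch of a version descriptor, with hypothetical names; the article does not publish Ctrip's actual schema:

```java
import java.time.Instant;

/**
 * Minimal sketch of a data-version descriptor (hypothetical names).
 * Each BLOB file published by the platform would carry metadata like
 * this so consumers can detect gaps, verify integrity, and roll back.
 */
public record BlobVersion(
        String dataType,      // e.g. "city", "country"
        long version,         // monotonically increasing version number
        boolean full,         // true = full snapshot, false = incremental delta
        long baseVersion,     // for incremental data: the version it applies on top of
        String checksum,      // integrity check for the BLOB file
        Instant createdAt) {

    /** An incremental file is only usable if the consumer already holds its base. */
    public boolean appliesTo(long currentVersion) {
        return full || baseVersion == currentVersion;
    }
}
```

A consumer holding version 41 would accept a full snapshot of any newer version, or an incremental file whose `baseVersion` is 41, and skip anything else.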
3.1.2 Decentralization
In the industry, when a service needs to consume basic data, the mainstream solutions are to connect directly to the database or to call an application API, as shown in Figures 1 and 2 below; both are C/S (client/server) architectures. Although widely adopted, the C/S architecture suffers from high database pressure, difficulty in ensuring data consistency, limited scalability, complex development and integration, high hardware cost, cache penetration, heavy read pressure on the central service, and difficulty in absorbing traffic peaks.
Figure 1
Figure 2
To overcome these problems, we introduced a Peer-to-Peer (P2P) architecture. Unlike the C/S architecture, the P2P architecture is decentralized: each node in a P2P network acts as both client and server, and can communicate and exchange data directly with other nodes without relying on a central server. Moreover, data query time in the P2P architecture does not grow linearly with the number of clients. Figure 3 illustrates the essential difference between the C/S and P2P architectures.
Figure 3
Building on the BLOB file versioning mechanism above, we chose the BitTorrent protocol as the file sharing and distribution standard in the P2P network; its efficient data transfer makes BitTorrent especially suitable for rapidly distributing large-scale data. As shown in Figure 4, a client can decide whether to join the peer network according to its actual needs; the red machine, for example, only downloads data and does not seed it to other peers (a policy sketch follows the figure).
Figure 4
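As a rough illustration of the opt-in seeding behavior in Figure 4, the sketch below models a per-client peer policy; the class and field names are assumptions, since the article does not describe the client's configuration surface:

```java
/**
 * Sketch of a per-client peer policy (hypothetical names). It captures
 * the behavior in Figure 4: every node downloads via BitTorrent, but
 * seeding is optional, so constrained machines can opt out of serving
 * other peers.
 */
public class PeerPolicy {
    private final boolean seedAfterDownload; // false = download only, like the red machine in Figure 4
    private final int maxUploadKbps;         // cap upload bandwidth when seeding

    public PeerPolicy(boolean seedAfterDownload, int maxUploadKbps) {
        this.seedAfterDownload = seedAfterDownload;
        this.maxUploadKbps = maxUploadKbps;
    }

    public boolean shouldSeed() { return seedAfterDownload; }
    public int uploadLimitKbps() { return maxUploadKbps; }
}
```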
3.1.3 Data Combination Strategy Optimization
In practice, we sometimes need to combine multiple kinds of basic data. For example, when consuming country data we may also need the data of the continent each country belongs to; if the versions of the combined datasets are inconsistent, the result is inaccurate.
To address this, we consolidate all associated datasets into a single torrent file according to business rules, so the whole bundle is downloaded in one operation. The advantage of this integration is that consumers get the entire data collection at once, without collecting and reconciling data from different sources or versions. This not only simplifies data acquisition, but more importantly guarantees the consistency and accuracy of the data.
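The sketch below illustrates the idea of pinning related datasets together, reusing the hypothetical BlobVersion descriptor from 3.1.1; the names are assumptions, not Ctrip's actual schema:

```java
import java.util.List;

/**
 * Sketch of a bundle manifest for the combination strategy (hypothetical
 * names). Related datasets that must be version-consistent, e.g. country
 * and continent, are pinned together and shipped as one torrent.
 */
public record BundleManifest(
        String bundleId,             // e.g. "geo-bundle"
        long bundleVersion,          // version of the bundle as a whole
        List<BlobVersion> members) { // the pinned member versions (see BlobVersion above)
}
```

Because the member versions are frozen inside one manifest and shipped as one torrent, a consumer can never mix a new country file with an old continent file.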
3.1.4 Single-Point Data Generation Strategy
To ensure data consistency on the consumer side, we use a single-point loading mode when generating full and incremental data, as shown in Figure 5. Only a single machine queries the database, which cuts the database's read pressure to 1/n of what direct consumer access would cause (n being the total number of consumer machines). Combined with the versioning strategy, this also guarantees that only one valid data version exists in the middle platform system at any moment (a sketch follows Figure 5).
Figure 5
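The sketch below captures the single-point idea in simplified form; a local lock stands in for whatever distributed coordination a real cluster would use, which the article does not specify:

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of single-point generation (simplified). Exactly one generator
 * runs at a time, performs the only database read in the pipeline, and
 * publishes exactly one new version.
 */
public class SinglePointGenerator {
    private final ReentrantLock leaderLock = new ReentrantLock();
    private volatile long currentVersion = 0;

    /** Returns the new version number, or -1 if another node is already generating. */
    public long generateIfLeader(Runnable loadFromDatabase) {
        if (!leaderLock.tryLock()) {
            return -1; // another generation run is in progress; skip
        }
        try {
            loadFromDatabase.run();  // the single database read
            return ++currentVersion; // publish exactly one new valid version
        } finally {
            leaderLock.unlock();
        }
    }
}
```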
In addition, different kinds of data are updated at different rates, usually driven by specific business needs. The middle platform therefore supports customizable data-generation configurations, and this flexibility brings the following benefits (a configuration sketch follows the list):
- Less unnecessary data loading: precisely controlling how often and when data is generated avoids wasting resources.
- Lower client cache-refresh pressure: fewer client-side garbage collection (GC) pauses are triggered by frequent data updates, improving client performance.
- Relieved network bandwidth pressure: optimized generation and transmission strategies reduce the bandwidth burden on the whole network while keeping data transmission efficient.
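As an illustration, here is a minimal sketch of such a per-dataset configuration with hypothetical fields; the article only states that generation frequency and timing are configurable:

```java
/**
 * Sketch of a per-dataset generation configuration (hypothetical fields).
 */
public record GenerationConfig(
        String dataType,       // which dataset this applies to
        String cron,           // schedule, e.g. Quartz-style "0 0/30 * * * ?" for every 30 minutes
        boolean incremental,   // emit deltas between full snapshots
        int fullSnapshotEvery) // publish a full snapshot every N incremental runs
{}
```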
3.2 Data Timeliness
3.2.1 Push-Pull Mechanism
To keep the whole path from data production to consumption timely and to prevent business losses caused by lagging data updates, we introduced a combined push-pull mode in the architecture. After an upstream system finishes processing data, it notifies the downstream system through message middleware, and this chain of notifications is triggered in each subsystem in turn until the data is consumed.
To improve the reliability of the process, we added configurable state-pulling logic on top of the scheduled-task middleware, so that any delay or exception along the chain can be compensated in time (a sketch follows Figure 6).
Figure 6
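A minimal sketch of the push-pull combination, with hypothetical names (Ctrip's actual message and scheduling middleware are not named here): the push path reacts to notifications immediately, while the pull path polls as a safety net, so a lost or delayed message is compensated within one polling interval.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Sketch of a consumer combining push notifications with pull compensation. */
public class PushPullConsumer {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile long localVersion = 0;

    /** Push path: invoked by the message-middleware callback. */
    public void onVersionMessage(long remoteVersion) {
        syncIfBehind(remoteVersion);
    }

    /** Pull path: scheduled compensation in case a push is delayed or lost. */
    public void startPolling(java.util.function.LongSupplier remoteVersionLookup) {
        scheduler.scheduleAtFixedRate(
                () -> syncIfBehind(remoteVersionLookup.getAsLong()),
                1, 1, TimeUnit.MINUTES);
    }

    private synchronized void syncIfBehind(long remoteVersion) {
        if (remoteVersion > localVersion) {
            // download the new BLOB (e.g. via BitTorrent) and swap the local cache
            localVersion = remoteVersion;
        }
    }
}
```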
3.2.2 Data Cloud Migration
With the continuous expansion of Ctrip's international business, our system architecture is gradually moving to a hybrid-cloud model. The traditional way to migrate basic data to the cloud is to replicate databases there; this works, but it raises costs and introduces replication lag. To meet these challenges, we instead distribute BLOB files to the different regions, avoiding the dependence on expensive database instances and cutting this cost by more than 98% while preserving data timeliness. Figure 7 shows the architecture.
Figure 7
3.3 System Robustness
3.3.1 Data Validation and Interception
Throughout the data lifecycle, whether data is maintained by the business or provided by external data providers, errors are an unavoidable concern: illegal characters, missing records, duplicates, and so on. To address this, we built a data validation mechanism that performs strict compliance checks on every field. As soon as abnormal data is detected, the system immediately alerts the data owner through multiple channels, such as TripPal (Ctrip's in-house IM system), email, and SMS, so the problem can be handled quickly.
Until the problem is resolved, the system automatically pauses updates of the affected data to prevent the spread of erroneous data. To further improve validation accuracy, we also combine statistical algorithms with AI prediction models to analyze data changes and make intelligent judgments. The architecture is shown in Figure 8 (a validation sketch follows the figure).
Figure 8
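Here is a minimal sketch of field-level validation with interception, under assumed names; alerting is stubbed out where production would notify via TripPal, email, or SMS:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/**
 * Sketch of field-level validation with interception (hypothetical names).
 * Each rule checks one aspect of a record; if any record fails, the new
 * version is blocked and an alert is raised.
 */
public class BlobValidator<T> {
    record Rule<T>(String name, Predicate<T> check) {}

    private final List<Rule<T>> rules = new ArrayList<>();

    public BlobValidator<T> addRule(String name, Predicate<T> check) {
        rules.add(new Rule<>(name, check));
        return this;
    }

    /** Returns true if the version may be published; false = intercept and alert. */
    public boolean validate(List<T> records) {
        for (T record : records) {
            for (Rule<T> rule : rules) {
                if (!rule.check().test(record)) {
                    alert(rule.name(), record); // production: TripPal / email / SMS
                    return false;               // pause updates for this dataset
                }
            }
        }
        return true;
    }

    private void alert(String ruleName, T record) {
        System.err.printf("validation failed: rule=%s record=%s%n", ruleName, record);
    }
}
```

Usage might look like `validator.addRule("cityCode not blank", c -> c.code() != null && !c.code().isBlank())` for a hypothetical city record, with publication blocked the moment any record fails.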
3.3.2 Data Rollback
In a production environment, a rollback is the quickest and most effective response to a sudden system failure. The middle platform therefore provides data rollback: a rolled-back version is treated as a perfectly normal new version, and through the platform's portal users can trace and query any historical data version by timestamp. The entire rollback requires no data-level modification to the database, which makes it faster and safer than rollback methods that rely on binlogs. See Figure 9 (a sketch follows the figure).
Figure 9
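The sketch below shows the core of this approach, reusing the hypothetical BlobVersion descriptor from 3.1.1: rollback republishes a known-good historical BLOB under a fresh version number, so the normal distribution path does all the work and no database rows are touched.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of "rollback = publish an old version as a new version" (hypothetical names). */
public class RollbackService {
    private final AtomicLong versionCounter;

    public RollbackService(long currentVersion) {
        this.versionCounter = new AtomicLong(currentVersion);
    }

    /**
     * Re-publishes the BLOB of a known-good historical version under a new
     * version number; consumers pick it up exactly like any other update.
     */
    public long rollbackTo(BlobVersion historical) {
        long newVersion = versionCounter.incrementAndGet();
        publish(historical.dataType(), historical.checksum(), newVersion);
        return newVersion;
    }

    private void publish(String dataType, String blobChecksum, long asVersion) {
        // in production: write a new version record pointing at the old BLOB file
        System.out.printf("republish %s blob %s as version %d%n", dataType, blobChecksum, asVersion);
    }
}
```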
3.4 Consumption Process Optimization
3.4.1 Unified Data Model
A data model inevitably evolves: fields are added, modified, and deleted. Implementing each change by hand-written code is not only tedious, but over time significantly increases the system's maintenance cost. To reduce the complexity of consuming data on the consumer side, the platform must support automated production and consumption of arbitrary data models, so we automate the building and deployment of the model JAR packages through scripts. The process is shown in Figure 10.
Figure 10
3.4.2 Simplified Access Methods
When a service needs to consume a given kind of data, it imports the standalone model package generated by the Figure 10 process, and can then fetch the full dataset, or a conditionally filtered one, with code like this:

```java
// Inject the client
@DataResource
private CityClient cityClient;

// Full query
List<City> cities = cityClient.queryList();

// Conditional query
List<City> matched = cityClient.queryList(cityCode);
```
3.5 Unified Data Governance
In practice, development and product teams often face a common challenge: among the wide variety of basic data, how do you find out which data is actually used in production? This is usually solved by word of mouth or by maintaining shared documents, which is not only inefficient but also prone to production problems caused by human error or stale information. The middle platform therefore catalogs all data types in its Portal: users can search by conditions such as database name, table name, or interface name, reference a data type directly if it already exists, and register a new one for their specific business needs if it does not.
We also optimized the onboarding process for new services. After the optimization, accessing the middle platform only requires adjusting the data source, with no additional development work. This significantly reduces the complexity and workload of service access and keeps the onboarding process minimal. Compared with connecting directly to the database or calling an application API, the middle platform improves data access efficiency by more than 90%.
IV. Technical Architecture Overview
Building on the above, this section describes the key technical implementation of the middle platform for Ctrip's international air ticket basic data from the macro system architecture perspective. The system is designed as a set of cooperating core modules, each with specific responsibilities, which together form the central platform for data processing and distribution:
1) DataSource module: the starting point of the data flow, responsible for the initial writing of data and for ensuring its consistency. It triggers the data processing pipeline through scheduled tasks or message notifications.
2) BlobGenerator module: focuses on the data production process, providing end-to-end services including data validation and interception, BLOB file generation, version control, and rollback.
3) BlobService module: the core of data distribution; it handles data requests from DataClient, bridges BlobGenerator and DataClient, and keeps data delivery smooth and efficient.
4) DataClient module: the consumer side of the platform, providing BitTorrent download, cache management, and precise query support to cover data usage in different scenarios.
5) DataQuery module: provides an API query interface for consumers that cannot obtain data through BitTorrent downloads, supporting advanced functions such as full data output, conditional filtering, and logical computation.
6) Dispatcher module: the scheduling and coordination center of the system; it keeps the tasks of DataSource, BlobGenerator, BlobService, and the other modules executing in order, and keeps the data processing pipeline smooth and synchronized.
Through the close collaboration of these modules, the architecture not only improves the efficiency and accuracy of data processing, but also enhances the scalability and maintainability of the system. Interface sketches of the module boundaries follow Figure 11.
Figure 11
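To make the module boundaries concrete, here is a sketch of them as Java interfaces; the module names come from the article, but the signatures are assumptions:

```java
/** Sketches of the module boundaries (hypothetical signatures). */
interface BlobGenerator {
    /** Validates, builds, and versions a BLOB for one dataset; returns its descriptor. */
    BlobVersion generate(String dataType);
}

interface BlobService {
    /** Serves version metadata and torrent seeds to DataClient instances. */
    BlobVersion latestVersion(String dataType);
}

interface DataClient {
    /** Downloads (e.g. via BitTorrent), caches, and exposes typed queries. */
    <T> java.util.List<T> queryList(Class<T> model);
}

interface DataQuery {
    /** API fallback for consumers that cannot join the peer network. */
    <T> java.util.List<T> query(Class<T> model, String condition);
}
```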
V. Results
The data middle platform covers the entire lifecycle from data generation to consumption. Systematic data governance has improved the system in terms of governance, cost control, security, and operational efficiency.
From the data producer's perspective: we sorted out and optimized the scattered business processes and migrated them into the middle platform in batches by priority. In the process we not only refactored the business flows but also, through rewriting and optimization, identified and fixed a number of previously undiscovered issues. Data distribution efficiency took a qualitative leap: the average distribution time dropped to 23 seconds, and for small datasets we achieve end-to-end delivery within 5 seconds, greatly improving data real-timeness and freshness. Overall server cost fell by more than 95%, and system maintenance cost fell by 66%.
From the data consumer's perspective: onboarding efficiency for new data sources improved by 90%, and cloud migration no longer requires special adaptation, which accelerates R&D progress and improves development efficiency. Data synchronization across different operating environments is solved, reducing dependence on the production environment. The optimized scheduling strategy eliminated more than 98% of invalid scheduled tasks and reduced GC frequency.
VI. Future Plans
In the future, we plan to carry out in-depth iteration and optimization of the data middle platform from the following key aspects:
1) Automation: further raise the automation of the data processing pipeline and reduce manual intervention to improve overall efficiency, in particular by applying large language models and similar techniques to data validation to improve data accuracy.
2) Stability: strengthen system stability to ensure the reliability of the data middle platform under high concurrency and large-scale data processing.
3) Robustness: build a more robust architecture, improving the system's fault tolerance and its ability to recover from abnormal situations on its own.
4) Timeliness: optimize the data update and distribution mechanisms to keep data real-time and fresh.
5) Visualization: present the data flow and processing pipeline intuitively through visualization, improving the readability and ease of use of the data.
- More user-friendly Portal interface: Design and implement a more user-friendly Portal interface to improve user experience and simplify user operations.
- Visualization of the processing process: Visualize the data processing process, so that users can clearly track each link of data processing.