
Let's talk about the migration of Zhihu's order system

Author: Flash Gene

This article describes how the back-end language stack of Zhihu's order system was migrated and upgraded, including the pitfalls and problems encountered along the way. Its purpose is twofold: to offer reference experience for teams migrating other application services, and to summarize my own understanding of the order system. My writing is limited and my grasp of the business may be incomplete in places, so comments and corrections are welcome.

Migration background

As Zhihu's overall technology stack evolved, the original Python stack was gradually phased out in favor of Go and Java. Stability matters far more for the trading system than for other business systems, because a failure on its core link causes not only data problems but also serious financial loss.

As the business grew, transaction scenarios became more complex, and refactoring and optimization became unavoidable. Python is pleasant to write at first, but because of its language characteristics the maintenance cost rises steadily over time. It still has great advantages in data analysis and artificial intelligence, but it no longer seemed a good fit for trading. From the perspective of the technical ecosystem, Java is the better choice for a trading system, which is why this article is about migrating the language stack of Zhihu's order system.

Another factor is that Python's GIL (global interpreter lock) prevents it from taking advantage of multiple cores, so performance is limited. The old system also suffered several availability incidents caused by the main thread hanging, which settled the decision to migrate it.

Preparation

If you want to do a good job, you must first sharpen your tools.

A language stack transformation first requires being clear about its three development activities, abbreviated MRO (Migration, Reconstruction, Optimization):

  • Migration copies the original code into a project in the new language and implements it in the engineering style of the new language. Code optimization and bug fixes should be strictly avoided during this step, because they easily introduce new problems and make the migrated code harder to verify.
  • Reconstruction (refactoring) improves the maintainability and iterability of the project code, making it more elegant and readable. It can be done after the migration is complete.
  • Optimization reduces the complexity of the project and makes its design more reasonable by adjusting module dependencies, call relationships, interface fields, and so on.

Migration is mandatory in a language stack transformation. To decide whether to also refactor or optimize, split the functionality into subtasks by module and evaluate each one separately, using as the yardstick how much direct and indirect benefit refactoring or optimizing that module at the same time would bring.

  • Benefits: the switch from the old language stack to the new one is completed, the system becomes more maintainable, and module boundaries become clearer.
  • Cost: the manpower that must be invested, the cost of parallel development during the migration, and the opportunity cost of blocking higher-value work.
  • Risk: new bugs may be introduced, and testing becomes more complex.

With risk under control, cost and benefit have to be weighed against each other. There are generally two options: the first is to freeze requirements, throw manpower at development, and go live in one step; the second is to take small steps, iterate online, and deliver in batches.


Based on this analysis, manpower was the most important constraint in this transformation, so we chose the migration-only approach, which lowers labor cost, reduces the risk of introducing bugs, and keeps the work easy to test. To avoid blocking business requirements, we delivered in small batches, with each iteration lasting at most two weeks.

Migration scenarios

After settling the delivery method, we needed to sort out the functional modules of the current system, then split and schedule the tasks. Before the migration, the Zhihu trading system served virtual-goods transactions. The transaction path was relatively short, and the process from purchase to content consumption was as follows:

  1. The user browses the product detail page.
  2. An order is generated, and the user goes to the cashier and pays.
  3. After the payment is confirmed, the order is fulfilled (delivered).
  4. The user returns to the detail page to consume the content.
  5. Specific items support a seven-day no-questions-asked refund.

At that time the order system supported few functions, and the business model and order model were not abstract enough.


After the order modules are split, how can the old and new systems be switched seamlessly? How can the switch be made transparent to the business? How can the stability of the trading system be ensured? How can losses be stopped in time if something fails? Based on the principles described above, the migration of the whole system was divided into two phases, with data storage and the data model unchanged before and after the migration.


Interface validation

At every stage of the migration the order interfaces have to be changed. From the perspective of order operations they fall into read operations and write operations, and the two kinds need different verification schemes.

Write interfaces are verified and launched through whitelist testing followed by gradual grayscale expansion, and unexpected interface exceptions are pushed to the IM tool so that we can respond in time. The main write interfaces are:

  • The interface for creating an order.
  • The interface for submitting an order and binding it to a payment order.
  • The callback interface that confirms the payment after the user pays.
  • The interface through which the user initiates a refund.
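
The whitelist-then-grayscale rollout for write interfaces can be illustrated with a minimal sketch. The class name `GrayRouter`, the 100 buckets, and bucketing by user ID are assumptions made for illustration; they are not the actual rules of Zhihu's AB platform.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical router that decides whether a write request goes to the new system. */
final class GrayRouter {
    private final Set<Long> whitelist; // test users that are always routed to the new system
    private final int grayPercent;     // share of remaining users on the new system, 0 to 100

    GrayRouter(Set<Long> whitelist, int grayPercent) {
        if (grayPercent < 0 || grayPercent > 100) {
            throw new IllegalArgumentException("grayPercent must be between 0 and 100");
        }
        this.whitelist = new HashSet<>(whitelist);
        this.grayPercent = grayPercent;
    }

    /** Returns true when this user's request should be handled by the new system. */
    boolean useNewSystem(long userId) {
        if (whitelist.contains(userId)) {
            return true;
        }
        // A stable bucket per user keeps the same user on the same system across requests.
        int bucket = (int) Math.floorMod(userId, 100L);
        return bucket < grayPercent;
    }
}
```

Raising `grayPercent` step by step expands the grayscale, and setting it back to 0 acts as a rollback for everyone except whitelisted users.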

The following figure shows the traffic configuration interface of the AB platform:

[Figure: traffic configuration interface of the AB platform]

The following figure shows some of the transaction alert notification messages:

[Figure: transaction alert notification messages]

Read operations are often accompanied by write operations. For read interfaces we use the platform's traffic recording and playback function to check interface consistency, and compare the results to find differences and troubleshoot problems. The main read interfaces are:

  • The interface for obtaining the list of payment methods.
  • The interface for obtaining the payment and fulfillment status of an order.
  • The interface for obtaining the recharge list.
  • The interface for batch-querying users' new-customer status.
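
The comparison step of record-and-playback verification might look like the following sketch. It assumes that a recorded response from the old system and the replayed response from the new system have both been parsed into flat maps; `ReplayDiff` is a hypothetical helper, not part of the platform mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

/** Hypothetical comparator for an old-system response and a replayed new-system response. */
final class ReplayDiff {
    /** Returns, in sorted order, the fields that differ or exist on one side only. */
    static List<String> diff(Map<String, Object> oldResp, Map<String, Object> newResp) {
        TreeSet<String> keys = new TreeSet<>(oldResp.keySet());
        keys.addAll(newResp.keySet());
        List<String> mismatches = new ArrayList<>();
        for (String key : keys) {
            boolean onBothSides = oldResp.containsKey(key) && newResp.containsKey(key);
            if (!onBothSides || !Objects.equals(oldResp.get(key), newResp.get(key))) {
                mismatches.add(key);
            }
        }
        return mismatches; // an empty list means the two responses are consistent
    }
}
```

In practice, fields that legitimately differ between runs, such as timestamps or trace IDs, would need to be excluded before comparing.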

The following figure shows the data dashboard of the traffic recording and playback system:

[Figure: data dashboard of the traffic recording and playback system]

Sorting out metrics

Monitoring is the "third eye" of our system: it reflects the system's health in time, raises alarms promptly, and when a failure occurs helps us analyze the problem and quickly narrow down the scope of troubleshooting. Hardware, database, and middleware monitoring are already provided at the platform layer, so only the application's own monitoring metrics need to be sorted out here.

  • Log monitoring: request logs and server-side error logs.
  • Order business metrics
    • Order amount, number of orders, and number of dropped orders
      • Period-over-period change in order volume
    • Number of first-attempt fulfillment failures
    • Number of fulfillments completed by the compensation mechanism
    • P95 latency of each notification event
    • P95 latency of successful fulfillment
    • On-time fulfillment rate and fulfillment success rate
  • Payment business metrics
    • Payment channel fulfillment delay (P95)
    • Payment fulfillment delay (P95)
  • P95 duration of the user's complete purchase process.
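
Several of the metrics above are P95 values. As a reminder of what that means, here is a small sketch using the nearest-rank definition of a percentile. In production the value would normally come from the metrics backend rather than application code, and `Percentile` is a hypothetical name.

```java
import java.util.Arrays;

final class Percentile {
    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long nearestRank(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }
}
```

For example, `Percentile.nearestRank(latencies, 95)` returns the latency that 95% of the sampled requests did not exceed.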

Availability guarantees

To guarantee a consistent SLA before and after the transformation throughout the delivery process, we first need to look at how availability is measured.

Three nines (99.9%) of availability generally corresponds to about 8.76 hours of downtime per year. Different systems and user scales have different availability requirements: the bar for edge services may be lower, but a core link must guarantee a high availability level even if its TPS is not high. How to guarantee or improve a service's SLA is what we discuss next. There are generally two influencing factors:

  • MTBF (Mean Time Between Failures): the average interval between failures of the system service.
  • MTTR (Mean Time To Recover): the average time the system service takes to recover from a failure.

In other words, we want to reduce the frequency of failures as much as possible and to recover quickly when one does occur. With these two points in mind, when transitioning the system smoothly we should test all cases thoroughly, use a grayscale rollout together with traffic recording and playback, roll back immediately when an anomaly is found, and resume the grayscale rollout only after the problem has been located and fixed.
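
The arithmetic behind these figures can be sketched as follows. The second method uses the common steady-state approximation availability = MTBF / (MTBF + MTTR), which treats MTBF as the mean uptime between failures; the class and method names are illustrative.

```java
final class Availability {
    private static final double HOURS_PER_YEAR = 365 * 24; // 8760, ignoring leap years

    /** Allowed downtime per year, in hours, for an availability such as 0.999. */
    static double annualDowntimeHours(double availability) {
        return (1.0 - availability) * HOURS_PER_YEAR;
    }

    /** Steady-state availability from mean uptime between failures and mean time to recover. */
    static double fromMtbfAndMttr(double mtbfHours, double mttrHours) {
        return mtbfHours / (mtbfHours + mttrHours);
    }
}
```

With three nines, `annualDowntimeHours(0.999)` gives roughly 8.76 hours. The second formula shows why both levers matter: availability rises either when failures become rarer (larger MTBF) or when recovery becomes faster (smaller MTTR).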

MTTR: respond quickly

Continuous monitoring

The first step in perceiving system stability is monitoring, which reflects the system's health and helps locate problems. Monitoring goes in two directions.

The first direction is metric-based monitoring: tracking points are placed in the system code to report various real-time data, which is then presented through configured reports.

  1. Machine monitoring provided by the infrastructure, plus response-stability monitoring at interface granularity.
    1. Physical resources: CPU, disk, memory, network I/O, etc.
    2. Middleware: message queues, caches, Nginx, etc.
    3. Service interfaces: HTTP interfaces, RPC interfaces, etc.
    4. Databases: number of connections, QPS, TPS, cache hit ratio, master-slave replication lag, etc.
  2. Multi-dimensional monitoring at the business data level, split into the client's and the server's perspective.
    1. From the client's perspective, monitor the server's interface success rate and the payment success rate.
    2. From the server's perspective, continuously monitor sudden changes in order volume, period-over-period changes, and the time spent at each stage of the transaction.

Both kinds of metrics are reported through the company's statsd component via tracking points in the business code, and the system's health is displayed in real time on configured Grafana dashboards.


The second direction is log-based monitoring, which relies mainly on the company's ELK log analysis platform and the Sentry exception capture platform. Sentry surfaces system alarm logs and new exceptions promptly, so the offending code can be located quickly. ELK records key logs in detail, which makes it easier to analyze the scenario in which a problem arose and to reproduce it, and so helps with the fix.

Abnormal alarms


Based on the real-time monitoring data above, alarm metrics can be configured to anticipate failure risks and send alerts in time. But what threshold should trigger an alert, and which fault level does it correspond to?

First, relatively strict alarm metrics must be defined on the golden link of the transaction, covering every step from placing the order, submitting the order, and confirming payment to fulfillment and delivery. Severity is divided, in increasing order, into Info, Warning, and Critical. Alerting channels vary by type of personnel and notification method, for example:

[Figure: alerting channels by personnel type and notification method]

A screenshot of the alert message in IM is as follows:

[Figure: screenshot of an alert message in IM]

The main alerting points for orders are as follows:

  • A core interface behaves abnormally.
  • The dropped-order rate or the order completion rate changes abruptly.
  • The time taken at any stage of the transaction increases.
  • The time from user payment to fulfillment increases.
  • The order placement success rate is too low.
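
As a rough illustration of mapping one of these alerting points to a severity level, the sketch below classifies the dropped-order rate. The thresholds of 1% and 5% and the extra `OK` level are assumptions made purely for illustration; the article does not state the real values.

```java
enum Severity { OK, INFO, WARNING, CRITICAL }

final class DropRateAlert {
    // Hypothetical thresholds; real values have to be tuned for each link of the transaction.
    private static final double WARNING_RATE = 0.01;
    private static final double CRITICAL_RATE = 0.05;

    /** Classifies the share of paid orders that were dropped in the observed window. */
    static Severity classify(long paidOrders, long droppedOrders) {
        if (paidOrders <= 0) {
            return Severity.OK; // no traffic in the window, nothing to evaluate
        }
        double rate = (double) droppedOrders / paidOrders;
        if (rate >= CRITICAL_RATE) {
            return Severity.CRITICAL;
        }
        if (rate >= WARNING_RATE) {
            return Severity.WARNING;
        }
        return rate > 0.0 ? Severity.INFO : Severity.OK;
    }
}
```

Each severity can then be bound to a notification channel, so that only the most serious levels page the on-call engineer.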

MTBF: reduce the failure rate

The monitoring, alarming, and logging systems help us find and locate problems quickly and stop losses in time. Quality improvement, discussed next, helps us reduce the incidence of failures and avoid losses in the first place. It is covered from two directions:

Standardized acceptance scheme


(1) Development is complete only when both the logic and its unit tests are done. For unit tests, line coverage is ensured first, then branch coverage. The developer then self-tests in the joint debugging environment and hands the work over to QA once it passes.

(2) QA performs functional acceptance and interface testing in parallel in the test environment, and the build is deployed to the staging environment after it passes.

(3) Functional acceptance is performed again, and must pass, in the staging environment.

(4) Grayscale delivery and double-read verification are used selectively, depending on the situation.

(5) After going live, a final regression test is required.

Unified coding conventions and multiple rounds of CR

Code generally goes through at least two code reviews (CR) before launch. For a small MR you can simply pull a colleague over for a CR at your desk, while changes of more than 100 lines need to be discussed in a review meeting. The two reviews also have different focuses.

The first review focuses on coding style, to avoid pitfalls caused by everyone writing code in their own way. Over time this settles into a relatively unified coding convention within the group, builds a basic consensus on writing stable code, and improves code quality.

The second review focuses on code logic. Note that if the task is explicitly migration-only, you should not casually optimize old logic just because it is hard to understand: without knowing the background you are very likely to write a bug and ship it (this happened several times). Migrating as-is also makes comparison-based verification possible, and optimization can follow once the verified code is live.

Only by being clear about the purpose and the process, and then following that process, can quality code be delivered faster and better.

Consistency Guarantee

Each microservice has its own database, and data consistency within a microservice is guaranteed by database transactions, which are easy to implement in Java with Spring's @Transactional annotation.

Distributed transactions across microservices, such as payment, order, and membership, use eventual consistency among the three services, similar to the two-phase commit of the TCC model. The order is created with an ID from the global ID generator, and the payment order is then created based on the order ID. If the user pays, the order changes its own state and notifies the membership microservice; the transaction ends if fulfillment succeeds, and a refund is triggered if fulfillment fails. If the user does not pay, the order system closes both the order and the payment order.
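
The order lifecycle described above can be read as a small state machine. The sketch below is one possible reading; the state names and the transition table are assumptions inferred from the description, not the actual order model.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical order states inferred from the flow described above. */
enum OrderState { CREATED, PAID, FULFILLED, REFUNDING, CLOSED }

final class OrderStateMachine {
    private static final Map<OrderState, Set<OrderState>> ALLOWED = new EnumMap<>(OrderState.class);

    static {
        // An unpaid order is either paid or closed.
        ALLOWED.put(OrderState.CREATED, EnumSet.of(OrderState.PAID, OrderState.CLOSED));
        // A paid order ends with successful fulfillment, or a refund if fulfillment fails.
        ALLOWED.put(OrderState.PAID, EnumSet.of(OrderState.FULFILLED, OrderState.REFUNDING));
        // The remaining states are terminal in this sketch.
        ALLOWED.put(OrderState.FULFILLED, EnumSet.noneOf(OrderState.class));
        ALLOWED.put(OrderState.REFUNDING, EnumSet.noneOf(OrderState.class));
        ALLOWED.put(OrderState.CLOSED, EnumSet.noneOf(OrderState.class));
    }

    /** Returns true when moving from one state to the other is a legal transition. */
    static boolean canTransit(OrderState from, OrderState to) {
        return ALLOWED.get(from).contains(to);
    }
}
```

Rejecting illegal transitions, such as paying an order that is already closed, is one way to keep a late or duplicated message from corrupting the order state.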

To back up this consistency guarantee, we did two things on the order interfaces:

Distributed locks

Payment results reach the order system through three synchronization mechanisms: listening for upstream payment messages, the payment HTTP callback, and the order system's active query of the payment result. Each of them acquires a lock on the order ID before processing, which ensures that they never process the same order concurrently.

Interface idempotency

After the lock is acquired, the order status is checked. If the order has already been processed, success is returned directly; otherwise success is returned after processing. This ensures that upstream messages are never processed twice.

Fulfillment toward the downstream uses the order ID as the idempotency key so that the same order is never fulfilled twice, and an ACK mechanism ensures that fulfillment is not sent to the downstream repeatedly.


The distributed lock is an etcd lock, and data consistency is further guaranteed by the lock's lease renewal mechanism and by unique indexes in the database.
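
The lock-then-check pattern can be sketched in a single process as follows. This is only an illustration: the in-memory `ReentrantLock` map stands in for the etcd lock, the status map stands in for the order table, and all names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

final class PaymentResultHandler {
    // In-process stand-ins for illustration; the article uses an etcd lock and the order database.
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    private final Map<String, String> orderStatus = new ConcurrentHashMap<>();
    private final AtomicInteger fulfillments = new AtomicInteger();

    void createOrder(String orderId) {
        orderStatus.put(orderId, "CREATED");
    }

    /** Shared entry point for the MQ listener, the HTTP callback, and the active query. */
    boolean onPaymentSucceeded(String orderId) {
        ReentrantLock lock = locks.computeIfAbsent(orderId, id -> new ReentrantLock());
        lock.lock();
        try {
            String status = orderStatus.get(orderId);
            if (status == null) {
                return false;               // unknown order, nothing to process
            }
            if (status.equals("PAID")) {
                return true;                // already processed: report success, do nothing
            }
            orderStatus.put(orderId, "PAID");
            fulfillments.incrementAndGet(); // trigger downstream fulfillment exactly once
            return true;
        } finally {
            lock.unlock();
        }
    }

    int fulfillmentCount() {
        return fulfillments.get();
    }
}
```

Because all three synchronization paths go through the same locked check, whichever one arrives first does the work and the later ones simply see the order as already paid.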

Although we use a variety of means to ensure the eventual consistency of the system, many factors in a distributed environment, such as network jitter, disk I/O, and database exceptions, can still interrupt our processing. For those cases we have two compensation mechanisms that resume it:

Delayed retries with a penalty mechanism

If a notification is interrupted, or no ACK is received from the downstream, the task is placed in a delay queue and retried a limited number of times, with the retry interval increasing each time. If the last retry still fails, an alarm is raised and the case is handled manually.
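
A retry schedule with increasing intervals could be computed as in the sketch below. Doubling the delay and capping it at a maximum are assumptions; the article only says that the interval increases and that the number of retries is limited.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

final class RetrySchedule {
    /** Delays before each retry: base, 2x base, 4x base, and so on, capped at maxDelay. */
    static List<Duration> delays(Duration base, Duration maxDelay, int maxRetries) {
        List<Duration> result = new ArrayList<>();
        Duration next = base;
        for (int i = 0; i < maxRetries; i++) {
            Duration delay = next.compareTo(maxDelay) > 0 ? maxDelay : next;
            result.add(delay);
            if (delay.compareTo(maxDelay) < 0) {
                next = next.multipliedBy(2); // stop growing once the cap has been reached
            }
        }
        return result;
    }
}
```

Each delay would be used as the delivery time when re-enqueueing the task in the delay queue. Once the list is exhausted, the task is handed over to alerting and manual handling.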

Scheduled tasks

In case all of the mechanisms above fail, our last line of defense is a scheduled task that periodically scans for abnormally interrupted orders and processes them. If processing still fails, an alarm is raised and the case is handled manually.
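
The selection logic of such a scan might look like this minimal sketch. The `Order` fields, the status strings, and the timeout-based criterion are illustrative assumptions; a real job would query the database and run on a scheduler.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class StuckOrderScanner {
    /** Minimal order view; the field names are hypothetical. */
    static final class Order {
        final String id;
        final String status;     // for example "PAID" or "FULFILLED"
        final Instant updatedAt;

        Order(String id, String status, Instant updatedAt) {
            this.id = id;
            this.status = status;
            this.updatedAt = updatedAt;
        }
    }

    /** Returns the IDs of orders that were paid but not fulfilled within the timeout. */
    static List<String> findStuck(List<Order> orders, Instant now, Duration timeout) {
        Instant deadline = now.minus(timeout);
        List<String> stuck = new ArrayList<>();
        for (Order order : orders) {
            if (order.status.equals("PAID") && order.updatedAt.isBefore(deadline)) {
                stuck.add(order.id);
            }
        }
        return stuck;
    }
}
```

Passing `now` in as a parameter keeps the selection deterministic and easy to test. The scheduled job would call it with `Instant.now()` and push each returned order back through the idempotent processing path.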

Post-mortem summary

Goal review

Goal 1: Unify the technology stack and reduce project maintenance costs. The target result is that the old order system is taken offline.

Goal 2: Simplify the ordering process and reduce the cost of client integration. The target result is a unified interface on the backend and an integrated SDK on the clients.

Execution plan

Overall, the migration was executed in three broad phases:

The first phase migrated the logic: HTTP requests initiated by clients were forwarded to RPC interfaces and executed by the new system. From this phase on, all new functional requirements were developed on the new system, and the old system only needed routine maintenance.

The second phase, in cooperation with the client teams, migrated and consolidated all existing Zhihu ordering scenarios behind a unified order purchase interface, while the clients provided a unified transaction SDK. The new components are more stable and easier to monitor, and they were fully launched at the end of last year after a grayscale rollout. As new client versions were released, traffic on the old interfaces dropped to a very low level, which greatly reduced the risk of the next phase.

The third phase migrated the old HTTP interfaces: the new system took over requests from all clients, exposed HTTP interfaces with the same specifications, and the switch was completed by modifying the NGINX configuration. After the third phase, the legacy system was finally taken offline.

Execution result

As of this writing, 100% of the language stack has been migrated to the new system and the old system has been taken completely offline. The work covered a total of 12 system services, 32 external HTTP interfaces, 21 RPC interfaces, and 15 back-office HTTP interfaces.

According to the HALO metrics, the average P95 latency of the interfaces dropped by about 40% after the migration, and hardware resource consumption dropped by about 20%. According to stress test results, the service capacity supported after the migration increased about tenfold.

Completing the migration is only a victory for this stage. The system still needs some minor surgery to remove its remaining defects, mainly:

  1. Keep refining the monitoring granularity and optimizing the alarm configuration to further improve service stability.
  2. The code translated literally from Python still needs continual refactoring and optimization, for which we draw on DDD design ideas.
  3. Improve the monitoring dashboards and optimize our processes through data-driven operations.
  4. Review and summarize the project, and share business knowledge within the team to deepen everyone's understanding of business details.

Problems encountered

The migration was not all smooth sailing. We ran into plenty of strange problems along the way and lost no small amount of hair over them.

Question 1: What if new requirements arrive halfway through the migration and there is not enough manpower?

In fact, much of the refactoring and optimization was left until after the migration precisely because manpower was short, and the situation did not allow requirements to be frozen. In that case the only option is to write the code twice: prioritize supporting the requirement, and migrate it later. If manpower is sufficient, one team can maintain the new system while another maintains the old one.

Question 2: I sent the request, so why is there no log?

Do not suspect the platform first; look for the problem on your own side. There were two causes: the migration points between the old and new systems were too scattered, which made grayscale control difficult, and someone forgot to flip the grayscale switch, so traffic was never routed to the new system. The lesson is to deliver and go live as quickly as possible during the migration.

Question 3: What if the company's Java infrastructure is immature and many basic platforms do not support it?

We developed components such as a distributed delay queue and distributed scheduled tasks ourselves; they are not covered in detail here.

Question 4: How is data consistency between the two systems ensured during the migration?

As mentioned earlier, only the system code was migrated and the data storage stayed the same, which means the two systems can compete over the same data. The solution is to add distributed locks around the processing and to make the interfaces idempotent. That way, even while upstream and downstream systems are synchronizing data, races are avoided and data consistency is preserved.

For synchronizing the payment result to the order system after the user pays, we combine push and pull:

(1) After the user pays, the order system actively polls for the payment result, which is an active pull.

(2) The payment system sends an MQ message that the order system listens for, which is a passive push.

(3) After a successful payment, the order system's HTTP callback is triggered, which is also a passive push.

The combination of these three mechanisms gives our data consistency a relatively strong guarantee. No system is ever 100% reliable, and the core link of transaction payment needs multiple mechanisms to ensure data consistency.

Question 5: What if a user does not receive their membership benefits after paying?

In the transaction process, order, payment, and membership are three independent services. If the order service loses the payment message, or the membership service loses the order's message, the user will not receive the membership benefits. The eventual-consistency synchronization mechanism described in the previous question can fail to deliver a message because of middleware or network failures. We therefore add another compensation mechanism: a scheduled task scans unfinished orders, actively checks their payment status, and then asks the membership service to fulfill them. This fallback strategy ensures the eventual consistency of the data.

Business takeaways

From taking over the project until now, my understanding of the order system has gradually deepened from knowing nothing about it, and I have also come to understand the current transaction business and its business architecture.

As the layer above the payment system, the transaction system itself provides product management, order acquiring, and fulfillment verification capabilities. The surrounding business subsystems focus mainly on managing their business content resources. By connecting their order acquiring and fulfillment management to the transaction system, they can reduce the complexity of business development. The acquiring process is as follows:

  1. The business customizes its product detail page and invokes the client capability from the bottom bar of that page to enter the order cashier. Here the client calls the business backend interface to get the product details, and then calls the transaction system's bottom-bar display interface to get the bottom button.
  2. After entering the cashier through the bottom button, the user selects a payment method and a coupon, then taps confirm to invoke WeChat Pay or Alipay. The interfaces for displaying the cashier and obtaining the payment parameters are provided by the transaction system.
  3. After the order backend confirms the payment, it notifies the business to fulfill the order. The user returns to the detail page and from there enters the content playback page to enjoy the benefits. Fulfillment verification is done through interface calls between the business backend and the transaction system backend.

[Figure: acquiring process]

Zhihu currently deals mainly in virtual goods, and a general transaction process is as follows:

[Figure: general transaction process]

The user browses the product, enters the cashier to place an order and pay, and then returns to the content page to consume the content. As the business develops, different transaction scenarios and processes pile on top of each other, the system starts to become complex, and the business architecture of the transaction system gradually emerges.


The order system carries the various transaction services inside and outside the Zhihu site and provides stable, reliable support for transaction scenarios. It consists mainly of the following parts:

  1. At the top is the product service layer, the interactive pages that users see, together with a unified order and payment API gateway serving those pages.
  2. Next is the order service layer, which is called by the gateway above and provides transaction service support for different scenarios.
  3. Below that is the order domain layer, which carries the core order logic: first the calculation aggregate needed for the user's purchase, then the transaction aggregate that manages the order model, and finally the delivery aggregate that handles fulfillment after purchase.
  4. The lowest level is the basic support service layer, which provides basic services and some services that transactions depend on.
  5. Finally, there are the operations services, which provide transaction-related back-office functions.

Methodological practice

All of the above comes down to the understanding and cognition of the people involved. An excellent solution or a suitable architecture is not designed up front but arrived at through iteration. The same is true of human cognition, which needs continual iterative upgrading. Among the many methodologies, the PDCA cycle distills a path along which we can improve.

  • Plan: clarify the goal of the migration, investigate the current situation, and draw up the plan.
  • Do: implement what was planned.
  • Check: summarize and analyze what went well and what went wrong.
  • Act: adjust, learn the lessons, and carry unsolved problems into the next cycle.

Often we only do the first two steps, yet the last two help a great deal. That is why a project review is so important: talking things through and exchanging views makes it easier to break out of fixed ways of thinking and to deepen one's understanding of the business.


Author: Zhiyi

Source: https://zhuanlan.zhihu.com/p/383640330
