laitimes

From nucleic acid testing to health codes, why does the system always "crash"?

As outbreaks recur, the frequent crashes of epidemic-related data platforms have drawn public attention.

"I got up at 6 a.m., got in line at 6:30, started testing at 7, and finished at 7:20." Mr. Liu, a Tianjin resident, took his nucleic acid test early in the morning as required. When he saw friends in a WeChat group complaining that the nucleic acid testing system had crashed, he felt that getting up early on the 10th had been worth it.

WeChat records show that at about 10:40 on the 10th, some Tianjin residents complained that the nucleic acid testing system crashed just as they were about to reach the front of the queue, so testing had to be postponed; the system was not restored until 11:30. "Did you queue for nothing?" Mr. Liu asked his friend. The friend said no, because he had called his family over to "take turns holding the spot for a while."

Why is the official system so "fragile"?

Beyond the recent crash of Tianjin's nucleic acid testing system, the health code, which is the most frequently used of these systems, is also the one that crashes most often.

In 2021, Shandong, Xi'an, Tianjin and other places all experienced failures. Most of the explanations disclosed afterwards pointed to congestion caused by a surge in that day's peak query volume. In Shandong, for example, the query peak last August reached 609,600 person-times per minute, an eightfold surge over the previous working day and 2.5 times last year's highest peak. In Xi'an, when user traffic to "One Code Pass" surged, visits per second reached more than 10 times the previous peak. Traffic to Yuekang Code rose abnormally to as much as 1.4 million visits per minute, exceeding the system's capacity limit and triggering its protection mechanism.
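To make the per-minute figures above easier to compare with the per-second rates engineers usually work with, the conversion can be sketched as below. This is only illustrative arithmetic on the numbers quoted in the article; the variable and function names are hypothetical, and the results are averages over a minute, not measured per-second peaks.

```python
# Figures quoted in the article (queries or visits per minute).
SHANDONG_PEAK_PER_MIN = 609_600
YUEKANG_PEAK_PER_MIN = 1_400_000


def per_second(per_minute: int) -> float:
    """Convert a per-minute rate to an average per-second rate."""
    return per_minute / 60


# Shandong's peak was reported as 2.5 times last year's highest peak,
# so the earlier peak follows by division.
shandong_earlier_peak_per_min = SHANDONG_PEAK_PER_MIN / 2.5

print(f"Shandong peak: {per_second(SHANDONG_PEAK_PER_MIN):,.0f} per second")
print(f"Yuekang peak: {per_second(YUEKANG_PEAK_PER_MIN):,.0f} per second")
print(f"Shandong earlier peak: {shandong_earlier_peak_per_min:,.0f} per minute")
```

On these numbers, Shandong's peak averages about 10,160 queries per second, the Yuekang Code peak about 23,333 visits per second, and the earlier Shandong peak works out to 243,840 queries per minute.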

CBN noted that while these places have permanent populations in the tens of millions, the traffic their systems carry is on the order of millions of visits per minute.


"Traditional approaches usually fall into two categories. One is to integrate data from multiple parties and then serve government systems from a unified data resource platform. The other is to deploy two systems, one for internal government services and one for resident-facing services. The former architecture tends to hit bottlenecks in high-concurrency, resident-facing scenarios; the latter may duplicate the construction of data resources," an industry insider told CBN. The construction of these systems involves multiple specialist vendors across the basic resource layer, the network layer, and the application layer. The visible symptom is always an access crash, but the underlying causes may differ, so it is hard to pass judgment on the systems that have crashed.

At present, most health code and nucleic acid testing systems are built by companies selected through tenders run by local big data centers, and the technology providers can be glimpsed from the operating companies' shareholders. For example, "Yuekang Code" is developed and technically maintained by Digital Guangdong Network Construction Co., Ltd., whose shareholders include China Electronics, the three major operators, and Tencent. Xi'an's "One Code Pass" is led by the Xi'an Big Data Resources Management Bureau and developed and deployed by China Telecom's Xi'an Branch. The reporter contacted a number of technology suppliers that do business with health data platforms or local big data centers, but all declined to be interviewed.

However, the reporter learned that such systems usually use distributed big data technology, with capacity and redundancy designed around the local population and commuting peaks. "The coding logic of the health code requires offline processing and integration of data such as mobile phone data from the operators, population data from public security authorities, and personal health status data from the health commission. In addition, production data with second-level latency is processed through real-time connections to user registration information from sources such as the health cloud, so that risks can be controlled quickly. The team therefore decided to use big data analysis and a real-time stream processing engine to provide technical support for this business scenario," said a person who took part in building a local health code system.

In addition, the system's full request chain needs to be load-tested end to end to make sure its reliability and stability are high enough.

At the emergency-response level, the system should have sufficient auto scaling capability. When access surges, capacity can then be expanded rapidly to meet demand, for example through a containerized design that gives the underlying infrastructure good elastic scalability.
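The idea of scaling out when access surges can be sketched with a simple proportional rule, similar in spirit to the one used by container orchestrators such as the Kubernetes Horizontal Pod Autoscaler. The article does not say which platform or rule any of these systems use; the function name, parameters, and default bounds below are assumptions for illustration.

```python
import math


def desired_replicas(current: int, load_per_replica: float,
                     target_per_replica: float,
                     min_replicas: int = 2, max_replicas: int = 200) -> int:
    """Proportional scaling rule: pick a replica count that brings each
    replica back near its target load, clamped to fixed bounds."""
    if current <= 0 or target_per_replica <= 0:
        raise ValueError("current and target_per_replica must be positive")
    wanted = math.ceil(current * load_per_replica / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical surge: 10 replicas each seeing 8x their target load.
print(desired_replicas(current=10, load_per_replica=800.0,
                       target_per_replica=100.0))
```

The upper bound matters as much as the formula: when computing resources are limited, the rule can only ask for as many replicas as the platform can actually supply.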

Beyond redundancy in access capacity, end-to-end monitoring also has to be designed in at the level of the whole system, covering indicators such as network access, health code interface access statistics, and the time of day when the daily peak occurs. These indicators not only feed the real-time dashboard but also give the team a basis for verifying the capacity design, monitoring and safeguarding stable operation, and later scaling resources up and down dynamically.
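A minimal sketch of one such indicator, the busiest minute per interface, might look like the following. The class name, interface names, and in-memory storage are hypothetical; a production system would more likely rely on a dedicated metrics pipeline.

```python
from collections import Counter
from typing import Optional, Tuple


class PeakMinuteTracker:
    """Counts requests per (interface, minute) and reports the busiest minute."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, interface: str, timestamp: float) -> None:
        """Register one request; timestamp is in seconds."""
        minute = int(timestamp // 60)
        self._counts[(interface, minute)] += 1

    def peak(self, interface: str) -> Optional[Tuple[int, int]]:
        """Return (minute_index, request_count) for the busiest minute,
        or None if the interface has no traffic. Ties go to the later minute."""
        rows = [(count, minute)
                for (name, minute), count in self._counts.items()
                if name == interface]
        if not rows:
            return None
        count, minute = max(rows)
        return minute, count


tracker = PeakMinuteTracker()
for ts in (0, 10, 59, 60, 61, 62, 63):
    tracker.record("health_code", ts)
print(tracker.peak("health_code"))
```

Tracking the peak per interface, not just in total, is what lets a team see which service is driving a surge and when.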

Even with all of this taken into account, no system is perfect, especially when computing resources are themselves limited. Industry insiders said that sound development and operations management processes are needed to support the continuous upgrading and healthy operation of a software system. For example, the overall design can set aside emergency resources and environments for key software services and data. Normally these resources can serve innovative or non-critical business. Once a temporary need arises, such as citywide nucleic acid testing, they can be redirected promptly to expand the critical services.

This is somewhat similar to how the 12306 system separates ticket booking from remaining-ticket queries and scales each subsystem on demand. By the same logic, it is not advisable to attach too many non-critical services to the critical, high-concurrency health code query path. Different services can be developed with newer approaches such as microservices and combined with technologies such as container cloud to scale dynamically on demand, while ensuring that the services do not affect one another. For example, uploading nucleic acid reports, querying nucleic acid reports, and querying health codes should be separated.
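The principle that services should not affect one another can be illustrated with a small "bulkhead" sketch, in which each service has its own concurrency budget and an overloaded service rejects its own requests without consuming capacity reserved for others. This is a single-process illustration under assumed names and limits, not a description of how any of the systems mentioned are built; in a microservice deployment the isolation would come from separate services and containers.

```python
import threading
from contextlib import contextmanager


class Bulkhead:
    """Gives each service its own concurrency budget so overload stays local."""

    def __init__(self, limits: dict) -> None:
        # One independent semaphore per service name.
        self._slots = {name: threading.BoundedSemaphore(n)
                       for name, n in limits.items()}

    @contextmanager
    def slot(self, service: str):
        sem = self._slots[service]
        # Fail fast rather than queue, so a busy service cannot pile up work.
        if not sem.acquire(blocking=False):
            raise RuntimeError(f"{service} is at capacity")
        try:
            yield
        finally:
            sem.release()


# Hypothetical budgets: the health code query path gets the largest share.
bulkhead = Bulkhead({"query_health_code": 1000,
                     "query_report": 200,
                     "upload_report": 50})

with bulkhead.slot("query_health_code"):
    pass  # handle one health code query here
```

Even if every `upload_report` slot is taken, `query_health_code` requests still find their own slots free.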

"After a critical service fails, we can quickly restore data from the backup system and bring the business back online. Even if a service has a temporary problem, the disaster recovery system lets us recover the business and its data quickly, so that people's waiting time can be cut from a day to a few minutes," the industry insider said.
