
Thinking and practice of refactoring the domestic game account login system


Background

As one of the most important applications of the game publishing platform, the account login system carries the responsibilities of account registration, login, real-name verification, anti-addiction control, privacy compliance, and risk control within the platform's application architecture. Compliance is the lifeline of the business, and account login is also the first stop of the online conversion funnel, so the system has always faced extremely high stability requirements.

Because of these stability requirements, the game publishing platform adopted a "two regions, three data centers" multi-active architecture very early on. Today, centered on the company's own data center, the same deployment also runs on a public cloud in East China and a public cloud in South China. Public cloud was chosen mainly because its rapid elasticity and pay-as-you-go model efficiently absorbed the early growth demands of the game business; South China was chosen additionally because important partners are located there.

Weighing stability, efficiency, and cost, the final result was a hybrid cloud architecture spanning the company's data center, East China public cloud A, and South China public cloud B. While a hybrid cloud brings convenience, it also brings well-known challenges, the two most obvious of which are:

First, the data architecture and the risk of leaking data stored in the cloud. At the implementation level, differences in data architecture lead to differences in the APIs that the underlying domain services can deliver, mainly in their capabilities. For example, the cloud data centers cannot support capabilities such as bilibili authorization login; their servers can only forward such requests to the company's data center for processing.

Second, differences between PaaS platforms. The company's infrastructure offers mature products for database management, KV data management, message queues, and so on, but early deployments on the public cloud could only rely on cloud-native services. A standard anti-corruption layer should have been built to shield these implementation differences, but under business schedule pressure this was left as a gap.

Because of these two challenges, plus project schedule and other factors, the application architecture evolved as shown below, and the account login system ended up with two code repositories: login-idc-api in the company's data center and login-cloud-api in the public cloud data centers.

[Figure: evolved application architecture with two code repositories, login-idc-api in the company's data center and login-cloud-api in the public clouds]

The two repositories of the game account login application have been iterated on for nearly 7 years. Although the application now feels subjectively stable, the statistics of the past year tell a different story: more than 10 version iterations were completed, each averaging nearly 4 weeks across design, development, testing, and launch, and the dual repositories added 30%-40% to the average time spent on design, development, testing, and deployment. In addition, given that the workload of aligning with the company's infrastructure products is predictable, refactoring the system had reached the point of being necessary.

Challenges

Refactoring a system with extremely high stability requirements and a long iteration history takes a lot of courage: success or failure can be decided in an instant.

  1. Importance: it carries tens of millions of MAU on an L0, non-degradable link and supports 113 SDK versions (sampling period: 2024/02/15 - 2024/03/15).
  2. Stability: the SLO of four nines must hold throughout the refactoring, so the implementation details and process have to be planned comprehensively and in detail before execution (see the availability-budget calculation after this list).
  3. Complexity: the intrinsic complexity of the business itself, 7 years of rapid development, and a large accumulation of technical debt.
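For scale, a four-nines SLO leaves only a tiny error budget. The back-of-the-envelope arithmetic below is an illustration, not a figure quoted by the team:

    (1 - 0.9999) × 365 × 24 × 60 min ≈ 52.6 min of allowed downtime per year
    52.6 min / 12 ≈ 4.4 min per month, i.e. roughly 4 minutes per 4-week release cycle

Any slip during a multi-week grayscale rollout therefore eats directly into an already small budget.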

Value

1. Efficiency: 50% increase in development efficiency and 30%-40% increase in iteration efficiency

2. Cost: Labor savings (the life cycle of the current system is expected to be at least 5 years)

3. Quality:

  • Cyclomatic complexity of the code: reduced by 40%;
  • Dependence on cross-data-center (intranet) calls removed; the 99.9th-percentile RT of core-link APIs improved by 31.5%.

4. Culture: Practice the value of "extreme execution".

Practice: How do you refactor?

Strategic Direction

Solution 1: Abandon the existing two-region, three-data-center setup and unify all account login onto the company's standard infrastructure, with no reliance on cloud vendors. As a long-term direction, moving fully back into the company's data centers fits the company's long-term strategic planning; the difficulty lies in the application architecture of the distribution platform as a whole.

Solution 2: Push the master account team to connect the data architectures of the company and the public cloud and make the domain capabilities equivalent. This carries data security risks, and considering the project benefit and the cost of cross-team work, it is not the choice with the highest ROI at present.

Solution 3: Through refactoring, make the data-architecture and PaaS differences behind the two code repositories mutually compatible, and finally unify them into a single repository.

Thought process

The premise of refactoring

Looking back at Martin Fowler's classic "Refactoring: Improving the Design of Existing Code", "refactoring" is defined both as a noun and as a verb:

Refactoring (noun): An adjustment to the internal structure of the software in order to improve its comprehensibility and reduce the cost of modification without changing the observable behavior of the software.

Refactoring (verb): Using a series of refactoring techniques to adjust the structure of the software without changing its observable behavior.

Across the two definitions, this classic highlights three key concepts: "adjustment", "structure", and "not changing the observable behavior of the software". Taken together, these three keywords amount to a guide for action. However, Fowler does not give a precise technical or business definition of "observable behavior", so interpreting it inevitably requires the practitioner's own understanding and judgment.

Taking the application architecture as the current "structure", "observable behavior" can be understood from the following perspectives:

1. System responsibilities: under a front-end/back-end separated architecture, the APIs delivered for these responsibilities are one class of observable behavior (a minimal characterization-test sketch for pinning such API behavior follows this list);

2. System dependencies are a class of observable behaviors that include two major parts:

  • bilibili account domain services, game publishing domain services, and third-party services (e.g., the Geetest captcha provider)
  • PaaS infrastructure (the company's database, KV, and message queue platforms; public cloud MySQL, Redis, Kafka, etc.)

3. Horizontal services of the system. In the layered architecture the application is positioned as a BFF at the application-service layer and should not have much horizontal impact, but during past rapid evolution it took on some domain-service responsibilities (for example, querying game information, querying active games, querying user information), so these are summarized as horizontal impact.
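One way to make "not changing the observable behavior" enforceable is to pin the delivered APIs with characterization (golden-master) tests before refactoring and replay them against the refactored deployment. The sketch below is illustrative only; the endpoint, query parameter, and expected payload are hypothetical and not taken from the actual login system.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import org.junit.jupiter.api.Test;

    // Characterization test: records today's observable behavior of a delivered API
    // so the refactored service can be checked against the same expectation.
    class LoginApiCharacterizationTest {

        private static final String TEST_HOST = "https://login.test.example.com"; // test domain name (placeholder)

        @Test
        void userInfoContractIsUnchanged() throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(TEST_HOST + "/api/v1/user/info?access_key=test-key"))
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            assertEquals(200, response.statusCode());
            // The "golden" body is captured from the current system before refactoring.
            String golden = "{\"code\":0,\"data\":{\"uid\":10001,\"realname_verified\":true}}";
            assertEquals(golden, response.body());
        }
    }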

There are all sorts of differences

Under the current application architecture, the characteristics of the two repositories in terms of logical modules and in-application layering (the classic J2EE three-layer model) are shown in the following figure:

[Figure: logical modules and in-application layering (Controller / Service / DAO) of the two repositories]

With the constraint of keeping behavior unchanged, trying to unify the two repositories by aligning their code differences looks hopeless: the Service layer alone depends on more than 80 bilibili account service APIs, and the change risk is enormous, to say nothing of the different technology choices and usage constraints around JDBC ORMs and Redis. Such risk, combined with the amount of work required, would doom the proposal. Viewed statically, by module and by in-application layering, this huge difference simply cannot be erased by brute force.

Disappearing complexity

The turning point comes from a fact about the application at runtime: the production environment has gone through multiple high-availability cut-overs. This not only honors the promise of multi-site active-active made to the SDK upstream, it has also been verified many times in production. In short, in the Controller and Service layers, the two layers with the most complex business logic and the largest accumulation of technical debt, the two repositories take different paths but deliver the same observable behavior.

It follows that the biggest source of cognitive load in the refactoring can simply be set aside: these two layers do not need to be compared at all. Even for the two most important infrastructure concerns, there is no need to examine the logical differences between RedisTemplate/Jedis, key naming, JdbcTemplate/MyBatis, or SQL, because they are only internal details of how Controller and Service happen to be implemented.

Focus in the chaos

SDK API link: with the complexity of Controller and Service out of the picture, the remaining observable behavior that must stay unchanged is concentrated in the DAO layer. The essence of the DAO differences is the hybrid cloud deployment (data architecture and PaaS differences), so the implementation strategy is clear: at the DAO layer, decide how to call the dependent bilibili account APIs based on the runtime ZONE. Following the hexagonal architecture, isolating external dependencies from business logic decouples the core business logic from external APIs and PaaS, reducing the blast radius of changes (a minimal port/adapter sketch follows the figure below).

[Figure: hexagonal-architecture view, with the DAO layer isolating external APIs and PaaS from the core business logic]
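A minimal sketch of the ZONE-based port/adapter idea in Java. All identifiers here (AccountAuthPort, AccountRpcClient, IdcHttpForwarder, the forwarded path, the "idc" zone value) are hypothetical illustrations, not names from the actual codebase; the point is that Controller and Service code depends only on the port, and the runtime zone decides the adapter at startup.

    // Hypothetical result and client types, shown only to keep the sketch compilable.
    record AuthResult(long mid, boolean ok) {}

    interface AccountRpcClient { AuthResult verifyToken(String accessKey); }                 // direct RPC inside the IDC
    interface IdcHttpForwarder { <T> T post(String path, Object body, Class<T> type); }      // HTTP forward from the cloud

    // Port: what Controller/Service code needs from the bilibili account domain.
    interface AccountAuthPort { AuthResult verifyToken(String accessKey); }

    // Adapter for the company IDC zone: call the account domain service directly.
    class IdcAccountAuthAdapter implements AccountAuthPort {
        private final AccountRpcClient rpc;
        IdcAccountAuthAdapter(AccountRpcClient rpc) { this.rpc = rpc; }
        public AuthResult verifyToken(String accessKey) { return rpc.verifyToken(accessKey); }
    }

    // Adapter for public cloud zones: forward the request back to the IDC.
    class CloudForwardingAuthAdapter implements AccountAuthPort {
        private final IdcHttpForwarder forwarder;
        CloudForwardingAuthAdapter(IdcHttpForwarder forwarder) { this.forwarder = forwarder; }
        public AuthResult verifyToken(String accessKey) {
            return forwarder.post("/internal/account/verify", accessKey, AuthResult.class);
        }
    }

    // The adapter is picked once at startup from the runtime zone (for example an environment
    // variable), so the same Controller/Service code runs unchanged in every data center.
    class AccountAuthPortFactory {
        static AccountAuthPort create(String zone, AccountRpcClient rpc, IdcHttpForwarder fwd) {
            return "idc".equalsIgnoreCase(zone)
                    ? new IdcAccountAuthAdapter(rpc)
                    : new CloudForwardingAuthAdapter(fwd);
        }
    }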

The PaaS differences have already been absorbed in the Service layer (or in a small number of Controllers), so there is no need to worry about logical differences; only the version compatibility of the frameworks and databases needs attention.

Domain and notification API links: although not part of the core, these links also reflect the hybrid cloud data-architecture differences. Production calls are concentrated in the company's data center, so the public cloud applications, which carry a "smaller" set of capabilities, converge toward the company-side application.

Forwarding API link: the divergence here also stems from the hybrid cloud data architecture and cannot be resolved in the current convergence scheme, so the existing forwarding strategy is kept for now; migrating the forwarding to the API gateway is planned for the future.

Concrete implementation

[Figure: target deployment across the company's data center and the public cloud A/B availability zones]

The new deployment is based on the original repository from the company's data center and is rolled out to availability zones in public clouds A and B, keeping the zone selection consistent with the current two-region, three-data-center layout. The DB schema is identical across the three data centers, so existing DB instances are reused directly. For Redis, to prevent the grayscale rollout from contaminating the data of the online application, a separate Redis cluster is deployed for isolation; once grayscale is complete, the original application and its Redis cluster are taken offline together.
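For the Redis isolation described above, a minimal Spring configuration sketch is shown below, assuming hypothetical property names (login.redis.gray.host / login.redis.gray.port); the team's actual setup may differ. The refactored application points only at the newly applied-for cluster, so a rollback never touches keys written by the original application.

    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
    import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
    import org.springframework.data.redis.core.StringRedisTemplate;

    // Dedicated Redis cluster for the refactored (grayscale) deployment.
    @Configuration
    public class GrayRedisConfig {

        @Bean
        public LettuceConnectionFactory grayRedisConnectionFactory(
                @Value("${login.redis.gray.host}") String host,
                @Value("${login.redis.gray.port}") int port) {
            // Points at the newly applied-for cluster, fully isolated from the
            // cluster used by the original application.
            return new LettuceConnectionFactory(new RedisStandaloneConfiguration(host, port));
        }

        @Bean
        public StringRedisTemplate grayRedisTemplate(LettuceConnectionFactory grayRedisConnectionFactory) {
            return new StringRedisTemplate(grayRedisConnectionFactory);
        }
    }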

Data validation

Product perspective

  1. Based on the user dimension, verify and compare data writes and queries between the new and old clusters;
  2. Based on the game dimension, verify and compare data writes and queries between the new and old clusters;
  3. In the production environment, run regression verification of the core SDK products against the test domain name, covering all use cases;
  4. For other non-SDK APIs, rely on full unit-test coverage plus API-level acceptance.

Data perspective

  1. Database: use a batch job to compare the contents of the source and target data sources (see the comparison-job sketch after this list);
  2. Cache: a pure-cache scenario; the deployment is isolated during grayscale and does not affect business logic after rollback, and cache misses fall back to a direct remote call;
  3. Tracking/reporting: observe the per-game reporting trend during the grayscale process and compare it against historical data.
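For the database comparison in point 1, one straightforward shape for the job is to read the same slice of data from the source and target data sources and diff it. Below is a minimal sketch assuming two JDBC DataSources and a hypothetical table name; a real job would page through the table and compare row contents or checksums, not just counts.

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import javax.sql.DataSource;

    // Compares the source and target data sources for one table.
    // The table name below is a hypothetical placeholder.
    public class TableDiffJob {

        public boolean rowCountsMatch(DataSource source, DataSource target) throws Exception {
            long src = count(source, "SELECT COUNT(*) FROM user_login_record");
            long dst = count(target, "SELECT COUNT(*) FROM user_login_record");
            System.out.printf("source=%d target=%d%n", src, dst);
            return src == dst;
        }

        private long count(DataSource ds, String sql) throws Exception {
            try (Connection c = ds.getConnection();
                 Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery(sql)) {
                rs.next();
                return rs.getLong(1);
            }
        }
    }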

Release plan

The release plan strictly follows the company's safe-production requirements: grayscale, observable, recoverable.

Grayscale

The grayscale process is divided into two major steps:

Step 1

Deploy the grayscale release in the company's data center, shifting production traffic over in batches; roll back immediately if any anomaly is found.

Step 2

In public cloud A, configure SLB rules and release at the rule level, ordering the rollout by domain name, rule importance, and API volume (calls per week). Repeat this step for the public cloud B deployment.
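SLB rule syntax is vendor-specific, so the sketch below only illustrates the decision those rules encode: for one domain/path rule, a configurable percentage of requests goes to the new cluster, and rolling back is simply setting that percentage to zero. Names and numbers are illustrative, not taken from the actual release.

    import java.util.concurrent.ThreadLocalRandom;

    // Illustrates the rule-level grayscale decision that the SLB rules encode:
    // per API rule, send `percentToNew` percent of matching traffic to the new cluster.
    public class GrayTrafficRule {

        private final String pathPrefix;      // e.g. "/api/v1/login" (illustrative)
        private volatile int percentToNew;    // 0 = full rollback, 100 = fully switched

        public GrayTrafficRule(String pathPrefix, int percentToNew) {
            this.pathPrefix = pathPrefix;
            this.percentToNew = percentToNew;
        }

        public String upstreamFor(String requestPath) {
            if (!requestPath.startsWith(pathPrefix)) {
                return "old-cluster";
            }
            return ThreadLocalRandom.current().nextInt(100) < percentToNew
                    ? "new-cluster"
                    : "old-cluster";
        }

        public void setPercentToNew(int percentToNew) {
            this.percentToNew = percentToNew;
        }
    }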

Observable

The main dimensions of observation: business and SLO, logs, and performance

Observation 1: Traffic distribution and success rate of the original public cloud A application after the traffic cut


Observation 2: Traffic distribution and success rate of the new public cloud A application after the traffic cut


Observation 3: Error-code distribution of the new public cloud A application


Observation 4: API performance of the new public cloud A application


Observation 5: JVM performance of the new public cloud A application


Recoverable

1. The company's data center application is deployed via grayscale; if an anomaly is encountered, it is rolled back immediately and no dirty data is produced.

2. The new public cloud A/B applications are released in batches through SLB rules; if an exception occurs, traffic is switched back to the corresponding original data center cluster, and the dirty data written to Redis does not affect business logic after rollback.

3. A separate Redis instance is applied for in the new public cloud zone to isolate the data of the different code repositories and different key naming styles, so that dirty data cannot affect the original zone's services during a rollback.

Resources

  • "Practice of Special Construction of Safety Production at Station B"
  • 《重构 改善既有代码的设计》Martin Flower
  • 《Complexity Has to Live Somewhere》

Author: Rich

Source: WeChat public account "Bilibili Technology"

Source: https://mp.weixin.qq.com/s/ZL1fsBQUlCRKE0nEQm8Msg
