Author | Bai Sijian
This article was compiled by Geek Time from a talk by Bai Sijian, a technical expert on the Huawei Cloud platform, at the QCon+ Case Study Club: "Debt Measurement and Transformation Practice for Tens-of-Millions-of-Lines Legacy Systems".
Let's talk about large-scale legacy systems today. What characterizes the legacy systems of large companies? First, the amount of existing code is huge; second, the technology stack is unusually broad; and finally, the architecture has evolved over a long time. "Ancestral" legacy systems 10 to 20 years old are very common.
Ancestral code
The architecture of a legacy system usually looks like the figure above: it runs end to end, and overall its functionality is fairly stable. However, the technology stack is outdated and under-evolved, the documentation is stale or lost and disconnected from the real system, and the original designers may have left the company or moved into management. Any change can set off a chain reaction that destabilizes the entire system. Faced with such a system, most people become cautious, and engineers would rather rewrite it than transform it.
However, in terms of cost, transformation has a higher short-term cost than rewriting (analyzing problems, formulating strategies) but a much lower long-term cost (a rewrite loses changes and specifications, which leads to customer complaints). Given the risk cost, I prefer transforming legacy systems to rewriting them. A rewrite faces double delivery pressure from ongoing requirements; undocumented specifications are easily lost, and recovering them in the testing phase takes a long time. Engineers easily fall into a technology obsession, and the benefits promised before the rewrite are often hard to deliver. Replacing the architecture and programming language also destabilizes the team: the architect usually picks the language he likes, programmers pick the languages they are good at, and the language choice alone is difficult homework. Typically the choices are Java, Scala, or Go for application development; Rust or C++ for high-performance systems; and TS/JS, Python, or Lua for glue scripting.
If you are a confident architect who takes on the more challenging path of transforming a legacy system, you should first know what bad debts it carries; these are its technical debts. You have to figure out not only what the debts are, but also their root causes. Architecture and code represent the engineer's understanding of the architecture, code, business, and environment at the moment of writing; as technology advances and the environment changes, technical debt arises naturally. This is benign, or at least unavoidable, debt. Another type of debt is the product of compromises made during rapid iteration in pursuit of business value; if there is no investment in refactoring, or the team lacks a refactoring culture, then processes and tools need to be improved. In my experience, reducing debt needs volunteers like Lei Feng (the Chinese model of selfless service), and the team culture must ensure that these volunteers do not lose out.
Characteristics of legacy systems of different sizes
Scale system
Different scales pose different problems and call for different methods. If you manage a single microservice of around 20K lines of code, I believe you can quickly bring it to its best state through personal ability alone. If you are a tech lead managing about 10 microservices and 200K lines of code, personal ability is no longer enough. You need to establish mechanisms in the team, such as a daily collective code review, which addresses day-to-day management, skill growth, and the review of design ideas and every commit, and avoids the embarrassment of a wrong design slipping into a release. Those who engage deeply in reviews and often give others constructive review comments are the next tech leads worth cultivating. Believe me, collective review is the most effective way to manage a small team.
If you are a technical expert managing 100 microservices with about 2 million lines of code, personal ability and code review alone are no longer enough. The team behind 100 microservices is generally 100+ people, and a team that size needs a good set of practices: from architecture design (modeling, keeping design and implementation aligned), to requirements decomposition (parallelization, tasking), to code submission (continuous integration, small commits, daily review), and finally to testing methods (TDD, BDD, a unified test framework).
But I want to focus on managing 500 microservices, roughly tens of millions of lines of code. To manage such an organization and have its services evolve the way you intend, you need to combine all of the practices above: individual capability, team atmosphere, good practices, and applied methodology. The difficulty is that none of this can be rigid; it must be tailored to the team, turned into team consensus through influence, and given a landing strategy, which I will illustrate with examples below.
Further up, managing 5,000 or so microservices, teams often span geographies. I have never managed a project of that magnitude, so I won't comment on it.
Strategies for dealing with legacy systems of different sizes
Small-scale strategies
Senior engineer
I personally think a senior engineer managing a single microservice can transform it very well, as long as they are familiar with design patterns and can apply them skillfully.
Microservice development may not even offer opportunities to use complex design patterns: in the cloud-native field, the modular design of monolithic applications has been decoupled into microservices and infrastructure, even reaching Serverless. For cloud-native development, you just need to focus on writing your business code.
Static scanning tools can sweep out bad code, because bad code follows rules; good code has no fixed pattern, and the most reliable way to identify good code today is still human judgment. A good industry tool is Codota, a free AI auto-completion tool that integrates with IntelliJ IDEA; as you type, it suggests best-matching code snippets and also supports snippet retrieval. In addition, reading Clean Code, Clean Architecture, Effective Java, Design Patterns, and similar books can help you improve your taste in code.
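As a toy illustration of "bad code has rules", here is a sketch of a static check that flags over-long functions using Python's `ast` module. The 30-line threshold is my own illustrative choice, not a rule from the talk; real scanners encode many such rules.

```python
import ast

MAX_LINES = 30  # illustrative threshold; tune to your team's standard


def long_functions(source: str, max_lines: int = MAX_LINES):
    """Return (name, line_count) for each function exceeding max_lines."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = node.end_lineno - node.lineno + 1
            if length > max_lines:
                offenders.append((node.name, length))
    return offenders
```

A rule like this is mechanical, which is exactly why tools can catch bad code; no equivalent mechanical rule certifies code as good.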
Technical Leader
If you manage 10 services, how do you make sure those 10 services fulfill your architectural design intent? When I was the team's technical leader, I insisted on a code review every day. Code review is an excellent instrument: whose code is TDD, who has done refactoring, who has ideas; it is all self-evident. The premise is that the review must not be just the leader talking while the developers zone out. In my view, code review is the golden 30 minutes of each day.
Code review has many good practices. For example, ask everyone to walk through their own code before the review and think about how to present it logically, so that it is easy for others to follow; this shortens review time. Sometimes I write code in the morning, and by the afternoon I have forgotten why I wrote it that way.
Secondly, we put requirements on review time: if you review daily, each session generally stays under half an hour. The first step is to walk through the commits and their messages, so others quickly know what I did today. Then quickly scan each line of code, and when common or recurring problems come up while browsing, we discuss them in the review.
Finally, reviews let us find design problems as early as possible and choose a more appropriate design. Once the code has been fully written, it is too late to discover a design problem; the technical debt has already formed. So code review is a process of learning and of mutual improvement. We often discover opinion leaders in code reviews, and these opinion leaders are the backup tech leads of our team.
Beyond reviews, we also bring good practices into the team, such as TDD. How do we land TDD? We ask that new requirements use TDD where possible; it is nice-to-have, not must-have, so whenever someone uses TDD, we encourage them in the review.
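A minimal sketch of what "using TDD on a new requirement" might look like: the test class would be written first, then the function grown just enough to make it pass. The pricing rule, names, and numbers below are invented purely for illustration.

```python
import unittest


def discount(price: float, is_member: bool) -> float:
    """Members get 10% off; negative prices are rejected.
    (Hypothetical business rule, grown test-first.)"""
    if price < 0:
        raise ValueError("price must be non-negative")
    return price * 0.9 if is_member else price


class DiscountTest(unittest.TestCase):
    # In TDD, these tests exist (and fail) before discount() is written.
    def test_member_gets_ten_percent_off(self):
        self.assertAlmostEqual(discount(100.0, is_member=True), 90.0)

    def test_non_member_pays_full_price(self):
        self.assertAlmostEqual(discount(100.0, is_member=False), 100.0)

    def test_negative_price_rejected(self):
        with self.assertRaises(ValueError):
            discount(-1.0, is_member=True)
```

Running `python -m unittest` on a file like this is the kind of artifact that makes TDD visible in a daily review: the tests document the requirement.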
DDD has become hot through microservices over the past two years, and this year we also ran some DDD practices: Event Storming modeling and DDD-based microservice projects. DDD is a concept, a body of theory; to get everyone applying DDD in real projects, you need a DDD demo project and a pioneer team willing to try. I think that is what a good team atmosphere looks like.
Million-level code scale strategy
Technical experts
If you are a technical expert managing 100 microservices and want to manage such a large team, you have to do all of the practices mentioned earlier. They accumulate; in theory you cannot skip small-team management experience and jump straight to chief architect. It is normal for some tech leads not to endorse TDD or DDD. In our team, different tech leads' choices are respected; we never apply one size fits all.
Beyond that, I think metrics matter even more when managing this many microservices. For example, measure the team's architecture debt and code debt, and convert the debt into time through a formula, so that team A's technical debt is 30 minutes while team B's is 3 days. With a metrics dashboard, we can see whether debt is growing or being burned down day by day; and if the data has more dimensions, it can feed a maturity assessment, such as Level 3 or a 3-star team.
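As a hedged sketch of such a debt-to-time formula (in the spirit of SQALE-style remediation cost): the issue categories and per-issue minutes below are invented for illustration, and a real model would calibrate them against the team's own data.

```python
# Hypothetical remediation costs per issue, in minutes.
REMEDIATION_MINUTES = {
    "duplicated_block": 20,
    "cyclic_dependency": 120,
    "missing_test": 15,
    "long_method": 30,
}


def debt_in_minutes(issues: dict) -> int:
    """issues maps category -> count; returns total remediation minutes."""
    return sum(REMEDIATION_MINUTES[k] * n for k, n in issues.items())


def human_readable(minutes: int) -> str:
    """Render minutes as days/hours/minutes, assuming an 8-hour workday."""
    days, rem = divmod(minutes, 8 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours}h {mins}m"
```

Under these made-up weights, a team with two missing tests carries 30 minutes of debt, while a team with a dozen cyclic dependencies carries 3 working days; that is the kind of "team A vs. team B" comparison a dashboard can show.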
The other tool is the backlog, which is not just a list of things to remember: we usually invest a backlog for the team to refactor or to write tests. If you simply ask a team or an engineer to add tests and refactor after the code is written, they generally won't, because it earns them nothing. I believe refactoring and test writing are also coding and output, so give the team a backlog and a fixed percentage for refactoring and developer testing. Developer testing includes UT, FT, AT, and so on, as well as some performance and integration tests.
When I first encountered Agile 10 years ago, I felt it was a management tool. Later I was exposed to some extreme programming practices and some great practitioners, and after working with those people for a year, I finally came to recognize Agile. Dual-mode companies like ours are rare; "dual-mode" means IT plus CT. My feeling is that the IT field is better suited to following a plan, while the CT field is better suited to responding to change.
Let me tell you a story. A friend of mine makes hardware, and every quarter he has to make a trip to send his designs to the workshop to produce a batch of prototypes. He worried every time, because any design problem in a prototype would mean big trouble. A development model like this must be fully designed, fully verified, and fully tested before going live.
The CT field is better suited to Agile, especially the Internet. I have also worked in the Internet industry, where the appeal is to trial-and-error quickly and allow gray (canary) releases: if feedback is good, increase investment; if feedback is lukewarm, take it offline. So the software field is better suited to microservices and to Agile.
The second is CI/CD and DevOps. Many teams have CI/CD pipelines; they can show you "look how our pipeline runs", but click into it and you will find plenty of problems. For example, when should a microservice be split? Many people say "we split it at modeling time; the architect analyzed it". Years of work experience have taught me that microservices are not split apart; they are evolved.
I think the best time to split a service is during integration. A service that takes too long to integrate drags down the whole team's pace, and improvement is needed. If it takes a developer more than 30 minutes to integrate code once, and the team is two-pizza sized, a full round of integration takes so long that nobody can leave work on time.
When optimizing the tests and the integration process still cannot get you under 30 minutes, you can analyze how to split the service, and then split the team accordingly following Conway's law. That is my way of identifying which services should be split.
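The decision rule above can be sketched as a trivial check; the service data, field names, and numbers below are hypothetical, and only the 30-minute budget comes from the talk.

```python
INTEGRATION_BUDGET_MIN = 30  # the per-developer integration budget from the talk


def should_consider_split(avg_integration_min: float, already_optimized: bool) -> bool:
    """A service is a split candidate only when it is still over budget
    AFTER test/build optimization has been attempted."""
    return already_optimized and avg_integration_min > INTEGRATION_BUDGET_MIN


# Hypothetical services: (avg integration minutes, optimization attempted?)
services = {
    "billing": (45, True),   # optimized, still slow -> split candidate
    "search": (45, False),   # slow, but optimize first before splitting
    "auth": (10, True),      # within budget -> leave alone
}
candidates = [name for name, (t, opt) in services.items()
              if should_consider_split(t, opt)]
```

The point of the ordering is that splitting is the last resort: optimization comes first, and only a persistently over-budget service triggers a service split and, per Conway's law, a matching team split.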
Tens of millions of code scale strategies
Software chief engineer
When you want to transform tens of millions of lines of code, usually about 20 million, the team may not even be in the same region. How do you manage that many people and services? If there is no technical regulation covering the vast majority of the technology stack, the difficulty grows exponentially. In this scenario, whether you are a chief engineer or a senior technical expert, you first have to build a common programming framework, shared infrastructure, common components, and the ability to identify debt.
An important idea in Building Evolutionary Architectures is the fitness function, a term borrowed from genetic algorithms. Imagine you had a god's-eye view: how would you make sure the whole world evolves iteratively according to your ideas? The world is random; all we can do is define fitness functions, let the fit survive, and eliminate the unfit: survival of the fittest. If you can write a set of fitness functions that embody the evolutionary intent of your architecture and code, you can use them to steer the entire landscape.
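The idea can be made concrete as a small sketch: a hypothetical CI check that treats "application code must not depend on infrastructure directly" as an architectural fitness function. The layer names and dependency edges below are invented for illustration, not taken from any real codebase.

```python
# Architectural intent, expressed as forbidden dependency edges.
FORBIDDEN = {("app", "infra")}  # app code must go through abstract ports


def fitness(dependencies):
    """dependencies: iterable of (from_layer, to_layer) edges extracted
    from the codebase. Returns the violating edges; an empty list means
    the architecture still matches the stated evolutionary intent."""
    return [edge for edge in dependencies if edge in FORBIDDEN]


# Hypothetical edges scanned from a build graph or import analysis.
edges = [("app", "domain"), ("domain", "ports"),
         ("infra", "ports"), ("app", "infra")]
violations = fitness(edges)
# A CI gate would fail the build whenever violations is non-empty.
```

Run on every commit, a check like this is what lets "survival of the fittest" operate on code: designs that violate the intent never make it into a release.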
The other is Clean Architecture, Uncle Bob's methodology (https://blog.cleancoder.com/uncle-bob/2012/08/13/the-clean-architecture.html). Our architecture evolution references Clean Architecture, but we made a variant that fits us better: Clean Architecture Plus.
During the transformation, architecture evolution inevitably hits scenarios where the infrastructure is incompatible. How do you restructure the architecture? The first step is to abstract the infrastructure, such as MQ, DB, DFS, and Cache, into a layer of interfaces that is thin but important. The second step is to have application development depend on these new interfaces, which decouples applications from infrastructure; since applications no longer care what the infrastructure is, our services become more Serverless and tests become easier to write. The third step is for the infrastructure team to implement against those APIs. Replacing infrastructure is inevitable during architecture restructuring; it should be invisible to applications, with adaptation done inside the infrastructure implementation, and Bazel allows building on demand. Throughout the process, the interface serves as the contract, and contract-testing practice keeps the interfaces and contracts under control. We also adapted contract testing to our situation, for example using a git repo instead of a service broker, accessing it during development and build via git sparse-checkout, and wrapping the cumbersome operations in a unified toolset, so that the infrastructure architecture and the application architecture satisfy dependency inversion and separated permission control. If you are a kind architect, a unified programming-framework runtime is also essential.
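The first two steps above can be sketched as follows, assuming a hypothetical `MessageQueue` port; the class and method names are my own illustration, not Huawei's actual framework.

```python
from abc import ABC, abstractmethod


class MessageQueue(ABC):
    """Step 1: a thin but important abstraction over the MQ infrastructure."""

    @abstractmethod
    def publish(self, topic: str, payload: bytes) -> None: ...


class InMemoryQueue(MessageQueue):
    """Step 3 stand-in: the infrastructure team would supply Kafka-, Pulsar-,
    or other broker-backed implementations behind this same interface, so a
    broker swap stays invisible to applications."""

    def __init__(self):
        self.messages = []

    def publish(self, topic: str, payload: bytes) -> None:
        self.messages.append((topic, payload))


class OrderService:
    """Step 2: application code depends only on the MessageQueue interface
    (dependency inversion), never on a concrete broker client."""

    def __init__(self, mq: MessageQueue):
        self._mq = mq

    def place_order(self, order_id: str) -> None:
        self._mq.publish("orders", order_id.encode())
```

Because `OrderService` sees only the port, tests can inject `InMemoryQueue`, which is exactly why the decoupling makes tests easier to write.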
A pitfall-avoidance guide for putting practices into action
People often complain that technology A is bad or practice B is bad, and the picture below comes to my mind:
Seller show and buyer show
This is the classic "seller's show" versus "buyer's show". I have seen too many teams turn Agile into mere management and code review into criticism sessions; rather than learning the essence, they swap in the concept. My experience: if you want to learn something, the best and fastest way is to work alongside the people who do it well.
What is the essence of Agile? I think it is the four values of the Agile Manifesto. If you are not careful while learning Agile, you end up with the buyer's show, and then curse: this theory is flawed! The theory is wrong! For example, many people now debate: is TDD dead? Is design dead? I still believe that for anything, someone can play it extremely well, and for any good thing, someone can execute it so badly that the whole team's form falls apart.
Many team leaders tell me that Agile is a scam: that it is just a set of process motions, that it plays no role in legacy-system transformation and is merely a management device. Many people think daily code review is a waste of time: "how could we do it every day? We just submit the code to a committer for review." These views differ, but only by practicing in a good team can you truly appreciate the essence and do reviews well.
I think the lived experience is more valuable than the practice itself. How is a best practice born? Mostly by taking an existing practice as a template, combining it with the team's knowledge and current status, iterating countless times, and folding in everyone's improvement suggestions from reviews. Software development cannot be fully replaced by AI (within 20 years) precisely because there is no silver bullet.
Summary
1. Characteristics of legacy systems of different sizes
The amount of code is particularly large
The technology stack is particularly comprehensive
The evolution time of the architecture is relatively long
2. Coping Strategies
Small scale - be familiar with design patterns and apply them skillfully; code review
Million-line code scale - metrics, measuring the team's architecture debt, Agile
Tens-of-millions-line code scale - debt statistics, Clean Code, Clean Architecture
Learn the essence of Agile
About the Author
Bai Sijian
Huawei Cloud Platform Technology Specialist
Captain of the Software Special Operations Team, author of "Fighting in the World of 0 and 1" in the Huawei People newspaper, and manager of a product-line platform at the tens-of-millions-of-lines code scale.