
Ansible's practice in Zhihu Big Data


This article mainly explains the design and selection of the O&M architecture, as well as Ansible's practice in Zhihu big data.

1 Background

At present, big data O&M can be roughly divided into three directions:

  • Buy commercial products, such as CDH.
  • Continue to maintain the legacy Apache Ambari (the community has stopped maintaining it, so the cost of keeping it running keeps growing).
  • Use an O&M framework such as Ansible, Chef, or Puppet to build an O&M platform, or develop one entirely in-house.

The development of Zhihu's big data operation and maintenance is divided into the following three stages: single-point, semi-automatic, and engineering. It is currently in the engineering stage.

  1. Single-point stage: a large number of semi-automated shell scripts plus scattered SaltStack (salt) states. The main problems at this stage were:
    1. Most changes, such as scaling and configuration changes, were one-off work and could not be made idempotent.
    2. Without engineering discipline, the code was full of personal style, which made it hard for others to participate in joint maintenance and created single points of failure.
  2. Semi-automated stage: we started refactoring the legacy shell scripts with Ansible. This solved the idempotency problem, but derivative problems appeared, such as slow deployment and an unreasonable architecture design.
  3. Engineering stage: engineering and efficiency were finally combined. Because maintaining the Hadoop ecosystem revolves around configuration and services, we urgently needed an engineered, multi-person collaborative architecture that solves both O&M efficiency and the single-point problem while meeting Zhihu's specific big data needs.

2 Objectives

Zhihu's objectives in the engineering phase are as follows:

  • Fast compatibility: Compatible with big data Apache Hadoop ecosystem O&M in a short time;
  • Historical Takeover: Takeover of legacy scripts;
  • Engineering: Convenient configuration management and multi-person collaboration;
  • Reduced O&M mental cost: the complexity of development and use should not be higher than that of the legacy shell scripts.

3 Big data O&M architecture

3.1 Architecture Overview

O&M automation is our long-term goal, but blindly pursuing automation makes it easy to overlook real business requirements, such as the ease of use and stability of scenarios like automatic deployment of back-end or big data architectures, configuration changes, service changes, and cross-data-center migration.

Business is the litmus test for O&M quality. What kind of architecture design is development- and operation-friendly and business-responsible? Here are a few key takeaways:

  • Architecture independence: an excellent O&M system must be able to deploy and test on its own and keep its dependencies decoupled.
  • Deployment friendliness: being deployment-friendly requires a deep understanding of the relationship between applications and services, an assessment of potential risks, and careful selection of environments and dependencies.
  • Low O&M mental cost: during O&M development, keep the operator's mental burden in mind; aim for idempotent, protective, collaborative development, and invest heavily in stability work in exchange for safer changes.
  • Data observability: every system must be designed to be observable from the start in order to iterate effectively; this includes, but is not limited to, monitoring, stability, scalability, and collaboration.

3.2 Framework Selection

Based on the above, we need to be compatible with the previous big data O&M system in a short period of time, solve the problems left over from the past, and ensure engineering, multi-person collaboration and efficiency, while reducing the mental cost of O&M.

We compared the following operational tools:

|  | Puppet | Chef | Ansible | CM & CDH | Ambari |
| --- | --- | --- | --- | --- | --- |
| Development language | Ruby | Ruby | Python | Python & hybrid | Java |
| Secondary development | Supported (open source) | Supported (open source) | Supported (open source) | Commercial support | Supported (community maintenance is unstable) |
| O&M DSL | Puppet DSL | Chef DSL | YAML | WebUI | WebUI |
| Installation complexity | Medium | Medium | Low | High | High |
| Architecture | Master-slave, with agent | Master-slave, with agent | Master-slave, agentless | Master-slave, with agent | Master-slave, with agent |
| Communication & encryption | HTTP, SSL | HTTP, SSL | SSH, OpenSSH | HTTP, SSL | HTTP, SSL |
| Configuration delivery | PULL | PULL | PUSH & PULL | PUSH & PULL | PUSH & PULL |
| Community & documentation | Abundant | Abundant | Abundant | Commercial support | Community maintenance is unstable |
| Learning cost | High | High | Low | Medium (commercial support) | High (unstable community maintenance) |

(Note: The community launched the Ambari Resurrection Project in early 2023)

The comparison shows that Ansible has a low learning cost, meets the need for low-cost and fast compatibility, and its agentless (SSH) architecture is inherently well suited to fast O&M. So, in the end, we chose Ansible for automated big data O&M management.

3.3 Introduction to Ansible

Ansible is an open-source automation tool written in Python that is very scalable and customizable, and is designed with the following in mind:

  • Inventory: a host inventory where the hosts to be maintained and their key variables are stored for easy management;
  • Modules: building blocks for developing specific O&M capabilities, such as copying, unarchiving, and configuration updates;
  • Roles: a way to chain multiple modules together, making dependencies and reuse easy;
  • Playbooks: orchestration for complex workflows;
  • Two execution modes, ad-hoc and playbook, so that multi-step processes can be automated with one click and individual steps can be debugged quickly (see the sketch after this list);
  • Built-in idempotency: a simple call to a built-in module gives you an idempotent operation, safe to run repeatedly;
  • A very low learning curve: no programming background is required, only YAML;
  • No agent needs to be installed on the managed nodes;
  • A wealth of built-in optimization strategies; performance is not on par with agent-based tools, but it easily handles clusters with thousands of nodes;
  • Support for almost all operating systems and cloud platforms.
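
As a quick illustration of the two execution modes and the built-in idempotency, here is a minimal sketch; the host group and path are hypothetical:

# Ad-hoc mode, for quick one-off debugging:
#   ansible datanodes -i inventory.ini -m ping
# Playbook mode: the same idempotent built-in modules, orchestrated in YAML.
- name: Minimal playbook sketch
  hosts: datanodes                # hypothetical host group
  tasks:
    - name: Ensure the deployment directory exists (idempotent, safe to re-run)
      ansible.builtin.file:
        path: /opt/hadoop         # hypothetical path
        state: directory
        mode: "0755"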

3.4 Ansible Architecture Design

The architecture is divided into four parts:

  • Inventory design;
  • Reasonable use of Modules and Plugins to extend functionality;
  • Playbooks that follow the one-click principle and use tags to organize the process flexibly;
  • Roles designed to be atomic and dependency-driven.

3.4.1 Inventory Design

Inventory is usually treated as a feature for managing the host list. But it is more than a simple list: it can serve as a textual topology description of a cluster. Once you adopt this idea, you can clearly see the programs, versions, key variable configurations, and other information in the cluster.

The community generally recommends combining dynamic and static inventories (integrating with a CMDB), and multiple inventories can be configured. At our current scale, however, the static mode is enough. Ansible's flexible inventory interface also makes it easy to switch later.

The internal design of Inventory is divided into two parts, the Global Critical Configuration section and the Host Group section.

  1. Global Key Configuration: Store some core global key variables, such as version information, key configuration information, key directory information, and some important mode variables in engineering design.
  2. Host groups: for example, Hadoop DataNode and NodeManager define multiple groups sized 1, 10, 500, and 1,000 machines, which has the advantage that changes can be rolled out group by group (this causes some configuration redundancy, which can be alleviated by connecting a dynamic inventory), and each host group also carries its own exclusive group variables.

Through the use of the global key configuration part and the host group part, as well as the global key variables and host group variables, you can quickly understand the current status of the cluster topology, reduce the cost of topology environment learning, and implement most changes, such as scaling, key configuration adjustments, and version upgrades, by modifying the inventory.
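
A minimal static Inventory sketch under this design might look as follows; the group names, host names, and variables are hypothetical, not Zhihu's real topology:

# Global key configuration
[all:vars]
hadoop_version=3.2.1
hadoop_conf_dir=/etc/hadoop/conf

# Host groups, sized so that changes can be rolled out group by group
[hadoop_worker_small]
dn-[001:010].example.com

[hadoop_worker_large]
dn-[011:500].example.com

# Exclusive host group variables
[hadoop_worker_small:vars]
yarn_nm_memory_mb=65536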

3.4.2 Extensions

Ansible provides Modules and Plugins to complement and extend functionality.

Custom Modules complement Ansible's built-in modules, allowing users to implement specific functions such as managing files and users, installing software, or creating cloud resources. Their implementation is not limited to a particular language: Python, Bash, Perl, PowerShell scripts, and so on all work.

Plugins are components that extend Ansible itself, including connection plugins, variable plugins, callback plugins, and so on. They are pluggable units of code that change Ansible's behavior and extend its functionality. Three types are commonly used: Inventory Plugins, Connection Plugins, and Callback Plugins. Plugins are often used for advanced automation tasks, such as customizing notifications during deployment or extracting information from different data sources. (Note: Ansible provides more than 10 types of plugins; only the 3 commonly used ones are introduced here.)

One more point: how do you decide when to use Modules and when to use Plugins?

Ansible ships with efficient built-in modules such as copy, template, and unarchive. If, on top of these, you need to call an external interface to fetch data or send a one-off notification, a custom Module is the right supplement.

If, after the whole playbook has run, you want to collect data, compute statistics, or call back into other operations, Plugins are the right way to extend.

Therefore, in this part we use custom Modules for functions such as WeCom notifications and fetching data from specific service interfaces, while on the Plugins side we only use the Callback part. Plugins such as Vars, Lookup, and Action have a high barrier to use, and improper use easily backfires.
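
As an illustration of the Module side, a custom notification module might be invoked from a playbook like this; the module name wecom_notify and its parameters are hypothetical, not the actual internal implementation:

- name: Notify the on-call group after a change (sketch)
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Send a WeCom message via a custom module
      wecom_notify:                        # hypothetical module placed under library/
        webhook: "{{ wecom_webhook_url }}" # hypothetical variable
        message: "Hadoop worker deployment finished"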

3.4.3 One-click operation

Playbook orchestration is the first test of the operational kernel design. It is a way to treat the system configuration as a target state, rather than a series of actions that must be performed. In other words, we only need to describe the final state of the system, not tell Ansible which commands need to be executed.

Playbooks promote modularity and reusability. Each playbook can contain one or more roles or tasks, and each role can contain multiple modules, which greatly increases flexibility. Tasks can be distinguished by tags, which not only lets a single playbook complete all operations with one click but also enables fine-grained control of the process.

However, with so many relationships and so much nesting, it is easy to over-engineer or design redundantly. For example, when deploying Hadoop's DataNode and NodeManager, we used to put operations of the same type into one YAML file, which produced a lot of noise and a large number of skipped tasks at execution time and increased the mental burden. So we went back to the original statement: "we only need to describe the final state of the system, not what commands to execute".

We then realized that the directory creation, variable configuration, and service start/stop logic we had written directly into playbooks in the early days should not live there; it should be wrapped in Roles. For example, we need roles such as "jdk, jmx_agent, execute_account, hadoop_pkg, datanode_conf, serv_state_check", which can then be organized and reused like methods in software development.

In the end, the playbook simply orchestrates these roles, much like a main routine calling methods in ordinary programming. A sketch follows.
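
A hedged sketch of such a playbook, composing the roles mentioned above and using tags for fine-grained control; the host group and role parameters are illustrative:

# hadoop_worker_deploy.yml (sketch)
- name: Deploy Hadoop worker nodes
  hosts: hadoop_worker_small              # hypothetical host group from the Inventory
  roles:
    - { role: jdk,              tags: [base] }
    - { role: execute_account,  tags: [base] }
    - { role: hadoop_pkg,       tags: [pkg] }
    - { role: datanode_conf,    tags: [conf] }
    - { role: serv_state_check, tags: [check] }

# Run everything:        ansible-playbook hadoop_worker_deploy.yml -i inventory_abc.ini
# Only push new configs: ansible-playbook hadoop_worker_deploy.yml -i inventory_abc.ini --tags conf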

3.4.4 Role Design

The core idea behind Roles is to break down system configurations into reusable components to improve maintainability and reusability of code. By breaking down complex system configuration tasks into multiple independent reusable roles, each role focuses on completing a specific function.

At the same time, Roles are just dynamic scripts. Unlike heavyweight frameworks such as Spring Boot in the Java world, which impose many conventions and restrictions, each Role merely contains a set of related tasks and variables, so across multiple environments it is easy to run into circular reuse. How do we avoid that?

  1. Determine the scope of a role: When designing a role, be clear about its scope and limit it to a specific task or functional area as much as possible. This ensures that the Role contains only the necessary tasks and variables, avoiding unnecessary reuse.
  2. Avoid overly abstract naming: To make Roles more generic and reusable, overly abstract naming is sometimes used, such as common or base. However, this abstract naming can cause Roles to become complex and difficult to understand. Therefore, it's a good idea to use a more specific and unambiguous naming to better express the scope and purpose of Roles.
  3. Use dependencies properly: sometimes several Roles share common tasks and variables. To avoid duplicating that code, put the shared tasks and variables in a separate Role and declare it as a dependency of the other Roles.
  4. Flexible use of inheritance and inclusion for Roles: Ansible's inheritance and inclusion features help avoid circular reuse. You can abstract commonly used tasks and variables into a base Role and then include or inherit it in other Roles instead of rewriting the code (see the sketch after this list). When using these features, make sure the scope of the base Role is clear so that it does not introduce unnecessary complexity.
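
A minimal sketch of declaring such a dependency through a role's meta file; the role and variable names are hypothetical:

# roles/datanode_conf/meta/main.yml (sketch)
dependencies:
  - role: hadoop_pkg              # shared base role, pulled in automatically
    vars:
      hadoop_version: "3.2.1"     # illustrative parameter passed to the dependency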

At Zhihu, the design of Roles has gone through three iterations: from the "straightforward command approach", to the "type-division approach", and finally to the "atomic method approach". They are introduced below:

3.4.4.1 Straightforward command approach

In the early days we used the straightforward command approach. For example, the layout below is a role that directly operates the HDFS Router, and it has many problems with reusability, dependencies, and compatibility. Because the logic is written straight through, many external variables are pulled in to implement it, and if those external dependencies change, the Role breaks. Roles like this are usually written to meet an immediate need, without much design consideration.

# Straightforward command approach
├── hdfs
│   └── router
│       ├── files
│       │   ├── ***
│       ├── tasks
│       │   └── main.yml
│       └── templates
│           ├── ***
│           └── ***           

3.4.4.2 Type-division approach

The mid-term type-division approach provides a framework for modularizing Roles: installation packages, configurations, and services are distinguished by variables, which improves reusability to some degree. But it does not really solve the problem; instead it adds process variables and increases the complexity of managing the functionality.

A design like this cannot cope with more complex logic. Worse, it tempts you into thinking that logical variable A maps to the installation package, logical variable B maps to the configuration, and that such variables can be used freely in playbooks, which quickly runs into another embarrassing problem of large Ansible projects: repeated variable overriding.

# Type-division approach
grafana
├── tasks
│   ├── cfg_file_deploy.yml
│   ├── main.yml (when cfg, when pkg, when serv)
│   ├── pkg_deploy.yml
│   └── serv_make.yml
└── templates
    ├── ***
    └── ***           

This repeated variable overriding plagued us for a long time while iterating on Zhihu's big data O&M kernel. In ordinary development in Java or Go, when a piece of logic reassigns a variable, the new value takes effect within the current scope. In Ansible, however, when a variable is assigned in several places, scope is not respected in the same way: the effective value is whichever assignment wins by precedence, in practice often the last one applied, which leads to incorrect deployments.

Therefore, we have summarized the following points:

  1. Standardize Roles: think of a Role as a method in programming. A method needs input and output parameters, so define the variables a Role requires as explicit inputs: default values, input validation, parameter types, and the ability to be received under a different variable name when passed in (see the sketch after this list);
  2. Standardize scope: use variable scopes to control the lifecycle of variables and avoid overriding, for example host_vars or group_vars, so that each host or group has its own independent variable space;
  3. Express dependencies: if a variable depends on the value of another variable, make that dependency explicit in the Role (for example, derive it in the Role's defaults) instead of relying on ad-hoc assignments, which easily break when the upstream variable is undefined.
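
A minimal sketch of treating a Role's variables as method parameters, with defaults and input validation; the role, variable, and template names are hypothetical:

# roles/datanode_conf/tasks/main.yml (sketch)
# Default values such as datanode_conf_dir live in roles/datanode_conf/defaults/main.yml.
- name: Validate input parameters before doing any work
  ansible.builtin.assert:
    that:
      - datanode_data_dirs is defined          # required "input parameter"
      - datanode_data_dirs | length > 0
    fail_msg: "datanode_data_dirs must be a non-empty list"

- name: Render the DataNode configuration
  ansible.builtin.template:
    src: hdfs-site.xml.j2                      # hypothetical template
    dest: "{{ datanode_conf_dir }}/hdfs-site.xml"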

3.4.4.3 Atomic method approach

The current "atomic method idea" sacrifices a little engineering convenience and increases the complexity of some engineering directories, in exchange for a large number of atomic Roles, the simple understanding is that we atomically split the details that need to be done, and then they can be passed to each other with parameters, and constrained by Roles-level dependencies, and in the playbook, you only need to combine Roles, and you don't need to care about the implementation details of Roles. So it's refreshing to understand that "we only need to describe the final state of the system, not what commands we want to execute."

# Atomic method approach
alluxio
├── alluxio_conf
│   └── ***
├── alluxio_master_serv
│   └── ***
├── alluxio_pkg
│   └── ***
├── alluxio_worker_cache
│   └── ***
└── alluxio_worker_serv
    └── ***           
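
A hedged sketch of combining these atomic roles in a playbook; the host group and parameter are illustrative:

- name: Deploy Alluxio workers (sketch)
  hosts: alluxio_workers            # hypothetical host group
  roles:
    - role: alluxio_pkg
    - role: alluxio_conf
    - role: alluxio_worker_cache
      vars:
        cache_size_gb: 200          # illustrative parameter passed into the role
    - role: alluxio_worker_serv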

3.5 Hadoop ecosystem O&M instances

3.5.1 Hadoop O&M Overview

Zhihu's big data ecosystem uses Apache Hadoop. In the early single-point shell stage, simple scenarios could be handled quickly, but large-scale and complex scenarios were hard to cope with, and the stability of the scripts themselves was hard to guarantee.

The O&M of the Apache Hadoop ecosystem is relatively complex, mainly because the big data ecosystem is composed of multiple components, and each component has complex dependencies and configuration parameters. In Zhihu, all big data-related components have been migrated to the cloud (bare metal), and there are a lot of customization requirements at the deployment level, so we are facing the following problems:

  1. Complex deployment: The basic installation of Apache Hadoop clusters already involves the configuration management of multiple components, and Zhihu storage and computing have implemented a federated architecture, which greatly increases the difficulty of deployment.
  2. Complex and diverse configurations: Zhihu's big data cluster has multiple environments and cross-computer room scenarios, and the complexity of testing, grayscale, and multi-cluster configuration has risen to a level that is difficult to maintain.
  3. Difficult to store and manage data: The Apache Hadoop ecosystem supports the storage and management of massive amounts of data, but it also introduces complexity in data storage and management. Because it is necessary to connect with multiple different forms of services, the pressure on multiple storage master nodes has exceeded the recommended value of the community, which has brought great problems to O&M changes and stability.
  4. Long change cycle: Because of the Apache Hadoop ecosystem used by Zhihu Big Data, components such as HDFS, YARN, Spark, Flink, Presto, Hive, scheduling, and synchronization are all customized, which brings great difficulties in scenarios such as version updates, function expansions, and bug fixes.
  5. Business continuity and reliability issues: Data processing and analysis tasks in the Apache Hadoop ecosystem often involve business-critical tasks and need to ensure the continuity and reliability of tasks.

3.5.2 Comparison of Hadoop O&M practices

Next, we list several deployment, configuration, and collaboration scenarios in Hadoop O&M to compare Zhihu's early scripts, the current Ansible engineering, and commercial CDH at the development and operations levels.

3.5.2.1 Legacy shell O&M methods

The following issues occur when you pass parameters to historical scripts for deployment:

  • The deployment process has no control over scope or services; every run starts from scratch;
  • Unable to roll deployment;
  • It is difficult to observe the deployed service.

3.5.2.2 Engineered Ansible O&M

Submit a Merge Request to the project to perform O&M changes as follows:

  • Select the roles and playbooks that have been designed;
  • Modify the Inventory to configure the number of newly added nodes;
  • MR review and wait for the code to be merged, and then execute the playbook to complete the scale-out.

Our change is made in the Inventory, and it is managed with Git for version control and review (a sketch of such a change follows). This makes it easy for others to review it or hook it into other GitOps tooling, so everyone can clearly see what is being changed and what the risk of the change is.
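
A hedged sketch of what such a scale-out change to the Inventory might look like; the host and group names are hypothetical:

# inventory_abc.ini -- scale-out change under review (sketch)
[hadoop_worker_small]
dn-001.example.com
dn-002.example.com
# the two hosts below are newly added in this merge request
dn-003.example.com
dn-004.example.com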


From the perspective of an O&M developer, there is a large set of related Hadoop Roles plus specific playbooks; as mentioned earlier, "we only need to describe the final state of the system, not what commands to execute".

What finally gets executed is a logical command, i.e. what to do and on which cluster; the Ansible command is "ansible-playbook hadoop_worker_deploy.yml -i inventory_abc.ini".


Note that every change exists as code, and we use code review for collaborative O&M development.

We also have a WebUI: if you do not want to worry about the details for the moment, just select what needs to be done and click RUN, and Ansible plus the current framework design will take care of stability for you.


Yes, you read that right: this scale-out process is the Hadoop worker (DataNode, NodeManager, ***) deployment process, and it performs the relevant operations on the specified machines according to our changes and host groups. Thanks to the idempotent design, only content whose state differs is changed: existing content is pre-checked and then acted on only as needed. A lot of maintenance work shares this same execution logic, and although the executed commands differ in detail, each workflow can be wrapped as a button in the WebUI.

3.5.2.3 Commercial CDH O&M mode

CDH management is done by CM, which can easily manage Hadoop clusters, update configurations, monitor, scale out, and more. To scale out Hadoop HDFS using CDH, perform the following steps:

  • On the CDH management interface, select Add New Hosts to add the new node to the cluster.
  • Run hdfs dfsadmin -refreshNodes on the NameNode;
  • Start the DataNode service on the new node. This can be done by selecting Add a new node on the CM management interface and starting the service.

This is a typical scale-out process. CM's design and management functions are clearly already rich, and everything can be done with simple web operations.


3.5.3 Summary of the Hadoop O&M comparison

According to the comparison of the above basic O&M functions, it can be clearly seen that:

  • The historical shell scripts are single-purpose, cannot handle complex scenarios, and carry a high mental cost for every change.
  • With Ansible, thanks to the architecture design, both basic O&M module development and production service changes go through code review. As with scripts, you still need to understand a change before daring to make it, but what you are understanding here is an engineered architecture that is the output of multi-person collaborative development and testing and whose change stability has been verified, so changes can be completed automatically with one click;
  • The commercial product CDH, as the screenshots show, presents every process in an observable way, which is very helpful for changes. But this is both an advantage and a disadvantage: if something goes wrong, you may need commercial support, or commercial training and dedicated staff.

In short, the engineered practice is a qualitative leap over the historical shell scripting approach. Compared with CDH it may still be far behind in product polish such as functional stability and the WebUI, but for the highly customized needs inside the company, Ansible gets the job done efficiently and cost-effectively, which is why we chose it.

4 Ansible Tuning

4.1 Efficient idempotency

Another great thing about Ansible is its built-in idempotency. A simple call to a built-in module fulfils both the functional requirement and idempotency, so tasks can be run repeatedly without producing inconsistent results.

The principle of idempotency is based on the state checking mechanism inside each module. Here are a few examples of implementations in the source code:

Idempotency of the file module:

1. First, os.stat is used to obtain the status of the target file, including its size, permission mode, owner, group, and other attributes;
2. Then, based on the attributes specified in the parameters, the module checks whether the file's state and attributes already meet the requirements; if not, they need to be modified;
3. Using functions such as os.chmod, os.chown, and os.utime, the module modifies the file's permission mode, owner, group, timestamps, and other attributes until they meet the specified requirements;
4. Finally, the method returns a boolean indicating whether the file's state and attributes now meet the requirements.

Idempotency of the user module:

1. If the user or group already exists, no action is taken; otherwise the user or group is created;
2. For an existing user or group, Ansible checks its attributes (such as username, UID, and GID) and modifies them as needed to keep them consistent;
3. For a user or group that needs to be created, Ansible uses system commands to create it and set its attributes.

What all these implementations have in common is that they perform multiple checks before and after execution.

The problem is that if you need to create 100 directories or 100 users repeatedly, relying on the default idempotency checks is inefficient. Every check goes through the multi-layer call stack "ansible yaml -> python -> c", the number of checks varies by module, and the check time grows when a module has many tasks to perform. The community suggests speeding things up with batch modules, async + poll, and loops, but these do not solve the root problem of reasonably avoiding redundant idempotency detection. So once a project reaches a certain scale, it pays to do some targeted pre-checking, driven by the actual requirement, to skip redundant idempotency checks.

Take the earlier scenario of creating 100 users: applying the community's optimization suggestions only complicates things. Instead, do some pre-detection before the heavy logic: first collect the current host's user information, diff it against the list of users to be created, and pass only the missing entries to the subsequent logic, which greatly improves the efficiency of repeated runs (a sketch follows). The same applies to other modules, and you can even implement idempotency tailored to a specific requirement.
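
A minimal sketch of this pre-check idea for the user scenario, assuming a hypothetical user list; it gathers existing users once, diffs, and loops only over the missing ones:

- name: Create only the users that do not exist yet (sketch)
  hosts: all
  vars:
    wanted_users: [hdfs, yarn, spark]          # hypothetical user list
  tasks:
    - name: Collect existing users once via getent
      ansible.builtin.getent:
        database: passwd

    - name: Diff the wanted list against what already exists
      ansible.builtin.set_fact:
        missing_users: "{{ wanted_users | difference(ansible_facts.getent_passwd.keys() | list) }}"

    - name: Create only the missing users
      ansible.builtin.user:
        name: "{{ item }}"
        state: present
      loop: "{{ missing_users }}"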

4.2 Reduce the cost of operation and maintenance

As with other O&M tools, once the project grows large it is easy to fall into complexity hell, which makes the mental cost of every change very high.

What kind of design concept can reduce the cost of operation and maintenance?

Changes should be one-click O&M operations; that is what effectively reduces the cost of O&M. In the IaC (Infrastructure as Code) world there is a framework, Terraform, whose elegant design achieves exactly this state.

Briefly, Terraform's design works like this: every result of an O&M deployment (clusters, services, variables, configurations, and so on) is saved in a state, and each change is a state diff. If the number of cluster nodes grows or two configuration files are added, those diffs are reflected in the state file, and only the services or hosts whose state differs are changed, while fault tolerance, idempotency, and the rest are guaranteed by the framework. This is also the biggest difference between imperative Ansible and declarative Terraform: with Terraform you only declare the final desired state of the architecture, and the tool does the rest.

In day-to-day Terraform operation, every change (provisioning, destroying, or modifying resources) is a single command; because state-diff detection is used, it can be executed many times without worrying about errors, which truly frees operators from the mental cost.

By contrast, the current Ansible O&M architecture genuinely cannot make incremental changes based on state diffs the way Terraform does (the difference between imperative and declarative tools). What it can do is get close: we have matched the architecture design, planned the variable layout, and unified the Role development conventions, all moving in that direction.

This is indeed a good direction, but we still have a long way to go and many things to iterate on: how to use the retry mechanism to retry automatically and sensibly when a run is interrupted mid-execution; how to combine dynamic and static Inventory modes to plan clusters reasonably; and how idempotency can avoid large numbers of redundant checks. All of this should keep progressing toward one-click completion.

4.3 Efficiency breakthrough

Common optimizations include concurrent execution, reducing SSH connections (ssh_args, pipelining, SSH ControlMaster), caching plugins (cache_plugin), and asynchronous execution (async, poll). I will not go into detail about how to use them, because such material is easy to find; instead, here are the items that brought us the biggest efficiency gains.

  • Project-local core configuration: place a custom ansible.cfg in the project to set optimization parameters; the default inventory, SSH tuning, plugin paths, key variables, and so on can all be configured in the project-level cfg. Customizing it is recommended, because it also helps you understand the project better (see the sketch after this list);
  • Unified Python environment: when using pipelining, it is best to unify the Python environment and the SSH permissions of the O&M account; in our tests this reduced SSH connection overhead by at least 30%;
  • SSH session persistence: use control_path to persist SSH socket files, which likewise reduces SSH connection cost; note that the directory holding the sockets must be created in advance;
  • Use vars_plugins with caution: here is a big pitfall we hit. The purpose of vars_plugins is to fetch variables quickly, but a custom vars_plugin hooks into the playbook's main process, where exceptions are silently swallowed in the source code; if the custom implementation is not handled properly it affects every playbook run, and in serious cases logic is silently lost;
  • Callback plugins: implement task execution time statistics in callback_plugins so that efficiency can be reviewed at the end of each run; this makes it easy to locate slow logic and provides data observability;
  • Fact gathering: when collecting system information with gather_subset, collect selectively, for example only network or hardware facts; the default full collection is particularly time-consuming. Here is a reference:
- name: Gather hardware info
  setup:
    gather_subset:
      - '!all'
      - '!any'
      - hardware           
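
A hedged sketch of a project-local ansible.cfg covering the SSH-related items above; the values are illustrative, not our production settings:

# ansible.cfg in the project root (sketch)
[defaults]
inventory = inventory_abc.ini
forks = 50
callback_whitelist = profile_tasks            # built-in timing callback; custom ones live in callback_plugins/

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
control_path_dir = ~/.ansible/cp              # create this directory in advance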

5 Results and Prospects

After the iteration and practice in the engineering stage, the results are as follows:

  • Support for cross-data-center O&M;
  • Deploy a PoC-scale (10~100 node) big data cluster within 1 hour;
  • Roll out a change across all clusters in 4~7 days with no (or very low) business impact;
  • Configuration-only updates take 5 minutes across thousands of nodes;
  • Full-process deployment takes 70~100 s per node;
  • Half a day of reading the documentation is enough to start development;
  • 70+ functional Roles, 30+ fixed and 10+ dynamic process Playbooks.

Future Prospects:

This article has focused on the design and efficiency of the O&M architecture kernel. Going forward, we will keep improving ease of use and stability, simplify WebUI operations, and continue working toward one-click completion.

Author: Chen Dacang

Source: https://zhuanlan.zhihu.com/p/617731670
