Platform Team: Automate infrastructure requirements gathering

By bridging the gap between development and operations teams with automated resource specifications, you can create a more harmonious and efficient deployment process.

Translated from Platform Teams: Automate Infrastructure Requirement Gathering, by Rak Siva.

One of the most challenging issues in application development is the disconnect between the development team and the operations team. Communication problems easily lead to mismatched expectations and failed deployments. The most critical and fragile point of communication between the two teams is infrastructure requirements, a problem that for years seemed unsolvable.

But now there's finally a solution to this communication gap: automation to simplify infrastructure requirements gathering.

Communication challenges

Platform engineering teams often face difficulties in gathering accurate requirements from the development team about their applications. Developers are often unaware of the specific infrastructure information needed and may provide incomplete or incorrect data.

Of course, there are also more extreme cases of "throwing it over the fence". Once developers have finished building the application logic, they hand it off to the platform team, which is left struggling to figure out what infrastructure, configuration, and permissions are needed to run it reliably, securely, and efficiently in the cloud.

Poor communication of infrastructure requirements can lead to infrastructure drift, where the infrastructure deployed no longer matches the actual needs of the application. This drift can lead to application failures, resulting in stressful deployment days, late-night troubleshooting sessions, and dreaded war rooms.

Infrastructure drift and its consequences

Infrastructure drift occurs when the actual state of the infrastructure deviates from the expected state defined in the Infrastructure as Code (IaC) scripts. Given the challenges of communicating infrastructure requirements manually, it's no surprise that teams experience drift. Common causes include:

Manual changes: Developers or operations teams may make manual changes to the infrastructure without updating the IaC scripts.
Inconsistent updates: Updates to applications may not be reflected in the infrastructure configuration.
Lack of communication: Developers may not be able to effectively communicate new requirements or changes to the operations team.
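
To make the definition concrete, drift can be viewed as a diff between the state declared in IaC and the state observed in the cloud. The following is a minimal sketch under that assumption; the `detect_drift` helper, the resource names, and the settings are all hypothetical and stand in for whatever your IaC tool and cloud APIs actually report.

```python
from typing import Any


def detect_drift(expected: dict[str, dict[str, Any]],
                 actual: dict[str, dict[str, Any]]) -> dict[str, list[str]]:
    """Compare declared resources against observed ones (illustrative only)."""
    report: dict[str, list[str]] = {"missing": [], "unmanaged": [], "changed": []}
    for name, config in expected.items():
        if name not in actual:
            report["missing"].append(name)    # declared in IaC but not deployed
        elif actual[name] != config:
            report["changed"].append(name)    # deployed with different settings
    for name in actual:
        if name not in expected:
            report["unmanaged"].append(name)  # created by hand, outside IaC
    return report


# Hypothetical example data
expected = {
    "reports-bucket": {"versioning": True},
    "updated-topic": {},
}
actual = {
    "reports-bucket": {"versioning": False},  # someone changed this manually
    "debug-queue": {},                        # someone added this manually
}

drift = detect_drift(expected, actual)
print(drift)
```

Each of the three causes listed above shows up as a non-empty entry in one of the report's categories.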

The consequences of infrastructure drift are severe:

Deployment failure: A mismatch in the configuration can cause the deployment to fail, resulting in application downtime.
Increased stress: Operations teams often have to deal with last-minute fixes, resulting in long working hours and high stress levels.
Reduced trust: Frequent deployment issues can erode trust between development and operations teams, making future deployments more difficult.
Higher costs: Infrastructure drift is expensive: revenue lost to downtime, extra spending on misconfigured resources, increased labor costs to resolve issues, and security vulnerabilities that need to be fixed.

Because this drift stems from the difficulty of communicating infrastructure needs, the solution lies in automation. Let me introduce the concept of a resource specification, which automatically communicates an application's runtime requirements to the operations team.

Solution: Automated resource specification

Imagine a system that can infer the required infrastructure resources directly from the application code. The system generates a resource specification that acts as a real-time document detailing the runtime requirements of the application. You can then use this specification to automatically configure your infrastructure to ensure that the resources deployed exactly match the needs of your application.
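
One way such inference could work is for resource declarations in the application code to register themselves, so that tooling can read the requirements back out. The toy sketch below illustrates only that idea; the `declare` helper, `Requirement` class, and `REGISTRY` list are hypothetical and are not how any particular framework is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    kind: str                                   # e.g. "bucket" or "topic"
    name: str
    permissions: set[str] = field(default_factory=set)


# Every declaration made by the application is recorded here
REGISTRY: list[Requirement] = []


def declare(kind: str, name: str, *permissions: str) -> Requirement:
    """Record that the application needs a resource with the given access."""
    requirement = Requirement(kind, name, set(permissions))
    REGISTRY.append(requirement)
    return requirement


# "Application code": the developer only states what the code uses
reports = declare("bucket", "reports", "read", "write")
updates = declare("topic", "updated", "publish")

# "Tooling": the specification falls out of the declarations
specification = [(r.kind, r.name, sorted(r.permissions)) for r in REGISTRY]
print(specification)
```

The point of the design is that the developer never writes a separate requirements document; the specification is a by-product of the code itself.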

Imagine, too, that while infrastructure requirements are inferred from application code, the operations team still retains control over critical decisions. They choose the cloud provider, services, and security configuration for each resource, which lets them apply their expertise and enforce best practices. This keeps the infrastructure robust and compliant with organizational standards, combining automation with expert oversight.

This is at the heart of the new concept of Infrastructure from Code (IfC), which builds on Infrastructure as Code (IaC). An IfC framework like Nitric can give operations teams what they've been looking for: a real-time, detailed specification of the resources and permissions an application requires.

An example of an automated resource specification

The following example shows how a resource specification can be generated from application code. This application runs every five days and publishes an update event that contains a download URL for a report.

from nitric.resources import schedule, bucket, topic
from nitric.application import Nitric

images = bucket("reports").allow("read", "write")
updates = topic("updated").allow("publish")

processor = schedule("process-reports")
@processor.every("5 days")
async def process_transactions(ctx):
    download_url = await images.file('report.csv').download_url(3600)
    await updates.publish({
        'url': download_url
    })

Nitric.run()

From this code snippet, the Nitric framework gathers the following information:

Bucket resource: ID: reports. Configuration: default settings.
Topic resource: ID: updated. Configuration: default settings.
Schedule resource: ID: process-reports. Configuration: targets the hello-world_services-hello service and runs every five days.
Policy resource: ID: eccfffd7a5e31407be6f7a5663665df4. Configuration: a policy allowing the hello-world_services-hello service to read from and write to the reports bucket.
Policy resource: ID: 74e4fa18c1527363767c00582b792ed9. Configuration: a policy allowing the hello-world_services-hello service to perform custom action 200 on the updated topic.
Service resource: ID: hello-world_services-hello. Configuration: a service with the image hello-world_services-hello, one worker, and the environment variable NITRIC_BETA_PROVIDERS set to true.

This information is compiled into resource specifications, ensuring that all necessary resources are configured accurately and consistently.

Note: IDs are automatically generated and are used to uniquely identify a resource.
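
To show what "compiled into a resource specification" might look like in practice, here is a minimal sketch that models the resources listed above as plain data and serializes them to JSON. The `Resource` class and the field names (`kind`, `config`, `principal`, and so on) are assumptions made for illustration and do not reflect Nitric's actual specification format; the second policy is omitted for brevity.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Resource:
    kind: str                                   # e.g. "bucket", "topic", "schedule"
    id: str                                     # identifier taken from the list above
    config: dict = field(default_factory=dict)  # empty means default settings


SERVICE = "hello-world_services-hello"

spec = [
    Resource("bucket", "reports"),
    Resource("topic", "updated"),
    Resource("schedule", "process-reports",
             {"target": SERVICE, "every": "5 days"}),
    Resource("policy", "eccfffd7a5e31407be6f7a5663665df4",
             {"principal": SERVICE, "resource": "reports",
              "actions": ["read", "write"]}),
    Resource("service", SERVICE,
             {"image": SERVICE, "workers": 1,
              "env": {"NITRIC_BETA_PROVIDERS": "true"}}),
]

# Serialize so the specification can be handed to deployment tooling
spec_json = json.dumps([asdict(r) for r in spec], indent=2)

# Which kinds of resource appear in the specification?
kinds = sorted({r.kind for r in spec})
print(kinds)
```

Because the specification is ordinary structured data, the operations team can review it, diff it between releases, or feed it to deployment tooling.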

Automatically apply resource specifications

Auto-generated resource specifications address most of the communication and drift issues discussed above. Platform teams can gain even more by automatically applying these specifications to the IaC modules they create. Frameworks like Nitric, used in my example above, also automate the deployment scripting for platform teams.

Using the resource specification, each component is mapped to a corresponding IaC module. For example, if your application specifies a schedule resource and the target cloud provider is AWS, the system can use a Terraform module like the following to configure the schedule handler:

# Create role and policy to allow schedule to invoke lambda
resource "aws_iam_role" "role" {
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Principal = {
          Service = "scheduler.amazonaws.com"
        },
        Action = "sts:AssumeRole"
      }
    ]
  })
}

resource "aws_iam_role_policy" "role_policy" {
  role = aws_iam_role.role.id
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Effect = "Allow",
        Action = "lambda:InvokeFunction",
        Resource = var.target_lambda_arn
      }
    ]
  })
}

# Create an AWS eventbridge schedule
resource "aws_scheduler_schedule" "schedule" {
  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression_timezone = var.schedule_timezone

  schedule_expression = var.schedule_expression

  target {
    arn      = var.target_lambda_arn
    role_arn = aws_iam_role.role.arn

    input = jsonencode({
        "x-nitric-schedule": var.schedule_name
    })
  }
}

This automatic mapping ensures that the deployed infrastructure is in sync with the requirements of the application, preventing drift and reducing the likelihood of deployment failure.
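
The mapping step itself can be sketched as a simple lookup from a resource kind and a target provider to the module the platform team has chosen. This is a hedged illustration only: the `MODULES` table, the module paths, and the `plan` function are hypothetical names, not part of Nitric or Terraform.

```python
# Hypothetical table maintained by the platform team:
# (provider, resource kind) -> path of the IaC module to use
MODULES = {
    ("aws", "bucket"): "./modules/aws/s3-bucket",
    ("aws", "topic"): "./modules/aws/sns-topic",
    ("aws", "schedule"): "./modules/aws/eventbridge-schedule",
}


def plan(spec: list[tuple[str, str]], provider: str) -> list[tuple[str, str]]:
    """Return (resource id, module path) pairs, failing loudly on gaps."""
    planned = []
    for kind, resource_id in spec:
        module = MODULES.get((provider, kind))
        if module is None:
            # An unmapped kind is reported before anything is deployed
            raise LookupError(f"no {provider} module registered for {kind!r}")
        planned.append((resource_id, module))
    return planned


# (kind, id) pairs taken from the example application
spec = [
    ("bucket", "reports"),
    ("topic", "updated"),
    ("schedule", "process-reports"),
]

deployment = plan(spec, "aws")
for resource_id, module in deployment:
    print(f"{resource_id} -> {module}")
```

Keeping the table in the platform team's hands is what preserves their control: developers declare what they need, and operations decides which module satisfies it.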

No more "throwing it over the fence"

By bridging the gap between development and operations teams with automated resource specifications, you can create a more harmonious and efficient deployment process. This approach not only reduces the risk of infrastructure drift and deployment failures, but also fosters better communication and trust between teams. It leads to easier, more reliable deployments and a more robust infrastructure. The main benefits are:

Consistency: Automated resource specification ensures that the deployed infrastructure matches the needs of the application, reducing the risk of drift.
Efficiency: Reduces deployment time and minimizes the need for manual intervention by automating the generation and configuration of resources.
Reduced stress: Operations teams can trust that the infrastructure will be configured correctly, resulting in smoother deployments and fewer late-night troubleshooting sessions.
Improved communication: Developers don't need to worry about manually specifying infrastructure requirements; the system handles this automatically, ensuring that requirements are communicated accurately.

Learn more about this approach by looking at what we've built using the open-source Nitric framework. We'd love to hear your feedback, ideas, and contributions to help automate the most tedious parts of platform engineering.
