
OTel Is the Secret to DevOps Success

Author: Cloud and cloud sentient beings

The real work of development and operations doesn't begin until the code is deployed, which raises the importance of observability to application performance.

Translated from OTel Is the Secret to DevOps Success, by Clay Roach.

The old boundary of "developers write code, operators run code" no longer exists. If you write, design, or contribute to an application, you have some responsibility for the execution of the application in production. At some point, you will be asked to diagnose and fix it.

When creating an application, developers need to adopt, from the start, the mindset that the real work begins after the code is deployed. That's when developers see how their apps perform in the real world and make sure they provide a positive customer experience.

By using application performance management (APM) tools to capture detailed information about code and business processes upfront, operators can inherit all the good, thoughtful work that developers do. Operations staff can pinpoint critical issues faster during incident response, saving time and effort. With access to information that helps fix errors and latency issues specific to each application, life becomes easier for both developers and operators.

Essentially, APM is about implementing DevOps, and developers who are building better observability into the applications they develop put themselves at the forefront of that implementation.

Development and Operations: Different Perspectives

Although developers and operators have something in common, they still operate from different perspectives. Developers dedicate their careers to creating applications that are critical to the business. With every application written, developers want to be creators, troubleshooters, and fixers.

Developers also want to see how a new feature is actually used once it goes into production. Is the app working as expected? Does it provide value to customers? Are the business processes supported by the application improved?

At the same time, operators take a holistic view of application and infrastructure performance. Is everything working properly? Are infrastructure changes impacting application performance? Does this issue affect other services? Are we meeting customer expectations or contractual service-level agreements (SLAs)? If it's a code issue, who needs access to this information to fix it?

With this in mind, how can we get development feedback loops and the best business metrics for true DevOps?

Pros and Cons of OTel

The key to creating a high-performance application that doesn't consume resources is to understand the application code in production through thoughtful instrumentation such as OpenTelemetry (OTel). By capturing details about application processes and dependencies as they create code, developers can save a lot of time later when they need to fix issues or improve performance.

OTel supports using both automatic and manual instrumentation in the same application. Manual instrumentation lets developers add code snippets that capture and send custom metrics specific to their applications.

Automatic instrumentation provides pre-built libraries or agents that capture standard metrics such as CPU usage, memory usage, request latency, and error rates. Because it doesn't require developers to modify code, it is simpler and faster to implement, but less flexible.

Manual and custom instrumentation gives DevOps teams easy access to details about what happened and why, presented in a useful format. Additionally, using OTel can help you design and improve monitoring in both local and test environments, so you know what to expect in production. Because the tooling is the same everywhere, you won't end up with different datasets in different environments.

However, OpenTelemetry itself doesn't know what's important to the business. The technology captures SQL queries, HTTP/TCP calls, messaging calls, and hardware and network information. OTel doesn't capture user IDs, non-generic metadata, or anything specific to your application and business.

This is where custom instrumentation comes into play. Custom instrumentation takes work and time to implement, but it gives developers the flexibility to control the capture of the information they need to troubleshoot in production.

Real-world examples

To understand how this works in practice, let's look at an online shopping cart checkout. A transaction may hit one endpoint, four endpoints, or even 10 or more endpoints. These endpoints may hit other endpoints. An application might have a Kafka backend, a message bus backend, database or NoSQL database storage, or any number of custom APIs or resources. When a customer places an order, the system runs all of these business-specific applications and services related to order processing, billing, marketing, and fulfillment.

So, when many users check out through their shopping carts, how can you be sure that when a customer clicks the buy button, it will correctly trigger the completion of order processing, purchasing, shipping, billing, and anything else that is needed? And most importantly, how do you know everyone is being billed correctly?

With custom instrumentation, OTel enables you to link all of these different applications together and get a holistic view of the entire business transaction across all of these services.
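The linking mechanism OTel uses by default is W3C Trace Context: each outgoing call carries a `traceparent` header so every hop in the checkout flow shares one trace ID. A pure-stdlib sketch of the idea (real propagation is handled by OTel's propagators, not hand-rolled code like this):

```python
# Conceptual sketch of W3C traceparent propagation across services.
import re
import secrets
from typing import Optional

def make_traceparent(trace_id: Optional[str] = None) -> str:
    # Format: version "00" - 32-hex trace id - 16-hex span id - flags "01" (sampled).
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = secrets.token_hex(8)                # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def parse_trace_id(traceparent: str) -> str:
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}", traceparent)
    if not m:
        raise ValueError("malformed traceparent")
    return m.group(1)

# The cart service starts a trace; billing reuses its trace ID, so both
# spans belong to the same business transaction.
cart_header = make_traceparent()
shared_trace_id = parse_trace_id(cart_header)
billing_header = make_traceparent(shared_trace_id)
assert parse_trace_id(billing_header) == shared_trace_id
```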

OTel acts as a bridge within DevOps, connecting the network traffic data and resources your operations team monitors with the code your development team watches. This granular observability enables DevOps to quickly troubleshoot and resolve issues and ensure that applications and business processes are optimized and accurate.

Custom instrumentation also enables applications to capture business-specific telemetry data that is critical to your DevOps team ensuring a great user experience.

To use OTel or not to use OTel

Companies with large APM deployments may already have highly skilled developers on staff who can leverage OTel and custom instrumentation to improve DevOps efficiency.

If your developers don't yet know how to do custom instrumentation, it might be worth letting them learn. You can embed custom OTel instrumentation in your application incrementally, spreading the time and cost across the entire development cycle.

Existing APM users need to weigh more factors, starting with the breadth of their APM deployment. When you're monitoring thousands of apps, adding OTel functionality is undoubtedly more complicated, and the cost can seem prohibitive. These companies can test the benefits of OTel-enhanced APM in a subset of their applications, or use low-cost, open-source monitoring alternatives in development or general availability environments.

OpenTelemetry for DevOps: Next Steps

The goal of OTel is to standardize the collection and export of telemetry data so that organizations have the flexibility to choose their back-end APM or observability solutions. With the addition of support for profiling, which dynamically inspects the behavior and performance of application code at runtime, the OpenTelemetry project is expanding its capabilities to match commercial products.

Continuous profiling provides insight into code-level resource utilization and allows profiling data to be stored, queried, and analyzed over time and across different attributes. This data lets developers and operations staff correlate resource exhaustion or a poor user experience with the specific service or pod affected and the function or line of code responsible.

Whether your business is large or small, new to APM or a broad APM user, OTel can help you deliver on the promise of observability with minimal additional code or effort.
