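SLS Time-Series Monitoring in Practice: Best Practices for Monitoring Spring Boot Applications

Preface

With the rise of cloud native and microservices, the environments our applications run in have become increasingly complex, and their runtime state has become harder to understand. A number of open-source projects have emerged to improve application observability, such as Prometheus, Grafana, OpenTracing, and OpenTelemetry. Most of these are general-purpose technologies, so how should we monitor and analyze data at each layer in a real scenario? This article takes the most widely used stack, Java + Spring Boot, as an example and walks through best practices for application monitoring.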

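Choosing the Monitoring Software

Collection, Storage, and Query: Prometheus

In the cloud-native world, Prometheus has almost become the standard for monitoring. It has a rich ecosystem that covers operating systems, databases, and all kinds of middleware, and it provides SDKs for many languages. We will not introduce Prometheus further here. This article uses Prometheus as the base monitoring software and builds the best practices around it.

Prometheus has problems of its own, though: its local storage is not suited to long-term retention, it is hard to scale under large data volumes, and it runs as a single node.

The SLS time-series store addresses exactly these problems. It supports PromQL and the Prometheus API, so it can directly replace the storage and query layer of open-source Prometheus.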
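Because the store speaks the Prometheus API, a client can query it in the same way as an ordinary Prometheus server. The following is a minimal sketch, using only the JDK, that builds an instant-query URL for the standard Prometheus HTTP API path /api/v1/query. The class name PromQueryUrl, the base URL, and the job label value are illustrative assumptions; the real SLS endpoint and its authentication are not shown here and should be taken from the SLS documentation.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PromQueryUrl {

    // Builds an instant-query URL for the Prometheus HTTP API (/api/v1/query).
    // baseUrl is whatever Prometheus-compatible endpoint is being queried.
    static URI instantQuery(String baseUrl, String promQl) {
        // Drop a trailing slash so the path separator is not doubled.
        String base = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
        // PromQL contains characters such as { } " = that must be escaped.
        String encoded = URLEncoder.encode(promQl, StandardCharsets.UTF_8);
        return URI.create(base + "/api/v1/query?query=" + encoded);
    }

    public static void main(String[] args) {
        // Assumption: a local Prometheus; a compatible endpoint would go here instead.
        URI uri = instantQuery("http://localhost:9090", "up{job=\"spring-demo\"}");
        System.out.println(uri);
    }
}
```

The resulting URI can be passed to any HTTP client; the response is the JSON document defined by the Prometheus HTTP API.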

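Visualization: Grafana

Prometheus's built-in visualization is weak, so Grafana is usually connected to Prometheus to build dashboards. The two make an excellent pair.

Notification: Alertmanager

Alertmanager is a component provided by Prometheus. It offers rich notification routing and a variety of delivery channels.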

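Monitoring Layers

By monitored object:

Operating system monitoring (OS)

Middleware/database monitoring (MySQL, Kafka, ES, HBase, etc.)

Application monitoring (JVM, Tomcat, Spring)

Business monitoring (SDK, Log)

By data flow:

Expose metrics -> Collect -> Store -> Visualize -> Analyze
                                   -> Alert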

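Host Monitoring

Host monitoring, that is, monitoring at the operating system level, can be done in several ways:

Prometheus officially provides node_exporter. If your machines are not on Alibaba Cloud and run a mainstream Linux distribution, this is the recommended way to expose metrics.
Our Logtail also provides a host monitoring plugin that exposes most of the commonly used metrics. On Alibaba Cloud machines this approach is very easy to set up; see the documentation: Collect host monitoring data (Alibaba Cloud Log Service > Time Series Storage > Data Ingestion).
If you already use telegraf, you can also use telegraf to send data to SLS.
ECS instances on the cloud are already covered by CloudMonitor, so you can import CloudMonitor data directly instead of collecting it again; see the documentation: Import CloudMonitor data (Alibaba Cloud Log Service > Time Series Storage > Data Ingestion). Because CloudMonitor collects only a set of core metrics, this approach yields fewer metrics.

SLS aims to be as open as possible and is compatible with a variety of open-source software and protocols, so you can ingest data in whichever way suits you best.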

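Middleware/Database Monitoring

Mainstream open-source middleware is generally well integrated with open-source monitoring software. For example, MySQL can be collected with telegraf, and Prometheus exporters support most open-source software well:

Exporters and integrations | Prometheus

As with host monitoring, if you use a database or middleware service provided by Alibaba Cloud, CloudMonitor very likely already has the relevant metrics, so you can also choose to import CloudMonitor data.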

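Application Monitoring

Spring Boot Actuator

As the most mainstream Java web framework, Spring Boot naturally comes with monitoring support, in the form of Actuator. To use Actuator, first add the dependency: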

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

Actuator provides 13 endpoints by default. Of these, only /health and /info are exposed by default; the others can be enabled by changing the configuration:

management:
  endpoints:
    web:
      exposure:
        include: '*'

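In Spring Boot 2.0 and later, Actuator uses Micrometer as its underlying metrics library. Micrometer is a facade for metrics, playing the same role that SLF4J plays for logging frameworks, and it can expose data in a variety of formats, Prometheus among them.

We add a small dependency to expose the data in Prometheus format: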

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>1.1.3</version>
</dependency>

This dependency enables an endpoint whose output is compatible with a Prometheus exporter, so that Prometheus can scrape it.

Remember to update the Spring Boot configuration as well:

server:
  port: 8080
spring:
  application:
    name: spring-demo
management:
  endpoints:
    web:
      exposure:
        include: 'prometheus' # expose /actuator/prometheus
  metrics:
    tags:
      application: ${spring.application.name} # add an application label to the exposed data

Then start the application and visit

http://localhost:8080/actuator/prometheus

You should get output like the following:

# HELP jvm_memory_committed_bytes The amount of memory in bytes that is committed for the Java virtual machine to use
# TYPE jvm_memory_committed_bytes gauge
jvm_memory_committed_bytes{application="spring-demo",area="heap",id="PS Eden Space",} 1.77733632E8
jvm_memory_committed_bytes{application="spring-demo",area="nonheap",id="Metaspace",} 3.6880384E7
jvm_memory_committed_bytes{application="spring-demo",area="heap",id="PS Old Gen",} 1.53092096E8
jvm_memory_committed_bytes{application="spring-demo",area="heap",id="PS Survivor Space",} 1.4680064E7
jvm_memory_committed_bytes{application="spring-demo",area="nonheap",id="Compressed Class Space",} 5160960.0
jvm_memory_committed_bytes{application="spring-demo",area="nonheap",id="Code Cache",} 7798784.0
# HELP jvm_classes_unloaded_classes_total The total number of classes unloaded since the Java virtual machine has started execution
# TYPE jvm_classes_unloaded_classes_total counter
jvm_classes_unloaded_classes_total{application="spring-demo",} 0.0
# HELP jvm_memory_max_bytes The maximum amount of memory in bytes that can be used for memory management
jvm_memory_max_bytes{application="spring-demo",area="nonheap",id="Code Cache",} 2.5165824E8
# HELP jvm_classes_loaded_classes The number of classes that are currently loaded in the Java virtual machine
# TYPE jvm_classes_loaded_classes gauge
jvm_classes_loaded_classes{application="spring-demo",} 7010.0
# HELP jvm_threads_daemon_threads The current number of live daemon threads
# TYPE jvm_threads_daemon_threads gauge
jvm_threads_daemon_threads{application="spring-demo",} 24.0
# HELP jvm_threads_states_threads The current number of threads having NEW state

# Output is long; the rest is omitted

This is the text format that Prometheus exporters expose.
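To make the structure of a sample line concrete (metric name, optional labels in braces, then the value), here is a simplified parser sketch using only the JDK. The class name ExpositionLine is hypothetical, and the parser deliberately skips parts of the format: it does not unescape label values, and it does not handle special values such as +Inf.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExpositionLine {
    // One label pair: name="value"; the value may contain backslash escapes.
    private static final Pattern LABEL =
            Pattern.compile("([a-zA-Z_][a-zA-Z0-9_]*)=\"((?:[^\"\\\\]|\\\\.)*)\"");

    final String name;
    final Map<String, String> labels;
    final double value;

    private ExpositionLine(String name, Map<String, String> labels, double value) {
        this.name = name;
        this.labels = labels;
        this.value = value;
    }

    // Comment lines (# HELP, # TYPE) and blank lines carry no sample.
    static boolean isSample(String line) {
        String s = line.trim();
        return !s.isEmpty() && !s.startsWith("#");
    }

    // Parses one sample line; call isSample first to filter out comments.
    static ExpositionLine parse(String line) {
        String s = line.trim();
        Map<String, String> labels = new LinkedHashMap<>();
        String name;
        String rest;
        int open = s.indexOf('{');
        if (open >= 0) {
            int close = s.lastIndexOf('}');
            name = s.substring(0, open);
            Matcher m = LABEL.matcher(s.substring(open + 1, close));
            while (m.find()) {
                labels.put(m.group(1), m.group(2)); // escapes are kept as written
            }
            rest = s.substring(close + 1).trim();
        } else {
            int space = s.indexOf(' ');
            name = s.substring(0, space);
            rest = s.substring(space + 1).trim();
        }
        // The value is the first token; an optional timestamp may follow it.
        double value = Double.parseDouble(rest.split("\\s+")[0]);
        return new ExpositionLine(name, labels, value);
    }

    public static void main(String[] args) {
        String line = "jvm_memory_committed_bytes{application=\"spring-demo\","
                + "area=\"heap\",id=\"PS Eden Space\",} 1.77733632E8";
        ExpositionLine sample = parse(line);
        System.out.println(sample.name + " " + sample.labels + " " + sample.value);
    }
}
```

The trailing comma inside the braces, as seen in the output above, is tolerated because labels are matched one pair at a time rather than split on commas.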

JVM Monitoring

The output above exposes very detailed JVM metrics, so we can now set up JVM monitoring.
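To give a feel for where values like jvm_memory_committed_bytes come from, the sketch below reads the JVM's memory pools through the standard java.lang.management API and renders them as lines in the same text format. This is an illustration under assumptions, not Micrometer's actual implementation; the class name JvmMemorySketch and the way labels are assembled are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class JvmMemorySketch {

    // Renders the committed size of every JVM memory pool as one text line each.
    static List<String> committedBytesLines(String application) {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String area = pool.getType() == MemoryType.HEAP ? "heap" : "nonheap";
            lines.add(String.format(Locale.ROOT,
                    "jvm_memory_committed_bytes{application=\"%s\",area=\"%s\",id=\"%s\"} %d",
                    application, area, pool.getName(), usage.getCommitted()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Pool names (the id label) depend on the JVM and the garbage collector in use.
        committedBytesLines("spring-demo").forEach(System.out::println);
    }
}
```

The pool names printed here vary with the garbage collector, which is why the id label values in scraped data differ between JVM configurations.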

First, Prometheus needs to scrape

http://localhost:8080/actuator/prometheus

so we update its configuration:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "spring-demo"
    metrics_path: "/actuator/prometheus"
    static_configs:
    - targets: ["localhost:8080"]

Start Prometheus. If there are no errors, it should already be scraping normally. Open the Prometheus web UI to check the data:

http://localhost:9090/graph

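If the metrics show up there, data collection is working. Once the data is in Prometheus, its remote write protocol can be used to persist the data to SLS; see the documentation:

Collect Prometheus monitoring data

The SLS time-series store can then be configured as a Grafana data source; there is documentation for this as well:

Connect time-series data to Grafana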

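Everything is now ready, and we can start configuring dashboards. We have uploaded a template dashboard to grafana.com that can be imported directly into Grafana:

Select + -> Import and paste the URL:

https://grafana.com/grafana/dashboards/12856

Then select the Prometheus data source created above to complete the import.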

