Symptoms
We had just set up a new Graylog cluster and applied some basic performance tuning. Early load tests, driven by the built-in random HTTP message generator, showed good stability and write throughput; the machines were new and under little pressure.
After running for a while, though, throughput would not scale when a large volume of logs came in, and eventually writes failed completely. CPU usage on the Graylog cluster was high, while the backing Elasticsearch cluster was nearly idle, system memory looked fine, and Graylog itself, surprisingly, logged no errors.
Investigation
Since CPU usage was high and Graylog is a JVM application, there were plenty of tools to choose from: arthas, JProfiler, jcmd, and so on. I prefer arthas, so the first step was to install it on the host and attach it to the Graylog process.
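A minimal sketch of the install-and-attach step, assuming the host has network access to the arthas download site and a JDK on the PATH:

```shell
# download the arthas bootstrapper (the official quick-start jar)
curl -O https://arthas.aliyun.com/arthas-boot.jar

# list the local JVM processes and attach to one interactively
java -jar arthas-boot.jar
```

arthas-boot prints the running JVMs and prompts for which one to attach to; pick the Graylog server process.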
Then run:
thread -n 5
to see what the top 5 busiest threads are doing. This surfaced the following suspicious output: a thread pegging a core, with a very long cumulative run time, stuck inside regex processing, and a greedy match at that:
"processbufferprocessor-9" Id=63 cpuUsage=99.94% deltaTime=202ms time=1232842ms RUNNABLE
at [email protected]/java.util.regex.Pattern$CharPropertyGreedy.match(Pattern.java:4314)
at [email protected]/java.util.regex.Pattern$Start.match(Pattern.java:3619)
at [email protected]/java.util.regex.Matcher.search(Matcher.java:1729)
at [email protected]/java.util.regex.Matcher.find(Matcher.java:746)
at app//org.graylog.plugins.pipelineprocessor.functions.strings.RegexMatch.evaluate(RegexMatch.java:63)
at app//org.graylog.plugins.pipelineprocessor.functions.strings.RegexMatch.evaluate(RegexMatch.java:37)
at app//org.graylog.plugins.pipelineprocessor.ast.expressions.FunctionExpression.evaluateUnsafe(FunctionExpression.java:63)
at app//org.graylog.plugins.pipelineprocessor.ast.expressions.FieldAccessExpression.evaluateUnsafe(FieldAccessExpression.java:48)
at app//org.graylog.plugins.pipelineprocessor.ast.expressions.EqualityExpression.evaluateBool(EqualityExpression.java:47)
at app//org.graylog.plugins.pipelineprocessor.processors.PipelineInterpreter.evaluateRuleCondition(PipelineInterpreter.java:425)
at app//org.graylog.plugins.pipelineprocessor.processors.PipelineInterpreter.evaluateStage(PipelineInterpreter.java:315)
at app//org.graylog.plugins.pipelineprocessor.processors.PipelineInterpreter.processForResolvedPipelines(PipelineInterpreter.java:277)
at app//org.graylog.plugins.pipelineprocessor.processors.PipelineInterpreter.process(PipelineInterpreter.java:149)
at app//org.graylog.plugins.pipelineprocessor.processors.PipelineInterpreter.process(PipelineInterpreter.java:105)
at app//org.graylog2.shared.buffers.processors.ProcessBufferProcessor.handleMessage(ProcessBufferProcessor.java:158)
at app//org.graylog2.shared.buffers.processors.ProcessBufferProcessor.dispatchMessage(ProcessBufferProcessor.java:128)
at app//org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:98)
at app//org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:49)
at app//com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143)
at app//com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66)
at [email protected]/java.lang.Thread.run(Thread.java:829)
Cause
At first we spent quite a bit of time tuning both the input components and the Graylog server itself, for example adjusting the number of processing threads and resizing the JVM heap, with no effect. Looking at the stack trace more carefully, we saw the time was being spent in PipelineInterpreter (pipelines are a Graylog feature for transforming and enriching messages in a stream).
That pinned down the root cause. The next step was to review what the pipelines we had added were doing: sure enough, several rules did heavy regex processing, and they were slow because the incoming time format did not match what the patterns expected.
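For illustration, a pipeline rule that guards on a regex looks roughly like this (the field name and pattern here are hypothetical, not our production rule). Anchoring the pattern and avoiding nested quantifiers lets a non-matching field fail fast instead of backtracking through the whole value:

```text
rule "flag ISO-8601 timestamps"
when
  regex("^\\d{4}-\\d{2}-\\d{2}T", to_string($message.timestamp_raw)).matches == true
then
  set_field("has_iso_timestamp", true);
end
```

The stack trace above, with FunctionExpression feeding FieldAccessExpression, matches this shape: the regex() function runs first, then the .matches field of its result is read by the rule condition.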