
ELK: Logstash 2.2 mutate plugin [Translation + Practice] http://www.bdata-cap.com/newsinfo/1712678.html


Contents

  • Syntax
  • Test data
  • Configuration options

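The mutate plugin performs transformations on fields, including renaming, removing, replacing, and modifying them. It is one of the most commonly used plugins.

For example:

  • You have parsed Tomcat log lines into fields with a Grok expression and want to convert the status code, byte size, or response time to integers;
  • You have parsed log lines into fields with regular expressions, but the values are in mixed case, which is clearly not much use for full-text search in Elasticsearch; in that case, this plugin can convert the field contents to all lowercase.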


Syntax

The plugin's options must be wrapped in a mutate block, as shown below:

mutate {}      

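The available configuration options are listed in the table below:

Setting         Input type  Required  Default value
add_field       hash        No        {}
add_tag         array       No        []
convert         hash        No        (none)
gsub            array       No        (none)
join            hash        No        (none)
lowercase       array       No        (none)
merge           hash        No        (none)
periodic_flush  boolean     No        false
remove_field    array       No        []
remove_tag      array       No        []
rename          hash        No        (none)
replace         hash        No        (none)
split           hash        No        (none)
strip           array       No        (none)
update          hash        No        (none)
uppercase       array       No        (none)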

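Of these, add_field, remove_field, add_tag, and remove_tag are common to all Logstash filter plugins. They take effect only after the filter has been applied successfully. Although Logstash calls these plugins filters, they do much more than filter events.

Tags are for marking an event during processing so that later stages can act on it. Every event can carry a built-in tags array that collects the tags added along the way, whether by Logstash itself or by you. For example, if grok fails to parse a log line, Logstash adds a _grokparsefailure tag on its own, so that in the output stage you can skip the events that failed to parse.

Fields, by contrast, hold the event's data, and the field options operate on them; for example, you can create a new field from existing ones. This is covered later.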
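
As a minimal sketch of the tag-based handling described above, the output section below prints only events that do not carry the _grokparsefailure tag. The use of stdout with the rubydebug codec simply mirrors the other examples in this article; any output plugin could sit inside the conditional.

```
output {
  # Skip events that grok could not parse
  if "_grokparsefailure" not in [tags] {
    stdout {
      codec => rubydebug
    }
  }
}
```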

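You will also notice that every option in the table is either a verb or a verb-object phrase. As you may have guessed, each option essentially corresponds to a Ruby function, and what follows the "=>" is its arguments (you would probably design it the same way yourself). The first argument is the field the function should act on; the arguments from the second onward are the function's specific parameters.

What is a field? Suppose you parse a Tomcat access log and split each line into the client IP, byte size, response time, and so on, storing each piece in a named variable. Each such variable is a field.

The options are described one by one below.

Test data

Suppose we have the following Tomcat access log: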

192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET "/goLogin" "" 8080 200 1692 23 "http://10.1.8.193:8080/goMain" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"      
192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET "/js/common/jquery-1.10.2.min.js" "" 8080 304 - 67 "http://10.1.8.193:8080/goLogin" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"      
192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET "/css/common/login.css" "" 8080 304 - 75 "http://10.1.8.193:8080/goLogin" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"      
192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET "/js/system/login.js" "" 8080 304 - 53 "http://10.1.8.193:8080/goLogin" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0"      

It was produced by the following Tomcat configuration:

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="logs"      
prefix="localhost_access_log." suffix=".txt"      
pattern="%h %l %u %t %m &quot;%U&quot; &quot;%q&quot; %p %s %b %D &quot;%{Referer}i&quot; &quot;%{User-Agent}i&quot;" />      

If the log is parsed with the following Grok expression:

%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}      

the result looks like this:

{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-17T08:26:07.794Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""      
}      

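Note the data types of the fields after the split. The port, statusCode, bytes, and reqTime fields should be (or at least are best stored as) numbers, but for now they are strings. Conversion is covered later; the examples below all build on this result.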

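Configuration options

add_field

  • Value type is hash, that is, key/value pairs, for example add_field => {"field1"=>"value1","field2"=>"value2"}.
  • Default value is the empty hash, {}

Adds new fields.

Example: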

input {      
stdin {      
}      
}      
filter {      
grok {      
match=>["message","%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}"]      
}      
mutate {      
add_field=>{      
"SayHi"=>"Hello , %{clientip}"      
}      
}      
}      
output{      
stdout{      
codec=>rubydebug      
}      
}      

Note the mutate block. Parsing the Tomcat access log above with this configuration gives the following result:

{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-17T04:52:02.031Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"SayHi" => "Hello , 192.168.6.25"      
}      
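
There is now an additional SayHi field. Its name is hard-coded here, but it can also be built dynamically. If you change

"SayHi"=>"Hello , %{clientip}"

to:

"another_%{clientip}"=>"Hello , %{clientip}"

you will see the following result:
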
{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-17T06:38:04.427Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"another_192.168.6.25" => "Hello , 192.168.6.25"      
}      

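Although this example is somewhat contrived, it shows that the values of existing fields can be used to build new field names and values. The example above adds only one field; you can also add several at once: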

add_field=>{      
"another_%{clientip}"=>"Hello , %{clientip}"      
"another_%{http_method}"=>"Hello, %{http_method}"      
}      

add_tag

  • Value type is array
  • Default value is the empty array, []

Adds new tags.

mutate {      
add_tag=>[      
"foo_%{clientip}"      
]      
}      

You will see the following result:

{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-17T06:48:43.278Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"tags" => [      
[0] "foo_192.168.6.25"      
]      
}      

As with add_field, several tags can be added at once.
Note that add_tag takes an array [], not a hash {}.
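
For instance, a sketch that adds two tags at once might look like the following; the tag name tomcat_access is a made-up example.

```
mutate {
  add_tag => [
    "foo_%{clientip}",
    "tomcat_access"
  ]
}
```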

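convert

  • Value type is hash
  • There is no default value

Converts a field's value to a different data type.

When converting to boolean, the accepted values are:

  • true, t, yes, y, and 1
  • false, f, no, n, and 0

Values can also be converted to integer, float, and string.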

mutate {      
#convert=>["reqTime","integer","statusCode","integer","bytes","integer"]      
convert=>{"port"=>"integer"}      
}      

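convert can be written in two ways: as an array, read two elements at a time, or as a hash. The configuration above gives the following result: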

{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-17T09:06:25.360Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => 8080,      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""      
}      
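
Note:
  • The port field no longer has double quotes, because its value is now a number.
  • The value types of the mutate options are kept deliberately simple: each is either a hash (key/value pairs) or an array. For example, in convert=>["reqTime","integer","statusCode","integer"] the elements are read in pairs, the first being the field and the second the target data type; no nested or composite types are used. The plugin author's intent appears to be simplicity: complex data types may look convenient, but they come at a cost. Simple is fine as long as the convention is agreed upon. Many Logstash plugins and their options work this way.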
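
As a small sketch of the hash form with more than one field, the block below converts port, statusCode, and reqTime in one go. It assumes every value is numeric text; bytes is left out here because it can be "-" in the test data and needs the separate handling described under gsub.

```
filter {
  mutate {
    # Hash form: one "field" => "type" pair per entry, no commas between entries
    convert => {
      "port"       => "integer"
      "statusCode" => "integer"
      "reqTime"    => "integer"
    }
  }
}
```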

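gsub

Performs string substitution. The pattern can be a regular expression or a plain string. It applies only to string fields; if the field is not a string, nothing happens and no error is raised.

The value is an array read three elements at a time: the field name, the string (or regular expression) to match, and the replacement string.

Example: when parsing Tomcat logs, the byte size of a resource may be "-", so we replace "-" with 0 and then use convert to turn it into a number.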

input {      
stdin {      
}             
}      
filter {      
grok {      
match=>["message","%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}"]      
}      
mutate {      
gsub=>["bytes","-","0"]      
convert=>["port","integer","reqTime","integer","statusCode","integer","bytes","integer"]      
}      
}      
output{      
stdout{      
codec=>rubydebug      
}      
}      

This gives the following result:

{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/js/common/jquery-1.10.2.min.js\" \"\" 8080 304 - 67 \"http://10.1.8.193:8080/goLogin\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-17T09:17:21.745Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/js/common/jquery-1.10.2.min.js\"",      
"request_query" => "\"\"",      
"port" => 8080,      
"statusCode" => 304,      
"bytes" => 0,      
"reqTime" => 67,      
"referer" => "\"http://10.1.8.193:8080/goLogin\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""      
}      
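
One caveat about the configuration above, offered as a hedged note rather than a statement verified against this exact Logstash version: the mutate plugin documentation for later releases describes a fixed processing order inside a single mutate block, in which convert runs before gsub regardless of the order the options are written in. Under that assumption, the substitution is applied more reliably when gsub and convert sit in separate mutate blocks, which run in the order written. The anchored pattern ^-$ is an assumption of this sketch, chosen so that only a value consisting of a single "-" is replaced.

```
filter {
  # First block: replace a lone "-" in bytes with "0"
  mutate {
    gsub => ["bytes", "^-$", "0"]
  }
  # Second block: runs after the first, so bytes already holds numeric text
  mutate {
    convert => ["port", "integer", "reqTime", "integer", "statusCode", "integer", "bytes", "integer"]
  }
}
```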

join

Joins the elements of an array with a separator. If the field is not an array, nothing happens.

filter {
  mutate {
    join =>{"fieldname"=>","}}}      
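
To see join at work on the test data, a sketch like the following first splits clientip into an array (split is described later in this article) and then joins it back with a different separator. Two mutate blocks are used so that the steps run in the order written; the "-" separator is an arbitrary choice for illustration.

```
filter {
  mutate {
    # "192.168.6.25" becomes ["192", "168", "6", "25"]
    split => { "clientip" => "." }
  }
  mutate {
    # The array is joined back into the string "192-168-6-25"
    join => { "clientip" => "-" }
  }
}
```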

lowercase and uppercase

  • Value type is array
  • There is no default value

Converts a string to lowercase or uppercase.

filter {      
mutate {      
lowercase =>["fieldname"]}}      
filter {      
mutate {      
uppercase =>["fieldname"]}}      
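
Applied to the test data, a sketch such as the following would lowercase the http_method field, turning "GET" into "get"; the choice of field is only for illustration.

```
filter {
  mutate {
    lowercase => ["http_method"]
  }
}
```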

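merge

Merges two fields that are arrays or hashes. There are three cases; when merging is possible, the result is an array:

  • an array and a string can be merged
  • a string and a string can be merged
  • an array and a hash cannot be merged
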
mutate {      
add_field=>{"arr_clientip"=>"%{clientip}"}      
add_field=>{"arrmstr_clientip"=>"%{clientip}"}      
add_field=>{"arrmarr_clientip"=>"%{clientip}"}      
#merge=>{"merge_clientip"=>"clientip"}      
}      
mutate {      
split=>{"arr_clientip"=>"."}      
split=>{"arrmstr_clientip"=>"."}      
split=>{"arrmarr_clientip"=>"."}      
}      
mutate {      
merge=>{"arrmstr_clientip"=>"clientip"}      
merge=>{"arrmarr_clientip"=>"arr_clientip"}      
}      

The value of the field after => is merged into the field before it.
This gives the following result:

{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-18T02:53:35.671Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"arr_clientip" => [      
[0] "192",      
[1] "168",      
[2] "6",      
[3] "25"      
],      
"arrmstr_clientip" => [      
[0] "192",      
[1] "168",      
[2] "6",      
[3] "25",      
[4] "192.168.6.25"      
],      
"arrmarr_clientip" => [      
[0] "192",      
[1] "168",      
[2] "6",      
[3] "25",      
[4] "192",      
[5] "168",      
[6] "6",      
[7] "25"      
]      
}      

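periodic_flush

  • Value type is boolean
  • Default value is false

Calls the filter's flush method at regular intervals. Optional.

remove_field

  • Value type is array
  • Default value is the empty array, []

Removes fields.

Example: remove the message field.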

mutate {      
remove_field=>["message"]      
}      
{      
"@version" => "1",      
"@timestamp" => "2016-05-18T02:04:16.879Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""      
}      

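The message field is gone. It holds the original log line, so keeping it means the log is stored twice: once before splitting and once after.

Of course, several fields can be removed at once.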
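
A sketch of removing several fields in one option, using fields from the test data above:

```
mutate {
  # Drop the raw line and two fields that are always "-" in the test data
  remove_field => ["message", "identd", "auth"]
}
```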

remove_tag

  • Value type is array
  • Default value is the empty array, []

Removes tags.

filter {      
mutate {      
remove_tag =>["foo_%{somefield}"]}}      

Several tags can also be removed at once:

filter {      
mutate {      
remove_tag =>["foo_%{somefield}","sad_unwanted_tag"]}}      

rename

Renames one or more fields.

Example:

input {      
stdin {      
}             
}      
filter {      
grok {      
match=>["message","%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}"]      
}      
mutate {      
rename=>{"clientip"=>"host"}      
}      
}      
output{      
stdout{      
codec=>rubydebug      
}      
}      
{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-17T09:29:44.018Z",      
"host" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""      
}      

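In the Grok expression the client IP is called clientip, but mutate can rename it to host. Note that in the result above the original host value ("vcyber") has been overwritten and the clientip field no longer exists.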
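
Since rename takes a hash, several fields can be renamed in one option. The sketch below is an illustration only; the new names status_code and req_time are hypothetical.

```
mutate {
  rename => {
    "statusCode" => "status_code"
    "reqTime"    => "req_time"
  }
}
```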

replace

Replaces the value of the specified field with a new value.

input {      
stdin {      
}             
}      
filter {      
grok {      
match=>["message","%{IPORHOST:clientip} %{NOTSPACE:identd} %{NOTSPACE:auth} \[%{HTTPDATE:timestamp}\] %{WORD:http_method} %{NOTSPACE:request} %{NOTSPACE:request_query|-} %{NUMBER:port} %{NUMBER:statusCode} (%{NOTSPACE:bytes}|-) %{NUMBER:reqTime} %{QS:referer} %{QS:userAgent}"]      
}      
mutate {      
replace=>{"message"=>"%{clientip}: My new Message."}      
}      
}      
output{      
stdout{      
codec=>rubydebug      
}      
}      
{      
"message" => "192.168.6.25: My new Message.",      
"@version" => "1",      
"@timestamp" => "2016-05-18T01:55:34.566Z",      
"host" => "vcyber",      
"clientip" => "192.168.6.25",      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""      
}      

The value of the message field has changed.

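split

Splits a string into an array using a separator character or string. It applies only to string fields.

Example: split the client IP on the period into an array.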

mutate {      
split=>{"clientip"=>"."}      
}      

This gives the following result:

{      
"message" => "192.168.6.25 - - [24/Apr/2016:01:25:53 +0800] GET \"/goLogin\" \"\" 8080 200 1692 23 \"http://10.1.8.193:8080/goMain\" \"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\"",      
"@version" => "1",      
"@timestamp" => "2016-05-18T01:58:40.687Z",      
"host" => "vcyber",      
"clientip" => [      
[0] "192",      
[1] "168",      
[2] "6",      
[3] "25"      
],      
"identd" => "-",      
"auth" => "-",      
"timestamp" => "24/Apr/2016:01:25:53 +0800",      
"http_method" => "GET",      
"request" => "\"/goLogin\"",      
"request_query" => "\"\"",      
"port" => "8080",      
"statusCode" => "200",      
"bytes" => "1692",      
"reqTime" => "23",      
"referer" => "\"http://10.1.8.193:8080/goMain\"",      
"userAgent" => "\"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0\""      
}      

strip

Strips whitespace from the beginning and end of the field's value.

filter {      
mutate {      
strip =>["field1","field2"]}}      

update

Updates an existing field with a new value. If the field does not exist, no action is taken.

filter {
  mutate {
    update =>{"sample"=>"My new message"}}}      
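
To illustrate the behavior described above, the sketch below updates one field that exists in the test data and one that does not. The field name no_such_field is hypothetical; going by the description, only message should change and no new field should be created.

```
filter {
  mutate {
    # message exists, so its value is overwritten;
    # no_such_field does not exist, so that entry is ignored
    update => {
      "message"       => "My new message"
      "no_such_field" => "never written"
    }
  }
}
```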
