
Bookmark This! A Cost-Effective Solution for Analyzing Massive CDN Logs

Overview

CDN generates a large volume of logs. These logs can be transferred to an OSS bucket via log dump, or pushed to Log Service (SLS) via real-time log push for real-time analysis. Within SLS you can also configure OSS log delivery to keep massive CDN logs in OSS for long-term storage. The CDN logs accumulated in OSS can then be analyzed with Data Lake Analytics (DLA).


The rest of this article assumes the raw CDN logs were transferred or uploaded directly to OSS, not delivered through the SLS-to-OSS log-delivery pipeline (for logs delivered to OSS by SLS, refer to the dedicated guide for that path). The sections below describe how to create a table in DLA and analyze these logs.

Step 1: Create a table over the CDN logs on OSS

Assume you have already uploaded your CDN logs to OSS, with each log line looking like the following example:

[9/Jun/2015:01:58:09 +0800] 188.165.15.75 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "text/html"
           

The fields in each line, in order, are: log timestamp, client access IP, proxy IP, response time, referer, request, HTTP status code, request size, response size, cache hit status, User-Agent header, and file type (these correspond to the columns of the table created below).

The table is created in DLA using a regular expression to parse the CDN log: the regex matches each of the fields above separately. For easy copy-and-paste, the sample value of each field and its corresponding regex fragment are listed below:

[9/Jun/2015:01:58:09 +0800] 
-> (-|\\[[^\\]]*\\])

188.165.15.75
-> ([^ ]*)

-
-> ([^ ]*) 

1542 
-> ([^ ]*)

"-"
-> ([^ \"]*|\"[^\"]*\")

"GET http://www.aliyun.com/index.html"
-> ([^ \"]*|\"[^\"]*\")

200 
-> (-|[0-9]*) 

191 
-> (-|[0-9]*)

2830
-> (-|[0-9]*)

MISS
-> ([^ ]*)

"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
-> ([^ \"]*|\"[^\"]*\")

"text/html"
-> ([^ \"]*|\"[^\"]*\")
           

Of course, the order of the fields may differ somewhat in your environment. That is not a problem: simply reorder and combine the per-field regex fragments above to match your log format.

For the sample log above, the complete regular expression is:

(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")
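As a local sanity check, the full regex can be assembled from the per-field fragments and tested against the sample log line outside of DLA, for example with Python's `re` module (this is only a verification sketch, not part of the DLA setup; note the fragments here use single backslashes, i.e. the unescaped form):

```python
import re

# Per-field regex fragments, in the same order as the table columns.
fragments = [
    r'(-|\[[^\]]*\])',        # log_timestamp
    r'([^ ]*)',               # access_ip
    r'([^ ]*)',               # proxy_ip
    r'([^ ]*)',               # response_time
    r'([^ "]*|"[^"]*")',      # referer
    r'([^ "]*|"[^"]*")',      # request
    r'(-|[0-9]*)',            # httpcode
    r'(-|[0-9]*)',            # request_size
    r'(-|[0-9]*)',            # response_size
    r'([^ ]*)',               # cache_hit_status
    r'([^ "]*|"[^"]*")',      # ua_header
    r'([^ "]*|"[^"]*")',      # file_type
]
# The full regex is just the fragments joined by single spaces.
pattern = re.compile(' '.join(fragments))

line = ('[9/Jun/2015:01:58:09 +0800] 188.165.15.75 - 1542 "-" '
        '"GET http://www.aliyun.com/index.html" 200 191 2830 MISS '
        '"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" '
        '"text/html"')

m = pattern.fullmatch(line)
print(m.groups())  # 12 captured fields
```

Because the full regex is nothing more than the fragments joined by spaces, reordering fields in your log format only means reordering this list.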
           

Assume the CDN logs are stored in OSS under the directory:

oss://your_cdn_log_bucket/log/

Then create the table in DLA:

CREATE EXTERNAL TABLE cdn_log (
  log_timestamp           VARCHAR,
  access_ip               VARCHAR,
  proxy_ip                VARCHAR,
  response_time           VARCHAR,
  referer                 VARCHAR,
  request                 VARCHAR,
  httpcode                SMALLINT,
  request_size            BIGINT,
  response_size           BIGINT,
  cache_hit_status        VARCHAR,
  ua_header               VARCHAR,
  file_type               VARCHAR
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE
LOCATION 'oss://your_cdn_log_bucket/log/'
TBLPROPERTIES ('recursive.directories' = 'true');
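Note that every backslash in the `input.regex` value is doubled (`\\[` instead of `\[`): the property value is parsed as a quoted string literal before it reaches the regex engine, so each `\\` collapses to a single `\`. A quick Python illustration of the relationship between the two forms:

```python
# Fragment exactly as written inside the quoted input.regex property:
ddl_fragment = r'(-|\\[[^\\]]*\\])'

# After one level of string-literal unescaping (each '\\' -> '\'),
# this is the regex the SerDe actually applies:
actual_fragment = ddl_fragment.replace('\\\\', '\\')
print(actual_fragment)
```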
           

The DDL above creates a non-partitioned table. If your log data is stored in OSS in a partitioned layout, you can adapt it to a partitioned table accordingly; for details on partitioning, see "Step 4" and "Step 5" of the referenced guide.

Step 2: Query and analyze the CDN logs

SELECT * FROM cdn_log;

-> 
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| log_timestamp                | access_ip     | proxy_ip | response_time | referer | request                          | httpcode | request_size | response_size | cache_hit_status | ua_header                                                   | file_type                 |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| [18/Jun/2019:05:08:33 +0800] | 47.92.115.203 | -        | 777           | "-"     | "GET http://www.kalabandha.com/" |      200 |          201 |          7159 | MISS             | "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))" | "text/html;charset=UTF-8" |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
           

You can use DLA's built-in functions to further process and analyze the individual fields:

https://help.aliyun.com/document_detail/71065.html

For example:

SELECT date_parse(substring(log_timestamp, 2, length(log_timestamp) - 8), '%d/%b/%Y:%H:%i:%s') as log_timestamp,
       access_ip,
       proxy_ip,
       response_time,
       substring(referer, 2, length(referer) - 2) as referer,
       substring(request, 2, length(request) - 2) as request,
       httpcode,
       request_size,
       response_size,
       cache_hit_status,
       substring(ua_header, 2, length(ua_header) - 2) as ua_header,
       substring(file_type, 2, length(file_type) - 2) as file_type 
FROM cdn_log;

->

+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| log_timestamp           | access_ip     | proxy_ip | response_time | referer | request                        | httpcode | request_size | response_size | cache_hit_status | ua_header                                                 | file_type               |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| 2019-06-18 05:08:33.000 | 47.92.115.203 | -        | 777           | -       | GET http://www.kalabandha.com/ |      200 |          201 |          7159 | MISS             | Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D)) | text/html;charset=UTF-8 |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
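The substring arithmetic in the query above (stripping the surrounding brackets or quotes, and the trailing ` +0800]` timezone suffix before parsing the timestamp) can be sanity-checked locally; a rough Python equivalent, purely illustrative and not part of DLA:

```python
from datetime import datetime

raw_ts = '[18/Jun/2019:05:08:33 +0800]'

# substring(log_timestamp, 2, length(log_timestamp) - 8) drops the
# leading '[' (1 char) and the trailing ' +0800]' (7 chars) -- the
# 8 in "length(...) - 8" accounts for both.
trimmed = raw_ts[1:len(raw_ts) - 7]
parsed = datetime.strptime(trimmed, '%d/%b/%Y:%H:%M:%S')

def unquote(s):
    # substring(col, 2, length(col) - 2): strip one leading and one
    # trailing character (the surrounding double quotes).
    return s[1:-1]

print(parsed, unquote('"text/html;charset=UTF-8"'))
```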