概述
CDN産生大量日志,可以進行
日志轉存到OSS bucket中。也可以進行
實時日志推送, 推送到日志服務SLS中進行實時分析,也可以在日志服務SLS中配置OSS日志投遞,将海量CDN日志長期儲存在OSS中。在OSS中的海量CDN日志,利用Data Lake Analytics進行分析。
![](https://img.laitimes.com/img/__Qf2AjLwojIjJCLyojI0JCLicGcq5SZhR2M0YWY3gDO1ITN0kDO3EWM0ETY0gzM3Q2NkRmZx8CX5d2bs92Yl1iclB3bsVmdlR2LcNWaw9CXt92Yu4GZjlGbh5yYjV3Lc9CX6MHc0RHaiojIsJye.jpg)
- CDN日志到日志服務的實時日志推送可以參考: https://help.aliyun.com/document_detail/121169.html
- 日志服務投遞OSS鍊路和DLA的OSS日志分析可以參考: https://yq.aliyun.com/articles/705888
假設CDN的原始日志(不是進過日志服務投遞OSS鍊路投遞到OSS的,如果是日志服務投遞到OSS的,請參考
), 通過轉存、上傳,直接上傳到OSS,文本下面部分介紹在DLA中建表分析的步驟。
步驟一:針對OSS上CDN日志建表
假設您已經将CDN日志傳到OSS上,每行日志如下示例:
[9/Jun/2015:01:58:09 +0800] 188.165.15.75 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "text/html"
其中每個字段含義為:
在DLA中建表,采用正規表達式的方式對CDN日志進行解析,正規表達式是對上述字段進行分别識别,每個字段和對應的正規表達式片段為:
為友善您複制粘貼,上述對應字段的樣例和對應的正規表達式片段對應如下:
[9/Jun/2015:01:58:09 +0800]
-> (-|\\[[^\\]]*\\])
188.165.15.75
-> ([^ ]*)
-
-> ([^ ]*)
1542
-> ([^ ]*)
"-"
-> ([^ \"]*|\"[^\"]*\")
"GET http://www.aliyun.com/index.html"
-> ([^ \"]*|\"[^\"]*\")
200
-> (-|[0-9]*)
191
-> (-|[0-9]*)
2830
-> (-|[0-9]*)
MISS
-> ([^ ]*)
"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)"
-> ([^ \"]*|\"[^\"]*\")
"text/html"
-> ([^ \"]*|\"[^\"]*\")
當然,上述字段的順序可能根據實際情況有些變化,不過沒有關系,可以根據上述字段對應正則進行順序調整組合。
對于上述樣例日志,對應的正規表達式為:
(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")
假設在OSS上,存儲的CDN日志目錄為:
oss://your_cdn_log_bucket/log/
則在DLA中建表:
CREATE EXTERNAL TABLE cdn_log (
log_timestamp VARCHAR,
access_ip VARCHAR,
proxy_ip VARCHAR,
response_time VARCHAR,
referer VARCHAR,
request VARCHAR,
httpcode SMALLINT,
request_size BIGINT,
response_size BIGINT,
cache_hit_status VARCHAR,
ua_header VARCHAR,
file_type VARCHAR
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE
LOCATION 'oss://your_cdn_log_bucket/log/'
TBLPROPERTIES ('recursive.directories' = 'true');
上述建表為非分區表,如果在OSS日志資料存儲為分區模式,可以對應修改為分區表模式,關于分區,可以參考:
中的“步驟四”和“步驟五”。
步驟二:查詢、分析CDN日志
SELECT * FROM cdn_log;
->
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| log_timestamp | access_ip | proxy_ip | response_time | referer | request | httpcode | request_size | response_size | cache_hit_status | ua_header | file_type |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| [18/Jun/2019:05:08:33 +0800] | 47.92.115.203 | - | 777 | "-" | "GET http://www.kalabandha.com/" | 200 | 201 | 7159 | MISS | "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))" | "text/html;charset=UTF-8" |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
可以利用DLA的系統函數,對資料字段進行分析:
https://help.aliyun.com/document_detail/71065.html例如:
SELECT date_parse(substring(log_timestamp, 2, length(log_timestamp) - 8), '%d/%b/%Y:%H:%i:%s') as log_timestamp,
access_ip,
proxy_ip,
response_time,
substring(referer, 2, length(referer) - 2) as referer,
substring(request, 2, length(request) - 2) as request,
httpcode,
request_size,
response_size,
cache_hit_status,
substring(ua_header, 2, length(ua_header) - 2) as ua_header,
substring(file_type, 2, length(file_type) - 2) as file_type
FROM cdn_log;
->
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| log_timestamp | access_ip | proxy_ip | response_time | referer | request | httpcode | request_size | response_size | cache_hit_status | ua_header | file_type |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| 2019-06-18 05:08:33.000 | 47.92.115.203 | - | 777 | - | GET http://www.kalabandha.com/ | 200 | 201 | 7159 | MISS | Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D)) | text/html;charset=UTF-8 |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+