Please credit the source when reposting: https://blog.csdn.net/l1028386804/article/details/98945268
I. Environment Preparation
First, for setting up the Hadoop environment, see the post 《Hadoop之——基于3台伺服器搭建Hadoop3.x叢集(實測完整版)》; for installing and configuring Nginx, see 《Nginx+Tomcat+Memcached負載均衡叢集服務搭建》; and for installing and configuring Hive, see 《Hive之——Hive2.3.4 安裝和配置》 and 《Hive之——hive本地模式配置,連接配接mysql資料庫--Hive2.3.3+Hadoop2.9.0+MySQL5.7.18》.
Installing Flume is straightforward: download it, extract the archive, and configure the system environment variables (a sketch follows the download command below). Flume can be downloaded with the following command.
wget http://mirror.bit.edu.cn/apache/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
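Extraction and environment setup can look like the following. This is a minimal sketch; the install path /usr/local/flume-1.9.0 matches the startup command used later in this post, so adjust both together if your layout differs.
# Extract the archive and move it to the install path used later in this post
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /usr/local/
mv /usr/local/apache-flume-1.9.0-bin /usr/local/flume-1.9.0
# Append to /etc/profile (or ~/.bashrc), then apply with: source /etc/profile
export FLUME_HOME=/usr/local/flume-1.9.0
export PATH=$PATH:$FLUME_HOME/bin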
II. Start the Services
#Start Hadoop
start-dfs.sh
start-yarn.sh
#Start Nginx
/usr/local/nginx/sbin/nginx
#Start the Hive command line
hive
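Before moving on, it is worth confirming from another terminal that everything came up. A quick sketch; 192.168.175.200 is the Nginx address used later in this post:
# The HDFS and YARN daemons should appear in the jps listing
# (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager)
jps
# Nginx should answer with an HTTP status line
curl -I http://192.168.175.200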
III. Create the Hive Database and Table
1. Create a database named hive_nginx_log with the following command.
hive> create database hive_nginx_log;
2. Check the format of the Nginx access log, shown below.
192.168.175.10 - - [31/Jul/2019:21:19:39 +0800] "GET /test/sharding HTTP/1.1" 200 798 "http://192.168.175.200/" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
3. Create the table nginx_log in the hive_nginx_log database.
use hive_nginx_log;
CREATE TABLE nginx_log(
  client_ip STRING,
  remote_login_name STRING,
  remote_oauth_user STRING,
  request_time_utf STRING,
  request_method_url STRING,
  status_code STRING,
  send_bytes_size STRING,
  source_access STRING,
  client_info STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;
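Before wiring up Flume, it is worth sanity-checking that the regex actually splits a log line into the expected columns. A minimal sketch, assuming Hive can find the contrib RegexSerDe on its classpath (if not, run ADD JAR on the hive-contrib jar from $HIVE_HOME/lib first); the sample line and the partition value sanity_test are illustrative:
# Save one sample access-log line locally (user agent shortened for brevity)
echo '192.168.175.10 - - [31/Jul/2019:21:19:39 +0800] "GET /test/sharding HTTP/1.1" 200 798 "http://192.168.175.200/" "Mozilla/5.0"' > /tmp/nginx_sample.log
# Load it into a throwaway partition and check the column split
hive -e "USE hive_nginx_log;
LOAD DATA LOCAL INPATH '/tmp/nginx_sample.log' INTO TABLE nginx_log PARTITION (dt='sanity_test');
SELECT client_ip, request_time_utf, request_method_url, status_code FROM nginx_log WHERE dt='sanity_test';"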
4. Create the Flume configuration file flume-hive-nginx-log.conf to tail the Nginx access log and push it into the HDFS directory backing the Hive table nginx_log.
#Define the agent and the names of its source, channel, and sink
myagent.sources = r1
myagent.channels = c1
myagent.sinks = k1
# Configure the source
myagent.sources.r1.type = exec
myagent.sources.r1.channels = c1
# Log file to monitor: tail the Nginx access log
myagent.sources.r1.command = tail -F /usr/local/nginx/logs/access.log
# Number of events to read and commit per batch
# (note: the property name is batchSize, and the source defined above is r1, not s1)
myagent.sources.r1.batchSize = 5
# The following properties only take effect on a spooldir source, not on the
# exec source used here, so they are left commented out:
#myagent.sources.r1.deserializer.outputCharset = UTF-8
#myagent.sources.r1.deserializer.maxLineLength = 1048576
#myagent.sources.r1.fileSuffix = .DONE
#myagent.sources.r1.ignorePattern = access(_\d{4}\-\d{2}\-\d{2}_\d{2})?\.log(\.DONE)?
#myagent.sources.r1.consumeOrder = oldest
#myagent.sources.r1.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder
#Define the channel
myagent.channels.c1.type = memory
myagent.channels.c1.capacity = 10000
myagent.channels.c1.transactionCapacity = 100
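# Note: transactionCapacity is the number of events handled in one transaction
# between the source/sink and the channel; it must not exceed capacity (the
# total number of events the memory channel can buffer). A memory channel
# loses buffered events if the agent process dies.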
#Define the sink
myagent.sinks.k1.type = hdfs
# Escape sequences such as %Y-%m-%d_%H are expanded from the event timestamp
# This path corresponds to the directory backing the Hive table
myagent.sinks.k1.hdfs.path = hdfs://binghe100:9000/user/hive/warehouse/hive_nginx_log.db/nginx_log/%Y-%m-%d_%H
myagent.sinks.k1.hdfs.filePrefix = nginx-%Y-%m-%d_%H
myagent.sinks.k1.hdfs.fileSuffix = .log
myagent.sinks.k1.hdfs.fileType = DataStream
#Do not roll files based on event count
myagent.sinks.k1.hdfs.rollCount = 0
#Roll the file once it reaches rollSize bytes (2914560 ≈ 2.8 MB; 128 MB would be 134217728)
myagent.sinks.k1.hdfs.rollSize = 2914560
#The exec source adds no timestamp header, so use the agent's local time for the % escapes
myagent.sinks.k1.hdfs.useLocalTimeStamp = true
#Wire the source and sink to the channel
myagent.sources.r1.channels = c1
myagent.sinks.k1.channel = c1
5. Start Flume.
flume-ng agent --conf /usr/local/flume-1.9.0/conf --conf-file /usr/local/flume-1.9.0/conf/flume-hive-nginx-log.conf --name myagent -Dflume.root.logger=INFO,console
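This command runs in the foreground and logs to the console. For a longer test it can be pushed to the background while watching the sink's output directory instead. A sketch, assuming the same paths as the configuration above:
nohup flume-ng agent --conf /usr/local/flume-1.9.0/conf --conf-file /usr/local/flume-1.9.0/conf/flume-hive-nginx-log.conf --name myagent > /tmp/flume-nginx.log 2>&1 &
# New files should appear under the hour-stamped directories as events arrive
hdfs dfs -ls -R /user/hive/warehouse/hive_nginx_log.db/nginx_log/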
6. Access Nginx by opening http://192.168.175.200 in a browser (or from the shell, as sketched below).
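If no browser is at hand, a few curl requests work just as well. A sketch; the address is the Nginx server from this environment:
# Generate some traffic so Flume has lines to ship
for i in $(seq 1 20); do curl -s -o /dev/null http://192.168.175.200/; done
# Confirm the requests landed in the log file that Flume is tailing
tail -n 5 /usr/local/nginx/logs/access.log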
Flume's console output then includes information like the following.
hdfs://binghe100:9000/user/hive/warehouse/hive_nginx_log.db/nginx_log/2019-07-31_23/nginx-2019-07-31_23.1564589089324.log
7. Flume writes into a new date-and-hour directory, so register that directory as a partition of nginx_log, then query the table.
ALTER TABLE nginx_log ADD IF NOT EXISTS PARTITION (dt='2019-07-31_23') LOCATION '/user/hive/warehouse/hive_nginx_log.db/nginx_log/2019-07-31_23/';
hive> SELECT * FROM nginx_log;
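Since the sink starts a new %Y-%m-%d_%H directory every hour, each new directory must be registered before Hive can see its data. A small sketch that could run from cron at the top of every hour; the paths mirror the configuration above, so adjust them to your warehouse layout:
#!/bin/bash
# Register the current hour's Flume output directory as a partition of nginx_log
dt=$(date +%Y-%m-%d_%H)
hive -e "USE hive_nginx_log; ALTER TABLE nginx_log ADD IF NOT EXISTS PARTITION (dt='${dt}') LOCATION '/user/hive/warehouse/hive_nginx_log.db/nginx_log/${dt}/';"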