Cluster Planning
For the Flume cluster in load-balancing and failover mode, I prepared three machines with Flume installed. webapp200 is the application server; Flume runs there to collect the application server's logs and forwards them through two avro sinks to the flume130 and flume131 machines, which in turn write the data to HDFS. (Note: for high-throughput workloads, the file channels can be swapped for Kafka channels.)
webapp200: TAILDIR -> file -> avro
flume130: avro -> file -> hdfs
flume131: avro -> file -> hdfs
Flow Diagram
Download and Install
Download address
Official site: http://flume.apache.org/
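For example, version 1.9.0 can be fetched straight from the Apache archive (the URL follows the standard archive layout; substitute another version if needed):
$ wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz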
Extract
Extract to the /opt/module/ directory:
$ tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
Configure Environment Variables
Configure JAVA_HOME
Rename the configuration file:
$ mv flume-env.sh.template flume-env.sh
Edit flume-env.sh:
$ vi conf/flume-env.sh
Set JAVA_HOME to your own JDK path:
export JAVA_HOME=/opt/module/jdk1.8.0_221
After configuring, distribute Flume to the other machines.
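A minimal sketch of that distribution step, assuming SSH access to the hostnames from the cluster plan:
$ scp -r /opt/module/apache-flume-1.9.0-bin flume130:/opt/module/
$ scp -r /opt/module/apache-flume-1.9.0-bin flume131:/opt/module/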
Configure the Agents
Agent on webapp200
Create taildir-file-avro.conf (the name used by the startup command below) and add the following content:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /opt/module/apache-flume-1.9.0-bin/position/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/logs/info*.log*

# Describe the sinkgroups
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin
a1.sinkgroups.g1.processor.selector.maxTimeOut = 10000

# Define the sink k1
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = flume130
a1.sinks.k1.port = 4545

# Define the sink k2
a1.sinks.k2.type = avro
a1.sinks.k2.channel = c1
a1.sinks.k2.hostname = flume131
a1.sinks.k2.port = 4545

# Use a channel which buffers events on local disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/apache-flume-1.9.0-bin/data/checkpoint/balance
a1.channels.c1.dataDirs = /opt/module/apache-flume-1.9.0-bin/data/balance
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
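The sink group above uses the load_balance processor, which covers the load-balancing half of the plan. For pure failover (all events go to the highest-priority sink, and the other takes over only when it fails), the processor block can be swapped out as in this sketch (the priority values are illustrative):
# Alternative: failover instead of load balancing
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000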
flume130 and flume131 share the same Agent configuration:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4545

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://hadoop100:9000/flume/events/%y-%m-%d/%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 134217700
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour

# Use a channel which buffers events on local disk
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/apache-flume-1.9.0-bin/data/checkpoint/balance
a1.channels.c1.dataDirs = /opt/module/apache-flume-1.9.0-bin/data/balance
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
Note: the HDFS sink needs the corresponding Hadoop jars and XML configuration files placed under the Flume directory. Download and usage link (tested by the author; mind the versions): flume hdfs sink required jars (flume 1.9.0, hadoop 3.1.2).
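If a Hadoop installation is available on the machine, one common approach is to copy the jars and site files from it; a sketch, assuming an install at /opt/module/hadoop-3.1.2 (the linked list above covers the full set of dependency jars):
$ cp /opt/module/hadoop-3.1.2/share/hadoop/common/hadoop-common-3.1.2.jar /opt/module/apache-flume-1.9.0-bin/lib/
$ cp /opt/module/hadoop-3.1.2/etc/hadoop/core-site.xml /opt/module/apache-flume-1.9.0-bin/conf/
$ cp /opt/module/hadoop-3.1.2/etc/hadoop/hdfs-site.xml /opt/module/apache-flume-1.9.0-bin/conf/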
Start Flume
Startup command:
$ bin/flume-ng agent -n a1 -c conf -f job/taildir-file-avro.conf
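Start the collector agents on flume130 and flume131 first, so that port 4545 is already listening when webapp200 comes up; assuming their configuration from the previous section is saved as job/avro-file-hdfs.conf (an illustrative name):
$ bin/flume-ng agent -n a1 -c conf -f job/avro-file-hdfs.conf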
To run it in the background, append & at the end:
$ nohup bin/flume-ng agent -n a1 -c conf -f job/taildir-file-avro.conf &
Prefixing the command with nohup additionally redirects the log output that would otherwise go to the console into nohup.out under the current working directory.
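Once all three agents are up and new lines are appended to /opt/logs/info*.log*, the result can be verified on HDFS (the date/hour subdirectories follow the hdfs.path pattern configured above):
$ hdfs dfs -ls /flume/events/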
Stop Flume
Flume has no shutdown command; once started, the process can only be killed.
Find the PID of the process occupying port 4545:
$ netstat -nap | grep 4545
Or find the Flume process directly with jps, then kill it:
$ kill [pid]
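Since flume-ng runs under the Application main class, this one-liner can work as a shortcut (check the jps output first if several Java processes match):
$ kill $(jps | grep Application | awk '{print $1}')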