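Implementation with Spark Streaming + Kafka
Suppose a user has log files recording how long netizens spent shopping online over one weekend. To meet a business requirement, develop
a Spark application that provides the following: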
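1. Report, in real time, the female netizens whose cumulative online-shopping time exceeds two hours (120 minutes, the threshold used in the code).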
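2. In the log files for the two weekend days, the first column is the name, the second is the gender, and the third is the dwell time of that visit
in minutes. The delimiter is ",".
Data:
log1.txt: Saturday dwell-time log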
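The record format described above (name, gender, minutes, comma-separated) can be parsed with a small helper. This is only a sketch: `Visit` and `VisitParser.parseVisit` are hypothetical names introduced for illustration and are not part of the original program.

```scala
import scala.util.Try

// Hypothetical record type for one log line: name, gender, dwell time in minutes.
final case class Visit(name: String, gender: String, minutes: Int)

object VisitParser {
  // Parses "name,gender,minutes"; returns None when the line does not have
  // exactly three fields or the third field is not an integer.
  def parseVisit(line: String): Option[Visit] =
    line.trim.split(",") match {
      case Array(name, gender, minutes) =>
        Try(minutes.trim.toInt).toOption.map(m => Visit(name.trim, gender.trim, m))
      case _ => None
    }
}
```

Returning an `Option` lets a caller drop malformed lines with `flatMap` instead of failing the whole batch.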
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
log2.txt: Sunday dwell-time log
LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
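To know what the streaming job should eventually report, the totals can be computed offline from the sample data. The sketch below uses plain Scala collections, not Spark; `ExpectedTotals` and `femaleTotals` are hypothetical names, and the 120-minute threshold is the one used in the filter of the Scala program.

```scala
object ExpectedTotals {
  // The two sample logs, one record per "name,gender,minutes" token.
  val saturday: String =
    "LiuYang,female,20 YuanJing,male,10 GuoYijun,male,5 CaiXuyu,female,50 " +
    "Liyuan,male,20 FangBo,female,50 LiuYang,female,20 YuanJing,male,10 " +
    "GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60"

  val sunday: String =
    "LiuYang,female,20 YuanJing,male,10 CaiXuyu,female,50 FangBo,female,50 " +
    "GuoYijun,male,5 CaiXuyu,female,50 Liyuan,male,20 CaiXuyu,female,50 " +
    "FangBo,female,50 LiuYang,female,20 YuanJing,male,10 FangBo,female,50 " +
    "GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60"

  // Sums the minutes of every female record, keyed by name.
  def femaleTotals(records: String): Map[String, Int] =
    records.split(" ").toSeq
      .map(_.split(","))
      .collect { case Array(name, "female", minutes) => name -> minutes.toInt }
      .groupBy(_._1)
      .map { case (name, pairs) => name -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val weekend = femaleTotals(saturday + " " + sunday)
    // Prints (CaiXuyu,300) and (FangBo,320); LiuYang totals 80 and is filtered out.
    weekend.filter(_._2 > 120).toSeq.sortBy(_._1).foreach(println)
  }
}
```

On Saturday's log alone no one passes the threshold (CaiXuyu 100, FangBo 110, LiuYang 40), so the streaming job is expected to print results only after the second log has been sent.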
Implementation steps
I. pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.xu.sparktest1</groupId>
    <artifactId>sparktest1</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <spark.version>2.1.0</spark.version>
        <scala.version>2.11</scala.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Keep the Kafka connector on the same Spark and Scala versions as the other Spark modules. -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-8_${scala.version}</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
II. Scala code
package com.xu

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{HashPartitioner, SparkConf}

object SparkStreamTest5 {

  // State update for updateStateByKey: add this batch's minutes to each user's running total.
  val updateFunction = (iter: Iterator[(String, Seq[Int], Option[Int])]) =>
    iter.map { case (name, minutes, total) => (name, minutes.sum + total.getOrElse(0)) }

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("SparkHomeWork2").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.sparkContext.setLogLevel("WARN")
    // updateStateByKey needs a checkpoint directory; "." is the current working directory.
    ssc.checkpoint(".")

    // Kafka connection parameters. The receiver-based createStream connects through ZooKeeper.
    val zk = "node01:2181/kafka"
    val sourceTopic = "sparkhomework-test4"
    val consumerGroup = "sparkhomework2"
    // Topic name -> number of receiver threads.
    val topicMap = sourceTopic.split(",").map((_, 1)).toMap

    // Each Kafka message arrives as a (key, value) pair; keep only the value.
    val lines = KafkaUtils.createStream(ssc, zk, consumerGroup, topicMap).map(_._2)

    // Records within a message are separated by spaces; keep the female records.
    val results = lines.flatMap(_.split(" ")).filter(_.contains("female"))

    // Sum the dwell time per user within the current batch.
    val femaleData: DStream[(String, Int)] = results.map { line =>
      val t = line.split(',')
      (t(0), t(2).toInt)
    }.reduceByKey(_ + _)

    // Accumulate across batches, then keep and print the female users whose total exceeds two hours (120 minutes).
    val data = femaleData
      .updateStateByKey(updateFunction, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)
      .filter(_._2 > 120)
    data.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
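How `updateStateByKey` carries totals from one batch to the next can be illustrated without a cluster. The sketch below is a simplified model in plain Scala, not Spark's implementation: `StateSimulation`, `update` and `step` are hypothetical names, and the two batches are the per-batch female totals derived from the sample logs, assuming each log arrives in its own batch.

```scala
object StateSimulation {
  // Same arithmetic as updateFunction in the job: new values plus the previous total.
  def update(newValues: Seq[Int], previous: Option[Int]): Int =
    newValues.sum + previous.getOrElse(0)

  // Applies one batch to the state. Keys that already have state are revisited
  // even when the batch holds no new values for them.
  def step(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2) }
    (state.keySet ++ grouped.keySet).map { k =>
      k -> update(grouped.getOrElse(k, Seq.empty), state.get(k))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("LiuYang" -> 40, "CaiXuyu" -> 100, "FangBo" -> 110) // from log1
    val batch2 = Seq("LiuYang" -> 40, "CaiXuyu" -> 200, "FangBo" -> 210) // from log2

    val afterFirst = step(Map.empty, batch1)
    val afterSecond = step(afterFirst, batch2)

    println(afterFirst.filter(_._2 > 120))  // empty: nobody exceeds 120 minutes yet
    println(afterSecond.filter(_._2 > 120)) // CaiXuyu -> 300 and FangBo -> 320
  }
}
```

Because the state lives across batches, the filter has to be applied after the state update, as the job does; filtering the per-batch sums first would discard users who only cross the threshold cumulatively.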
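III. Start Kafka
ZooKeeper must be running before the Kafka cluster is started.
Start ZooKeeper
Run the following commands on each of the three servers to start ZooKeeper and check its status: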
cd /export/servers/zookeeper-3.4.5-cdh5.14.0
bin/zkServer.sh start
bin/zkServer.sh status
Start Kafka by running the following on each of the three servers:
cd /export/servers/kafka_2.11-0.10.0.0
nohup bin/kafka-server-start.sh config/server.properties 2>&1 &
To stop Kafka, run the following on each of the three servers:
cd /export/servers/kafka_2.11-0.10.0.0
bin/kafka-server-stop.sh
IV. Prepare the data
Collapse each of the logs above into a single line, with the records separated by spaces, as follows:
LiuYang,female,20 YuanJing,male,10 GuoYijun,male,5 CaiXuyu,female,50 Liyuan,male,20 FangBo,female,50 LiuYang,female,20 YuanJing,male,10 GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60
LiuYang,female,20 YuanJing,male,10 CaiXuyu,female,50 FangBo,female,50 GuoYijun,male,5 CaiXuyu,female,50 Liyuan,male,20 CaiXuyu,female,50 FangBo,female,50 LiuYang,female,20 YuanJing,male,10 FangBo,female,50 GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60
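The two lines above can be produced by hand, or with a few lines of Scala. This is a minimal sketch that assumes log1.txt and log2.txt sit in the working directory; `FlattenLog` and `flatten` are hypothetical names.

```scala
import scala.io.Source

object FlattenLog {
  // Joins the non-empty lines of a log into one space-separated line.
  def flatten(lines: Iterator[String]): String =
    lines.map(_.trim).filter(_.nonEmpty).mkString(" ")

  def main(args: Array[String]): Unit = {
    // Assumption: both files are in the current working directory.
    for (path <- Seq("log1.txt", "log2.txt")) {
      val source = Source.fromFile(path)
      try println(flatten(source.getLines()))
      finally source.close()
    }
  }
}
```

Each printed line can then be pasted into the console producer as a single Kafka message.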
V. Create the Kafka topic
cd /export/servers/kafka_2.11-0.10.0.0
# Create the topic
bin/kafka-topics.sh --create --topic sparkhomework-test4 --replication-factor 1 --partitions 3 --zookeeper node01:2181/kafka
# Start a console producer
bin/kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic sparkhomework-test4
Note: in my setup the ZooKeeper path for Kafka is node01:2181/kafka. It is configured in config/server.properties (the zookeeper.connect setting).
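VI. Test
Run the main method of the Scala program, then enter the contents of log1 and log2 (the two single-line versions from the data-preparation step) one after the other in the producer console.
Result:
If the program keeps printing after you enter the logs in the Linux console but the result records (highlighted in red in the original output) never appear, the topic is misconfigured and the program has not connected to it correctly.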