Today I'd like to share a quick walkthrough: four small examples, a Scala version and a Java version each of PV and UV.
When learning Scala, it helps to remember that Scala and Java interoperate freely.
Without further ado, here are today's requirements:
1. PV (page views): the total number of page requests; here it is simply the number of log records.
2. UV (unique visitors): the number of distinct users. Deduplicate the IP addresses in the log data, then count them.
First, we prepare the clickstream data that has already been collected.
It looks roughly like the sample below. I'm showing about 20 records here for reference, but to demonstrate Spark's power the actual dataset is closer to 10,000 records.
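Before bringing in Spark, the two metrics can be illustrated with plain Scala collections. A minimal sketch using three made-up log lines (hypothetical data, not the real dataset):

```scala
// Three toy "log lines": only the leading IP field matters for this sketch
val lines = Seq(
  "1.1.1.1 - - [stub] \"GET / HTTP/1.1\" 200 10",
  "1.1.1.1 - - [stub] \"GET /a HTTP/1.1\" 200 10",
  "2.2.2.2 - - [stub] \"GET / HTTP/1.1\" 200 10"
)

// PV: every log record counts as one page view
val pv = lines.size

// UV: take the first space-separated field (the IP), deduplicate, count
val uv = lines.map(_.split(" ")(0)).distinct.size

println(s"pv=$pv, uv=$uv")  // pv=3, uv=2
```

The Spark versions below do exactly this, only with RDD operations instead of collection methods.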
194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1" 304 0 "-" "Mozilla/4.0 (compatible;)"
183.49.46.228 - - [18/Sep/2013:06:49:23 +0000] "-" 400 0 "-" "-"
163.177.71.12 - - [18/Sep/2013:06:49:33 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
163.177.71.12 - - [18/Sep/2013:06:49:36 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
101.226.68.137 - - [18/Sep/2013:06:49:42 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
101.226.68.137 - - [18/Sep/2013:06:49:45 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
60.208.6.156 - - [18/Sep/2013:06:49:48 +0000] "GET /wp-content/uploads/2013/07/rcassandra.png HTTP/1.0" 200 185524 "http://cos.name/category/software/packages/" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2013:06:49:57 +0000] "GET /images/my.jpg HTTP/1.1" 200 19939 "http://www.angularjs.cn/A00n" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
222.68.172.190 - - [18/Sep/2013:06:50:08 +0000] "-" 400 0 "-" "-"
183.195.232.138 - - [18/Sep/2013:06:50:16 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
183.195.232.138 - - [18/Sep/2013:06:50:16 +0000] "HEAD / HTTP/1.1" 200 20 "-" "DNSPod-Monitor/1.0"
66.249.66.84 - - [18/Sep/2013:06:50:28 +0000] "GET /page/6/ HTTP/1.1" 200 27777 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
221.130.41.168 - - [18/Sep/2013:06:50:37 +0000] "GET /feed/ HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.66 Safari/537.36"
157.55.35.40 - - [18/Sep/2013:06:51:13 +0000] "GET /robots.txt HTTP/1.1" 200 150 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
50.116.27.194 - - [18/Sep/2013:06:51:35 +0000] "POST /wp-cron.php?doing_wp_cron=1379487095.2510800361633300781250 HTTP/1.0" 200 0 "-" "WordPress/3.6; http://blog.fens.me"
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /nodejs-socketio-chat/ HTTP/1.1" 200 10818 "http://www.google.com/url?sa=t&rct=j&q=nodejs%20%E5%BC%82%E6%AD%A5%E5%B9%BF%E6%92%AD&source=web&cd=1&cad=rja&ved=0CCgQFjAA&url=%68%74%74%70%3a%2f%2f%62%6c%6f%67%2e%66%65%6e%73%2e%6d%65%2f%6e%6f%64%65%6a%73%2d%73%6f%63%6b%65%74%69%6f%2d%63%68%61%74%2f&ei=rko5UrylAefOiAe7_IGQBw&usg=AFQjCNG6YWoZsJ_bSj8kTnMHcH51hYQkAA&bvm=bv.52288139,d.aGc" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:36 +0000] "GET /wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:35 +0000] "GET /wp-includes/js/jquery/jquery.js?ver=1.10.2 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:36 +0000] "GET /wp-includes/js/comment-reply.min.js?ver=3.6 HTTP/1.1" 304 0 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
58.215.204.118 - - [18/Sep/2013:06:51:36 +0000] "GET /wp-content/uploads/2013/08/chat.png HTTP/1.1" 200 48968 "http://blog.fens.me/nodejs-socketio-chat/" "Mozilla/5.0 (Windows NT 5.1; rv:23.0) Gecko/20100101 Firefox/23.0"
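Each record starts with the client IP followed by a space, so splitting on spaces and taking the first field recovers it. A quick sanity check against the first sample line above (plain Scala, no Spark needed):

```scala
// First sample record from the dataset above
val record = "194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] \"GET /wp-content/uploads/2013/07/rstudio-git3.png HTTP/1.1\" 304 0 \"-\" \"Mozilla/4.0 (compatible;)\""

// The IP address is the first space-separated field
val ip = record.split(" ")(0)

println(ip)  // 194.237.142.21
```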
Here is the Java version of PV:
public static void main(String[] args) {
    // 1. Create the SparkConf object; set the appName and master URL
    SparkConf sparkConf = new SparkConf().setAppName("java-PV-APP").setMaster("local[2]");
    // 2. Create the SparkContext object
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    // 3. Read the data file
    JavaRDD<String> contentRDD = jsc.textFile("D:\\AA\\input\\access.log");
    // 4. Map each input line to the pair ("pv", 1)
    JavaPairRDD<String, Integer> pvAndOneRDD = contentRDD.mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) throws Exception {
            return new Tuple2<String, Integer>("pv", 1);
        }
    });
    // 5. Aggregate the counts
    JavaPairRDD<String, Integer> resultRDD = pvAndOneRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer v1, Integer v2) throws Exception {
            return v1 + v2;
        }
    });
    // 6. Collect the result
    Map<String, Integer> finalResult = resultRDD.collectAsMap();
    System.out.println(finalResult);
    // 7. Shut down the SparkContext
    jsc.stop();
}
Here is the test result:
And this is the Scala version of PV:
/**
 * Spark clickstream log analysis: PV
 */
object PVScala {
  // Entry point
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf object; set the appName and master URL
    val conf: SparkConf = new SparkConf().setAppName("scala-PV-APP").setMaster("local[2]")
    // 2. Create the SparkContext object
    val context: SparkContext = new SparkContext(conf)
    context.setLogLevel("WARN")
    // 3. Read the data file
    val contentRDD: RDD[String] = context.textFile("D:\\AA\\input\\access.log")
    // 4. Map each input line to the pair ("pv", 1)
    val pvAndOneRDD: RDD[(String, Int)] = contentRDD.map(x => ("pv", 1))
    // 5. Aggregate the counts
    val resultRDD: RDD[(String, Int)] = pvAndOneRDD.reduceByKey(_ + _)
    // 6. Collect the result
    val finalResult: Array[(String, Int)] = resultRDD.collect
    println(finalResult.toBuffer)
    // 7. Shut down the SparkContext
    context.stop()
  }
}
The test result is as follows:
As you can see, the Java and Scala versions produce the same result.
Now let's look at UV.
First, the Java version:
public static void main(String[] args) {
    // 1. Create the SparkConf object; set the appName and master URL
    SparkConf sparkConf = new SparkConf().setAppName("java-UV-APP").setMaster("local[2]");
    // 2. Create the SparkContext object
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);
    // 3. Read the data file
    JavaRDD<String> contentRDD = jsc.textFile("D:\\AA\\input\\access.log");
    // 4. Extract the IP address from each line
    JavaRDD<String> ipsRDD = contentRDD.map(new Function<String, String>() {
        public String call(String v1) throws Exception {
            String[] arr = v1.split(" ");
            return arr[0];
        }
    });
    // 5. Deduplicate the IP addresses, then emit the pair ("uv", 1)
    JavaPairRDD<String, Integer> uvAndOneRDD = ipsRDD.distinct().mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String s) throws Exception {
            return new Tuple2<String, Integer>("uv", 1);
        }
    });
    // 6. Aggregate the counts
    JavaPairRDD<String, Integer> resultRDD = uvAndOneRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer v1, Integer v2) throws Exception {
            return v1 + v2;
        }
    });
    // 7. Collect the result
    Map<String, Integer> stringIntegerMap = resultRDD.collectAsMap();
    System.out.println(stringIntegerMap);
    // 8. Shut down the SparkContext
    jsc.stop();
}
And here is the deduplicated total:
Next comes the Scala version:
object UVScala {
  // Entry point
  def main(args: Array[String]): Unit = {
    // 1. Create the SparkConf object; set the appName and master URL
    val conf: SparkConf = new SparkConf().setAppName("scala-UV-APP").setMaster("local[2]")
    // 2. Create the SparkContext object
    val context: SparkContext = new SparkContext(conf)
    context.setLogLevel("WARN")
    // 3. Read the data file
    val contentRDD: RDD[String] = context.textFile("D:\\AA\\input\\access.log")
    // 4. Extract the IP address from each line
    val ipsRDD: RDD[String] = contentRDD.map(_.split(" ")).map(x => x(0))
    // 5. Deduplicate the IP addresses, then emit the pair ("uv", 1)
    val uvAndOneRDD: RDD[(String, Int)] = ipsRDD.distinct().map(x => ("uv", 1))
    // 6. Aggregate the counts
    val resultRDD: RDD[(String, Int)] = uvAndOneRDD.reduceByKey(_ + _)
    // 7. Collect the result
    val finalResult: Array[(String, Int)] = resultRDD.collect
    println(finalResult.toBuffer)
    // 8. Shut down the SparkContext
    context.stop()
  }
}
Let's see whether the result matches the Java version:
As the tests show, the two versions produce identical results.
Comparing the two, the Scala code is the more elegant and concise of the pair, and the magical "_" placeholder syntax makes it cleaner still.
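For instance, `reduceByKey(_ + _)` is just shorthand for an explicit two-argument function. The equivalence can be seen with plain Scala collections, no Spark required:

```scala
val nums = List(1, 2, 3, 4)

// Explicit anonymous function
val sumExplicit = nums.reduce((a, b) => a + b)

// Same thing with the placeholder syntax: each "_" stands in for one argument
val sumSugar = nums.reduce(_ + _)

println(sumExplicit == sumSugar)  // true
```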
That's it for today's quick share on PV and UV. Next time I plan to give a simple walkthrough of TopN, i.e. finding the N pages with the highest traffic.