Exploring Big Data: Building a Hadoop Cluster on Raspberry Pi with Apache Spark on YARN

A Hadoop cluster made of three Raspberry Pi nodes

Like many contentious topics, the line between big data and small data is often summed up with a joke:

If it fits in RAM, it's not big data. (Anonymous)

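That leaves two ways to tackle the problem:

We can find a dataset so large that it won't fit in the physical or virtual memory of any home computer.

We can buy computers that the data we already have would overwhelm without any special treatment:

Enter the Raspberry Pi 2B.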

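This delightful little device, built by designers and engineers, has 1 GB of RAM and uses a microSD card as its hard drive. Each one costs less than $50, which means you can build a Hadoop cluster for under $250.

There may be no cheaper ticket into the world of big data.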

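My favorite kind of raw materials.

Here are links to the parts I originally bought to build the Raspberry Pi cluster. If you plan to buy from Amazon later, bookmark these links first; it also supports this site a little. (Thanks!)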

<a href="http://amzn.to/2bto1br" target="_blank">4-layer acrylic stand</a>

Double-sided tape; I had some 3M tape on hand, and it works well.

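First, mount the three Raspberry Pis, screwing each one onto its own acrylic panel. (See the picture below.)

Next, mount the Ethernet switch by taping it to one of the acrylic panels with double-sided tape.

Tape the USB adapter to another acrylic panel with double-sided tape; this panel becomes the top layer.

Then stack the layers one by one. I chose to put the Raspberry Pis beneath the switch and the USB adapter (see the two pictures of the finished build).

Route the cables where they need to go. If you bought the same USB and network cables I did, you can coil them up and tuck them into each layer of the acrylic stand.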

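Don't power anything up yet; you need to flash the operating system onto the SD cards before continuing.

Insert one of the flashed SD cards into the Raspberry Pi you want to use as the master node, connect the USB cable, and boot it.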

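There are a few things to watch for during setup, and I'll walk you through them all the way to the last step. Keep in mind that I'm using the IP range 192.168.1.50 to 192.168.1.52: the master node is .50 and the slave nodes are .51 and .52. Your network will probably differ; if you want to set static IPs, check or ask in the comments.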

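Once you've finished those steps, the next thing to do is enable a swap file. Spark on YARN runs very close to the limit of the Pi's 1 GB of RAM, so the swap file gives the system something to fall back on when memory is nearly exhausted.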
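Before moving on, it can be worth confirming that the swap file is actually active on each node. The following is only a minimal sketch, assuming a Linux system that exposes `/proc/meminfo`; the helper names `parse_meminfo` and `swap_total_mb` and the sample text are made up for illustration and do not come from any tutorial referenced here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_total_mb(text):
    """Return the configured swap size in MB (0 if no swap is enabled)."""
    return parse_meminfo(text).get("SwapTotal", 0) // 1024


# Hypothetical sample; on a real node, read the text from /proc/meminfo.
SAMPLE = """MemTotal:         947732 kB
MemFree:          512000 kB
SwapTotal:       1048572 kB
SwapFree:        1048572 kB
"""

if __name__ == "__main__":
    print("swap (MB):", swap_total_mb(SAMPLE))
```

On a node, you could pass `open('/proc/meminfo').read()` instead of `SAMPLE`; a result of 0 would suggest the swap file has not been enabled.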

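Now I'll go over a few subtle differences between my setup and the one in the 'Because We Can Geek' guide.

For starters, make sure you give each Raspberry Pi a proper name, set in <code>/etc/hostname</code>. My master node is named 'raspberrypihadoopmaster' and the slave nodes are named 'raspberrypihadoopslave#'.

The master node's <code>/etc/hosts</code> looks like this: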

<code>#/etc/hosts</code>

<code>127.0.0.1 localhost</code>

<code>::1 localhost ip6-localhost ip6-loopback</code>

<code>ff02::1 ip6-allnodes</code>

<code>ff02::2 ip6-allrouters</code>

<code>192.168.1.50 raspberrypihadoopmaster</code>

<code>192.168.1.51 raspberrypihadoopslave1</code>

<code>192.168.1.52 raspberrypihadoopslave2</code>
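If you ever add more nodes, typing these entries by hand gets tedious. Here is a small sketch that generates the cluster lines, assuming the slaves use consecutive addresses right after the master (as mine do). The function name `cluster_hosts` is just an illustration; adjust the names and addresses to your own network.

```python
import ipaddress


def cluster_hosts(master_ip, slave_count,
                  master_name="raspberrypihadoopmaster",
                  slave_prefix="raspberrypihadoopslave"):
    """Return /etc/hosts lines for one master followed by consecutive slave IPs."""
    base = ipaddress.IPv4Address(master_ip)
    lines = ["{} {}".format(base, master_name)]
    for i in range(1, slave_count + 1):
        # IPv4Address supports integer addition, giving the next address.
        lines.append("{} {}{}".format(base + i, slave_prefix, i))
    return lines


if __name__ == "__main__":
    for line in cluster_hosts("192.168.1.50", 2):
        print(line)
```

The same lines should go on every node, so each machine can resolve all the others by name.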

For Hadoop, YARN, and Spark to run properly, you'll also need to modify the following configuration files (you may as well edit them now).

Here is <code>hdfs-site.xml</code>:

<code><name>fs.default.name</name></code>

<code><value>hdfs://raspberrypihadoopmaster:54310</value></code>

<code><name>hadoop.tmp.dir</name></code>

Here is <code>yarn-site.xml</code> (note the memory-related changes):

<code><name>yarn.nodemanager.aux-services</name></code>

<code><value>mapreduce_shuffle</value></code>

<code><name>yarn.nodemanager.resource.cpu-vcores</name></code>

<code><name>yarn.nodemanager.resource.memory-mb</name></code>

<code><name>yarn.scheduler.minimum-allocation-mb</name></code>

<code><name>yarn.scheduler.maximum-allocation-mb</name></code>

<code><name>yarn.scheduler.minimum-allocation-vcores</name></code>

<code><name>yarn.scheduler.maximum-allocation-vcores</name></code>

<code><name>yarn.nodemanager.vmem-check-enabled</name></code>

<code><value>false</value></code>

<code><description>whether virtual memory limits will be enforced for containers</description></code>

<code><name>yarn.nodemanager.vmem-pmem-ratio</name></code>

<code><description>ratio between virtual memory to physical memory when setting memory limits for containers</description></code>

<code><name>yarn.resourcemanager.resource-tracker.address</name></code>

<code><value>raspberrypihadoopmaster:8025</value></code>

<code><name>yarn.resourcemanager.scheduler.address</name></code>

<code><value>raspberrypihadoopmaster:8030</value></code>

<code><name>yarn.resourcemanager.address</name></code>

<code><value>raspberrypihadoopmaster:8040</value></code>
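Hadoop's site files wrap each name/value pair in a `<property>` element inside a single `<configuration>` root. As a rough sketch of that layout, the snippet below renders a dictionary into that shape with the standard library. The helper `build_site_xml` is my own illustration, and it only includes the settings whose values appear in the listing above; the memory and vcore values are not shown there, so I have left them out rather than guess.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Render a {name: value} dict as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Only the values that appear in the listing above.
YARN_SITE = {
    "yarn.nodemanager.aux-services": "mapreduce_shuffle",
    "yarn.nodemanager.vmem-check-enabled": "false",
    "yarn.resourcemanager.resource-tracker.address": "raspberrypihadoopmaster:8025",
    "yarn.resourcemanager.scheduler.address": "raspberrypihadoopmaster:8030",
    "yarn.resourcemanager.address": "raspberrypihadoopmaster:8040",
}

if __name__ == "__main__":
    print(build_site_xml(YARN_SITE))
```

Generating the file this way makes it easier to keep the same settings in sync across all three nodes.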

<code>slaves</code>:

<code>raspberrypihadoopmaster</code>

<code>raspberrypihadoopslave1</code>

<code>raspberrypihadoopslave2</code>

<code>core-site.xml</code>:

If everything is working, you should be able to run the following commands on the master node:

<code>start-dfs.sh</code>

<code>start-yarn.sh</code>

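Run them as the Hadoop user once the devices have booted; if you followed the tutorial, that user is <code>hduser</code>.

Next, run <code>hdfs dfsadmin -report</code> to check that all three nodes came up correctly. Make sure you see the line 'Live datanodes (3)':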

<code>Configured Capacity: 93855559680 (87.41 GB)</code>

<code>Present Capacity: 65321992192 (60.84 GB)</code>

<code>DFS Remaining: 62206627840 (57.93 GB)</code>

<code>Under replicated blocks: 0</code>

<code>Blocks with corrupt replicas: 0</code>

<code>Missing blocks: 0</code>

<code>Missing blocks (with replication factor 1): 0</code>

<code>Live datanodes (3):</code>

<code>Name: 192.168.1.51:50010 (raspberrypihadoopslave1)</code>

<code>Hostname: raspberrypihadoopslave1</code>

<code>Decommission Status : Normal</code>
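If you'd rather not scan the report by eye, a short script can pull out the live node count. This is a sketch under the assumption that the report keeps the "Live datanodes (N)" and "Hostname:" lines shown above; `live_datanodes` is a name I made up, and the sample text is an abbreviated copy of my own report.

```python
import re


def live_datanodes(report):
    """Return (count, hostnames) parsed from `hdfs dfsadmin -report` text."""
    match = re.search(r"live datanodes \((\d+)\)", report, re.IGNORECASE)
    count = int(match.group(1)) if match else 0
    hosts = re.findall(r"^hostname:\s*(\S+)", report,
                       re.IGNORECASE | re.MULTILINE)
    return count, hosts


# Abbreviated sample based on the report shown above.
SAMPLE_REPORT = """Configured Capacity: 93855559680 (87.41 GB)
Live datanodes (3):
Name: 192.168.1.51:50010 (raspberrypihadoopslave1)
Hostname: raspberrypihadoopslave1
Decommission Status : Normal
"""

if __name__ == "__main__":
    count, hosts = live_datanodes(SAMPLE_REPORT)
    print(count, hosts)
```

On the master node you could feed it the real output, for example by capturing `hdfs dfsadmin -report` with `subprocess.run(..., capture_output=True, text=True)`.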

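You can now run a few simple 'Hello, World!' style tests, or move straight on to the next step.

YARN stands for Yet Another Resource Negotiator, and it ships with the base Hadoop install as an easy-to-use resource manager.

I personally have been very impressed with Spark, because it supports two languages that many data engineers and data scientists know well: Python and R.

I also created a two-line <code>spark-env.sh</code> file in Spark's conf directory.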

<code>SPARK_MASTER_IP=192.168.1.50</code>

<code>SPARK_WORKER_MEMORY=512m</code>

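(I'm not sure these are even necessary, since we'll be running on YARN.)

In the Hadoop world, 'Hello, World!' is a word count.

I decided to make ours a bit introspective... why not count the most commonly used words on this site? Some big data about the site itself might even turn out to be useful.

If you have a running WordPress blog, you can export its content and sanitize it in two simple steps.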

<code>import bleach</code>

<code># change this next line to your 'import' filename, whatever you would like to strip</code>

<code>ascii_string = open('dqydj_with_tags.txt', 'r').read()</code>

<code>new_string = bleach.clean(ascii_string, tags=[], attributes={}, styles=[], strip=True)</code>

<code>new_string = new_string.encode('utf-8').strip()</code>

<code># change this next line to your 'export' filename</code>

<code>f = open('dqydj_stripped.txt', 'w')</code>

<code>f.write(new_string)</code>

<code>f.close()</code>
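If you don't have bleach installed, a rough alternative is to strip tags with Python's built-in `html.parser`. To be clear, this is a different technique from the bleach call above: it is not a sanitizer, it decodes entities such as `&amp;` instead of keeping them escaped, and it keeps the text inside script and style tags. For a word count that is usually good enough. The names `TagStripper` and `strip_tags` are my own.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the text nodes of an HTML document, discarding all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_text):
    """Return html_text with every tag removed and entities decoded."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()


if __name__ == "__main__":
    print(strip_tags("<p>Hello, <b>world</b> &amp; friends</p>"))
```

This version targets Python 3, unlike the Python 2 style script above.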

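Now we have a smaller file that is suitable for copying to the HDFS cluster running on the Raspberry Pis.

If you didn't do the steps above on the Raspberry Pi master node, find a way to transfer the file over (scp, rsync, and so on), then copy it into HDFS with the following command.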

<code>hdfs dfs -copyFromLocal dqydj_stripped.txt /dqydj_stripped.txt</code>

Now for the final step: writing some code for Apache Spark.

<code>import sys</code>

<code>from stop_words import get_stop_words</code>

<code>from pyspark import SparkContext, SparkConf</code>

<code># create spark context with spark configuration</code>

<code>conf = SparkConf().setAppName("spark count")</code>

<code>sc = SparkContext(conf=conf)</code>

<code># get threshold (optional second argument; defaults to 5)</code>

<code>try:</code>

<code>    threshold = int(sys.argv[2])</code>

<code>except (IndexError, ValueError):</code>

<code>    threshold = 5</code>

<code># read in text file and split each document into words</code>

<code>tokenized = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split(" "))</code>

<code># count the occurrence of each word</code>

<code>wordcounts = tokenized.map(lambda word: (word.lower().strip(), 1)).reduceByKey(lambda v1, v2: v1 + v2)</code>

<code># filter out words with fewer than threshold occurrences</code>

<code>filtered = wordcounts.filter(lambda pair:pair[1] >= threshold)</code>

<code>print "*" * 80</code>

<code>print "printing top words used"</code>

<code>print "-" * 80</code>

<code>filtered_sorted = sorted(filtered.collect(), key=lambda x: x[1], reverse=True)</code>

<code>for (word, count) in filtered_sorted: print "%s : %d" % (word.encode('utf-8').strip(), count)</code>

<code># remove stop words</code>

<code>print "\n\n"</code>

<code>print "printing top non-stop words used"</code>

<code># change this to your language code (see the stop-words documentation)</code>

<code>stop_words = set(get_stop_words('en'))</code>

<code>no_stop_words = filter(lambda x: x[0] not in stop_words, filtered_sorted)</code>

<code>for (word, count) in no_stop_words: print "%s : %d" % (word.encode('utf-8').strip(), count)</code>
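If you want to sanity-check the logic before submitting anything to the cluster, the same pipeline can be mimicked in plain Python on a small sample. This is only a sketch of the idea, not the Spark job itself: `word_counts` is a hypothetical helper, it uses `collections.Counter` in place of `reduceByKey`, and everything runs in one process.

```python
from collections import Counter


def word_counts(lines, threshold=5, stop_words=frozenset()):
    """Count lower-cased, whitespace-stripped words and keep the frequent ones.

    Mirrors the Spark job: split each line on single spaces (flatMap),
    normalize each word (map), sum the counts (reduceByKey), then filter
    by threshold and stop words and sort by descending count.
    """
    counts = Counter()
    for line in lines:
        for word in line.split(" "):
            counts[word.lower().strip()] += 1
    kept = [(w, c) for w, c in counts.items()
            if c >= threshold and w not in stop_words]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    sample = ["Money is money", "the money and the market"]
    print(word_counts(sample, threshold=2, stop_words={"the"}))
```

Once the output looks sensible on a handful of lines, the Spark version should produce the same kind of result on the full file, just spread across the nodes.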

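Save this as wordcount.py and make sure all of the paths above are correct.

Now you're ready for the incantation that gets Spark running on YARN, so you can see which words I use most on DQYDJ.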

<code>/opt/spark-2.0.0-bin-hadoop2.7/bin/spark-submit --master yarn --executor-memory 512m --name wordcount --executor-cores 8 wordcount.py /dqydj_stripped.txt</code>
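Long command lines are easy to mangle when copied from a web page; in particular, each spark-submit option needs two plain ASCII hyphens. As a sketch, here is one way to assemble the same call from Python so the arguments stay explicit. The function `spark_submit_command` and its defaults are my own illustration, with the values taken from the command above.

```python
import shlex


def spark_submit_command(script, input_path,
                         spark_home="/opt/spark-2.0.0-bin-hadoop2.7",
                         executor_memory="512m", executor_cores=8,
                         name="wordcount", threshold=None):
    """Build the argument list for the spark-submit call used in this article."""
    cmd = [
        spark_home + "/bin/spark-submit",
        "--master", "yarn",
        "--executor-memory", executor_memory,
        "--name", name,
        "--executor-cores", str(executor_cores),
        script, input_path,
    ]
    if threshold is not None:
        cmd.append(str(threshold))  # read by wordcount.py as sys.argv[2]
    return cmd


if __name__ == "__main__":
    cmd = spark_submit_command("wordcount.py", "/dqydj_stripped.txt")
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually launch the job: subprocess.run(cmd, check=True)
```

Passing `threshold=10`, for example, appends a third argument that the script reads as its minimum word count.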

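So which words made the list? 'can, will, it's, one, even, like, people, money, don't, also'.

Hey, not bad: 'money' snuck into the top ten. That seems like a good thing for a site dedicated to finance, investing, and economics, right?

Here are the top 50 most frequently used words; use them to draw your own conclusions about the quality of my writing.

I hope you enjoyed this tutorial on Hadoop, YARN, and Apache Spark. You can now go on to write and run other applications on Spark.

What do you think? Are you going to build a Raspberry Pi Hadoop cluster? What would you like to mine with it? Which word above surprised you the most? And why did 'S&P' make the list?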

Originally published: 2017-05-07

This article comes from Linux China (linux中国), a partner of the Yunqi Community.
