The Spark source tree ships with a comprehensive set of MLlib usage examples. With IDEA and Maven it is easy to modify and package these examples, then upload them to a Spark client for execution.
1. Downloading the Spark Source Code
Open the downloads page at http://spark.apache.org/downloads.html in a browser; the site also hosts the project documentation.
Choose the release you want, set the package type to Source Code, and click the spark package link to download. Extract the archive once the download finishes.
(Screenshot: the Spark download page, with package type set to Source Code)
In the extracted tree, all of the Spark example code lives under examples, and the test data it uses lives under data.
2. Importing the Source into IDEA as a Maven Project
In IDEA, click File > New > Project from Existing Sources.
In the dialog that pops up, navigate to the directory containing the spark-2.0.2 source, select the pom file in the root directory, and click OK.
Next, choose Maven as the project type.
On the following wizard screens, accept the default settings.
Next, select the Maven project to import.
Continue, wait a moment, and you will be pleased to find that all of the modules have been imported, including the spark examples module we need.
3. Modifying, Packaging, and Running the Example Code
Everything under examples/src/main/scala/org/apache/spark/examples/mllib is an official usage example for MLlib.
We will take the DecisionTreeClassificationExample example. It reads sample_libsvm_data.txt from the data directory at the root of the source tree. To run it from a Linux client, the data path in the code has to be changed and the data file uploaded to that path; here I use /home/hdp_teu_dpd/user/xyx/spark. The modified code follows:
```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// $example off$

object MyDecisionTreeClassificationExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DecisionTreeClassificationExample")
    val sc = new SparkContext(conf)
    // Data directory on the Linux client
    val path = "file:///home/hdp_teu_dpd/user/xyx/spark"
    // $example on$
    // Load and parse the data file.
    // val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val data = MLUtils.loadLibSVMFile(sc, s"$path/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a DecisionTree model.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification tree model:\n" + model.toDebugString)

    // Save and load model
    // model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
    // val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
    model.save(sc, s"$path/target/myDecisionTreeClassificationModel")
    val sameModel = DecisionTreeModel.load(sc, s"$path/target/myDecisionTreeClassificationModel")
    // println(sameModel)
    // $example off$

    sc.stop()
  }
}
// scalastyle:on println
```
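For reference, the MLUtils.loadLibSVMFile call above expects LIBSVM-formatted text: one example per line, a label followed by sparse one-based index:value pairs. A quick illustration with a made-up line (not taken from the actual sample_libsvm_data.txt):

```shell
# LIBSVM format: <label> <index1>:<value1> <index2>:<value2> ...
# The line below is illustrative only, not from the real data file.
echo "0 128:51 129:159 130:253" | awk '{print "label:", $1, "nonzero features:", NF - 1}'
```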
With the code changes in place, package the spark-examples_2.11 module: in IDEA's Maven tool window, double-click the package goal and wait for the build to finish.
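If you prefer the command line over the IDE, roughly the same build can be done with Maven directly. This is only a sketch, run from the spark-2.0.2 source root; the module selector matches the spark-examples_2.11 artifact, but flags may need adjusting for your environment:

```shell
# Sketch: build only the examples module (plus its dependencies), skipping tests.
# Run from the root of the extracted spark-2.0.2 source tree.
./build/mvn -pl :spark-examples_2.11 -am -DskipTests package
```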
When packaging completes, the jar is written under the module's target directory.
Upload the resulting jar to /home/hdp_teu_dpd/user/xyx/spark. The jar and the test data now sit in the same directory on the Spark client.
From that directory, run:

```shell
spark-submit --class org.apache.spark.examples.mllib.MyDecisionTreeClassificationExample spark-examples_2.11-2.0.2.jar
```
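The bare spark-submit above relies on the client's default configuration. A fuller invocation might look like the sketch below; the --master value and resource numbers are assumptions to be adapted to your cluster:

```shell
# Sketch: explicit master and resources; all values are placeholders.
spark-submit \
  --class org.apache.spark.examples.mllib.MyDecisionTreeClassificationExample \
  --master yarn \
  --deploy-mode client \
  --executor-memory 2g \
  --num-executors 4 \
  spark-examples_2.11-2.0.2.jar
```

Note that the code reads a file:/// path, so on YARN every executor node must be able to see that local path; for a quick single-machine test, --master "local[*]" avoids the issue.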
When the job finishes, the trained decision tree model appears under /home/hdp_teu_dpd/user/xyx/spark/target. The storage format of the saved model deserves further study.
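As a starting point for that study: model.save in this MLlib version writes a directory rather than a single file, which as I understand it holds a metadata part (JSON) and a data part (Parquet). A sketch of how to peek at it on the client:

```shell
# Sketch: inspect the saved model directory on the client machine.
# Expected (per MLlib's save format): a metadata/ part with JSON describing the
# model class and parameters, and a data/ part with the tree nodes as Parquet.
ls -R /home/hdp_teu_dpd/user/xyx/spark/target/myDecisionTreeClassificationModel
```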
Reference: https://blog.csdn.net/bobozai86/article/details/80346370