The Spark source ships with a comprehensive set of MLlib usage examples. Using IDEA and Maven, we can easily modify and package these examples, then upload them to a Spark client and run them.
1. Downloading the Spark Source
Open the official download page, http://spark.apache.org/downloads.html, in a browser; the site also hosts the documentation and other resources.
As shown in the figure below, pick the version you want, set the package type to Source Code, and click the Spark package link to download it. Extract the archive once the download finishes.
In the extracted tree, all of the Spark example code is under examples, and the test data the examples use is under data.
2. Importing into IDEA as a Maven Project
In IDEA, click File > New > Project from Existing Sources.
In the dialog that opens, browse to the directory holding the spark-2.0.2 source, select the pom file in the project root, and click OK.
Next, choose to import the project as a Maven project.
Accept the default settings on the next two screens, then select the Maven project to import.
After a short wait, you will be pleased to see that all of the modules have been imported. The arrow in the figure points to the spark example module.
3. Modifying, Packaging, and Running the Example Code
The directory examples/src/main/scala/org/apache/spark/examples/mllib holds the official MLlib usage examples.
We will take DecisionTreeClassificationExample as our example. It reads sample_libsvm_data.txt from the data directory in the project root. To run it on a Linux client, the data path must be changed and the data file uploaded to that path; here I use /home/hdp_teu_dpd/user/xyx/spark. The modified code is as follows:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// scalastyle:off println
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// $example off$

object MyDecisionTreeClassificationExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DecisionTreeClassificationExample")
    val sc = new SparkContext(conf)
    // Data directory on the Linux client
    val path = "file:///home/hdp_teu_dpd/user/xyx/spark"
    // $example on$
    // Load and parse the data file.
    // val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val data = MLUtils.loadLibSVMFile(sc, s"$path/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a DecisionTree model.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification tree model:\n" + model.toDebugString)

    // Save and load model
    // model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
    // val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
    model.save(sc, s"$path/target/myDecisionTreeClassificationModel")
    val sameModel = DecisionTreeModel.load(sc, s"$path/target/myDecisionTreeClassificationModel")
    // println(sameModel)
    // $example off$

    sc.stop()
  }
}
// scalastyle:on println
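Two details of the example are worth unpacking: the test error is simply the fraction of (label, prediction) pairs that disagree, and impurity = "gini" tells the trainer to score candidate splits by Gini impurity. Both ideas can be sketched in plain Scala without Spark; the sample values below are made up purely for illustration:

```scala
object MllibSketch {
  // Fraction of test points whose prediction differs from the true label,
  // mirroring labelAndPreds.filter(r => r._1 != r._2).count / testData.count.
  def testError(labelAndPreds: Seq[(Double, Double)]): Double =
    labelAndPreds.count { case (label, pred) => label != pred }.toDouble / labelAndPreds.size

  // Gini impurity of a set of class labels: 1 - sum over classes of p_c^2.
  // 0.0 means the node is pure; 0.5 is the worst case for two classes.
  def gini(labels: Seq[Double]): Double = {
    val n = labels.size.toDouble
    1.0 - labels.groupBy(identity).values.map(g => math.pow(g.size / n, 2)).sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical predictions: one mistake out of four test points.
    println(testError(Seq((1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)))) // 0.25
    println(gini(Seq(1.0, 1.0, 1.0, 1.0))) // 0.0 (pure node)
    println(gini(Seq(1.0, 1.0, 0.0, 0.0))) // 0.5 (evenly mixed)
  }
}
```

The decision tree greedily picks, at each node, the split that most reduces this impurity, up to maxDepth levels, binning continuous features into at most maxBins candidate thresholds.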
With the code changes done, package the spark-examples_2.11 module as shown in the figure below: in the Maven panel, double-click package and wait for the build to finish.
When packaging completes, the jar can be found at the location shown below:
Upload the resulting jar to /home/hdp_teu_dpd/user/xyx/spark, so that the jar and the test data now sit in the same directory on the Spark client.
From that directory, run the following command:
spark-submit --class org.apache.spark.examples.mllib.MyDecisionTreeClassificationExample spark-examples_2.11-2.0.2.jar
When the job finishes, the trained decision tree model is written under /home/hdp_teu_dpd/user/xyx/spark/target. The on-disk format of the saved model still needs further study.
Reference: https://blog.csdn.net/bobozai86/article/details/80346370