The Spark source ships with a comprehensive set of MLlib usage examples. Using IDEA and Maven, we can easily modify and package these examples, then upload them to a Spark client and run them.
1. Downloading the Spark Source
Open the official download page, http://spark.apache.org/downloads.html, in a browser; the site also hosts the documentation and other resources.
As shown in the figure below, pick the version you want, set the package type to Source Code, and click the Spark package link to download it. Extract the archive once the download finishes.
In the extracted tree, all of the Spark example code is under examples, and the test data the examples use is under data.
2. Importing into IDEA as a Maven Project
In IDEA, click File > New > Project from Existing Sources.
In the dialog that opens, browse to the directory holding the spark-2.0.2 source, select the pom file in the project root, and click OK.
Next, choose to import the project as a Maven project.
Accept the default settings on the next two screens, then select the Maven project to import.
After a short wait, you will be pleased to see that all of the modules have been imported. The arrow in the figure points to the spark example module.
3. Modifying, Packaging, and Running the Example Code
The directory examples/src/main/scala/org/apache/spark/examples/mllib holds the official MLlib usage examples.
We will take DecisionTreeClassificationExample as our example. It reads sample_libsvm_data.txt from the data directory in the project root. To run it on a Linux client, the data path must be changed and the data file uploaded to that path; here I use /home/hdp_teu_dpd/user/xyx/spark. The modified code is as follows:
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
// scalastyle:off println
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// $example off$

object MyDecisionTreeClassificationExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DecisionTreeClassificationExample")
    val sc = new SparkContext(conf)
    // Data directory on the Linux client
    val path = "file:///home/hdp_teu_dpd/user/xyx/spark"
    // $example on$
    // Load and parse the data file.
    // val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val data = MLUtils.loadLibSVMFile(sc, s"$path/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a DecisionTree model.
    // Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification tree model:\n" + model.toDebugString)

    // Save and load model
    // model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
    // val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")
    model.save(sc, s"$path/target/myDecisionTreeClassificationModel")
    val sameModel = DecisionTreeModel.load(sc, s"$path/target/myDecisionTreeClassificationModel")
    // println(sameModel)
    // $example off$

    sc.stop()
  }
}
// scalastyle:on println
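Two details of the example are worth unpacking: the test error is simply the fraction of (label, prediction) pairs that disagree, and impurity = "gini" tells the trainer to score candidate splits by Gini impurity. Both ideas can be sketched in plain Scala without Spark; the sample values below are made up purely for illustration:

```scala
object MllibSketch {
  // Fraction of test points whose prediction differs from the true label,
  // mirroring labelAndPreds.filter(r => r._1 != r._2).count / testData.count.
  def testError(labelAndPreds: Seq[(Double, Double)]): Double =
    labelAndPreds.count { case (label, pred) => label != pred }.toDouble / labelAndPreds.size

  // Gini impurity of a set of class labels: 1 - sum over classes of p_c^2.
  // 0.0 means the node is pure; 0.5 is the worst case for two classes.
  def gini(labels: Seq[Double]): Double = {
    val n = labels.size.toDouble
    1.0 - labels.groupBy(identity).values.map(g => math.pow(g.size / n, 2)).sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical predictions: one mistake out of four test points.
    println(testError(Seq((1.0, 1.0), (0.0, 0.0), (1.0, 0.0), (0.0, 0.0)))) // 0.25
    println(gini(Seq(1.0, 1.0, 1.0, 1.0))) // 0.0 (pure node)
    println(gini(Seq(1.0, 1.0, 0.0, 0.0))) // 0.5 (evenly mixed)
  }
}
```

The decision tree greedily picks, at each node, the split that most reduces this impurity, up to maxDepth levels, binning continuous features into at most maxBins candidate thresholds.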
With the code changes done, package the spark-examples_2.11 module as shown in the figure below: in the Maven panel, double-click package and wait for the build to finish.
When packaging completes, the jar can be found at the location shown below:
Upload the resulting jar to /home/hdp_teu_dpd/user/xyx/spark, so that the jar and the test data now sit in the same directory on the Spark client.
From that directory, run the following command:
spark-submit --class org.apache.spark.examples.mllib.MyDecisionTreeClassificationExample spark-examples_2.11-2.0.2.jar
When the job finishes, the trained decision tree model is written under /home/hdp_teu_dpd/user/xyx/spark/target. The on-disk format of the saved model still needs further study.
Reference: https://blog.csdn.net/bobozai86/article/details/80346370