Running the MLlib Examples from the Spark Source with IDEA and Maven

2023-03-22 02:35:03

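The Spark source tree ships with a comprehensive set of MLlib usage examples. With IDEA and Maven it is easy to modify and package these examples, then upload them to a Spark client and run them.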

1. Downloading the Spark source

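Open the official download page at http://spark.apache.org/downloads.html in a browser. The same site also hosts the documentation.

On the download page, choose the release you want, set the package type to Source Code, and click the link to the spark package to download it. Unpack the archive once the download finishes.

In the unpacked tree, all Spark example code lives in examples, and the test data the examples use lives in data.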
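
For readers who prefer the command line, the download and unpack step can be sketched as follows. This is a minimal sketch, not part of the original walkthrough: it assumes version 2.0.2 (the version imported later in this post), assumes the source archive is served from the Apache archive under the URL shown, and uses two hypothetical helper names, `fetch_spark_source` and `show_layout`.

```shell
#!/usr/bin/env bash
# Assumption: Spark 2.0.2, the version used later in this post.
SPARK_VERSION="2.0.2"
# Assumption: source archives of past releases are served from archive.apache.org.
SRC_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}.tgz"
# Assumption: the archive unpacks into a directory named after the version.
SRC_DIR="spark-${SPARK_VERSION}"

# Hypothetical helper: download the source archive and unpack it
# in the current directory.
fetch_spark_source() {
  curl -fLO "$SRC_URL" && tar -xzf "spark-${SPARK_VERSION}.tgz"
}

# Hypothetical helper: show the two directories this post relies on.
show_layout() {
  ls -d "$SRC_DIR/examples" "$SRC_DIR/data"
}
```

Nothing runs until the helpers are called, for example `fetch_spark_source && show_layout`.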

2. Importing the source into IDEA as a Maven project

In IDEA, click File > New > Project from Existing Sources.

In the dialog that opens, browse to the directory containing the spark-2.0.2 source, select the pom file in the root directory, and click OK.

On the next page, choose Maven as the project type.

On the following two pages, keep the default settings.

Next, select the Maven projects to import.

Continue to the last step and wait a moment. All modules are then imported, including the Spark examples module used in the rest of this post.

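3. Modifying, packaging, and running the example code

The directory examples/src/main/scala/org/apache/spark/examples/mllib contains the MLlib usage examples provided with Spark.

Take DecisionTreeClassificationExample as an example. It reads sample_libsvm_data.txt, which sits in data/mllib in the source tree. To run the example on a Linux client, change the data path in the code and upload the data file to that path. The path used here is /home/hdp_teu_dpd/user/xyx/spark. The modified code follows: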

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

// scalastyle:off println
package org.apache.spark.examples.mllib

import org.apache.spark.{SparkConf, SparkContext}
// $example on$
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// $example off$

object MyDecisionTreeClassificationExample {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DecisionTreeClassificationExample")
    val sc = new SparkContext(conf)
    // Directory on the Linux client that holds the data file and receives the saved model
    val path = "file:///home/hdp_teu_dpd/user/xyx/spark"

    // $example on$
    // Load and parse the data file.
    // val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    val data = MLUtils.loadLibSVMFile(sc, s"$path/sample_libsvm_data.txt")
    // Split the data into training and test sets (30% held out for testing)
    val splits = data.randomSplit(Array(0.7, 0.3))
    val (trainingData, testData) = (splits(0), splits(1))

    // Train a DecisionTree model.
    //  Empty categoricalFeaturesInfo indicates all features are continuous.
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32

    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

    // Evaluate model on test instances and compute test error
    val labelAndPreds = testData.map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
    println("Test Error = " + testErr)
    println("Learned classification tree model:\n" + model.toDebugString)

    // Save and load model
    // model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
    // val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")

    model.save(sc, s"$path/target/myDecisionTreeClassificationModel")
    val sameModel = DecisionTreeModel.load(sc, s"$path/target/myDecisionTreeClassificationModel")
    // println(sameModel)
    // $example off$

    // Release the SparkContext once the job is done
    sc.stop()
  }
}
// scalastyle:on println

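With the code changed, package the spark-examples_2.11 module: in IDEA's Maven tool window, expand that module's lifecycle and double-click package. Packaging takes a little while.

When packaging finishes, the jar can be found under the target directory of the examples module.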
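
The same packaging step can also be run from a terminal. The sketch below is an assumption rather than the method used in this post: it assumes Maven is on the PATH and that the command runs from the root of the source tree. The helper names `package_examples_cmd` and `find_examples_jar` are hypothetical.

```shell
# Assumption: run from the root of the unpacked source tree.
EXAMPLES_MODULE="examples"

# Print the Maven command instead of running it, so it can be reviewed first.
# -pl selects the module, -am also builds the modules it depends on,
# -DskipTests skips the tests to shorten the build.
package_examples_cmd() {
  echo "mvn -pl ${EXAMPLES_MODULE} -am -DskipTests package"
}

# Hypothetical helper: locate the packaged examples jar below the
# module's target directory, whatever subdirectory the build uses.
find_examples_jar() {
  find "${EXAMPLES_MODULE}/target" -name 'spark-examples_*.jar'
}
```

Run the printed command (for example with `eval "$(package_examples_cmd)"`), then call `find_examples_jar` to see where the jar landed. Building the upstream modules with `-am` can take considerably longer than packaging the examples module alone.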

Upload the resulting jar to /home/hdp_teu_dpd/user/xyx/spark. The jar and the test data now sit in the same directory on the Spark client.
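
One way to do the upload is with scp, sketched below. The host name `spark-client` is a placeholder that does not appear in the original post, and `upload_cmd` is a hypothetical helper; the remote directory and the data file location are the ones used above.

```shell
# Hypothetical host name of the Spark client machine; replace with your own.
CLIENT_HOST="spark-client"
# Directory on the client used throughout this post.
REMOTE_DIR="/home/hdp_teu_dpd/user/xyx/spark"
# Location of the sample data inside the source tree.
DATA_FILE="data/mllib/sample_libsvm_data.txt"

# Print the scp command that copies the jar (first argument) together
# with the data file into the remote directory.
upload_cmd() {
  echo "scp $1 ${DATA_FILE} ${CLIENT_HOST}:${REMOTE_DIR}/"
}
```

Call it with the path to the packaged jar, check the printed command, and then run it.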

From that directory, run the following command:

spark-submit --class org.apache.spark.examples.mllib.MyDecisionTreeClassificationExample spark-examples_2.11-2.0.2.jar
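
The command above leaves the master to whatever the client is configured with. A variant that sets it explicitly might look like the sketch below. The value `local[2]` is an assumption chosen for a quick single-machine test, and `submit_cmd` is a hypothetical helper. Because the example reads from a `file://` path, the data must be readable on every machine that runs the job, which a local master satisfies trivially.

```shell
MAIN_CLASS="org.apache.spark.examples.mllib.MyDecisionTreeClassificationExample"
APP_JAR="spark-examples_2.11-2.0.2.jar"
# Assumption: run locally with two threads. The original command does not
# set a master and relies on the client's defaults.
MASTER="local[2]"

# Print the spark-submit command with an explicit master.
# The master value is quoted so the shell does not treat [2] as a glob.
submit_cmd() {
  echo "spark-submit --master '${MASTER}' --class ${MAIN_CLASS} ${APP_JAR}"
}
```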

When the job finishes, the decision tree model has been written under /home/hdp_teu_dpd/user/xyx/spark/target. How the model is stored on disk still needs further study.
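
A first step in that study is simply to look at what was written. The sketch below assumes the usual layout of saved MLlib models, a `metadata` directory with JSON text part files next to a `data` directory with Parquet files; that layout is an assumption to verify against your own output, and both helper names are hypothetical.

```shell
# Model directory written by the example above.
MODEL_DIR="/home/hdp_teu_dpd/user/xyx/spark/target/myDecisionTreeClassificationModel"

# List every file under the model directory (defaults to MODEL_DIR).
list_model_files() {
  find "${1:-$MODEL_DIR}" -type f | sort
}

# Print the metadata part files, which are assumed to be plain JSON text.
show_model_metadata() {
  cat "${1:-$MODEL_DIR}"/metadata/part-*
}
```

Calling `list_model_files` with no argument inspects the directory from this post; pass another path to inspect a different saved model.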

Reference blog: https://blog.csdn.net/bobozai86/article/details/80346370
