
spark-submit in Spark

# Run on a YARN cluster
# (--deploy-mode can be client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

Some of the commonly used options are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to launch the driver on a node inside the cluster (cluster) or externally as a client (client) (the default)
  • --conf: Spark configuration as "key=value" pairs
  • application-jar: A jar bundling your application together with its dependencies; the path must be visible to the whole cluster, for example an hdfs:// path or a file:// path that exists on every node.
  • application-arguments: Arguments passed to the main method
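
As a sketch of how --conf and application arguments combine, a client-mode submission might look like the following; the configuration value and the argument 100 are illustrative choices, not part of the example above:

# Client mode, with one extra configuration value and one application argument
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails" \
  /path/to/examples.jar \
  100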

yarn

Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The configuration files are picked up from HADOOP_CONF_DIR or YARN_CONF_DIR.

yarn-client

Equivalent to yarn with --deploy-mode client, which is preferred over `yarn-client`.

yarn-cluster

Equivalent to yarn with --deploy-mode cluster, which is preferred over `yarn-cluster`.
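
In other words, a job that used to be submitted with the old master values can be written with the explicit flags instead; the class name and jar path below are placeholders of my own:

# Deprecated form:
#   ./bin/spark-submit --master yarn-cluster --class com.example.MyApp /path/to/app.jar
# Preferred form:
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/app.jar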

In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
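
As a rough illustration, the defaults file (conf/spark-defaults.conf) could carry entries such as the following; the property names are real Spark settings, but the values are example choices, not Spark's shipped defaults. spark-submit falls back to them whenever the matching flag is omitted:

# conf/spark-defaults.conf (example values only)
spark.master              yarn
spark.submit.deployMode   client
spark.executor.memory     4g
spark.serializer          org.apache.spark.serializer.KryoSerializer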

So you may omit --master from the spark-submit command, and --deploy-mode likewise has a default value; these defaults are read from conf/spark-defaults.conf.

More advanced settings:

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL schemes to allow different strategies for disseminating jars:

  • file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver's HTTP server.
  • hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
  • local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
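
A sketch of mixing these schemes in one --jars list; the jar names, the namenode address, and the /opt paths are made up for illustration:

# dep1.jar is fetched from HDFS, dep2.jar is served by the driver's HTTP file
# server, and dep3.jar is assumed to already exist at that path on every worker
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --jars hdfs://namenode:8020/libs/dep1.jar,file:///opt/libs/dep2.jar,local:/opt/installed/dep3.jar \
  /path/to/examples.jar \
  1000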

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property. Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with --packages. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag --repositories. These commands can be used with pyspark, spark-shell, and spark-submit to include Spark Packages. For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.
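
For instance, pulling a dependency by its Maven coordinates while also shipping local Python files could look like the sketch below; the coordinates, repository URL, and file names are placeholders rather than anything from this document:

# Resolve the named package (and its transitive dependencies) at submit time,
# search one extra repository, and distribute local Python code to the executors
./bin/spark-submit \
  --packages com.example:example-lib_2.11:1.0.0 \
  --repositories https://repo.example.com/maven2 \
  --py-files deps.zip,helper.py \
  my_app.py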
