
spark-submit in Spark

# Run on a YARN cluster
# (--deploy-mode can be client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

Some of the commonly used options are:

  • --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
  • --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
  • --deploy-mode: Whether to launch the driver on a node inside the cluster (cluster) or externally as a client (client) (the default)
  • --conf: Spark configuration as "key=value" pairs
  • application-jar: A jar bundling your application together with its dependencies; the path must be visible to the whole cluster, for example an hdfs:// path or a file:// path that exists on every node.
  • application-arguments: Arguments passed to the main method
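
As a sketch of how --conf and application arguments combine, a client-mode submission might look like the following; the configuration value and the argument 100 are illustrative choices, not part of the example above:

# Client mode, with one extra configuration value and one application argument
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails" \
  /path/to/examples.jar \
  100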

yarn

Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The configuration files are picked up from HADOOP_CONF_DIR or YARN_CONF_DIR.

yarn-client

Equivalent to yarn with --deploy-mode client, which is preferred over `yarn-client`.

yarn-cluster

Equivalent to yarn with --deploy-mode cluster, which is preferred over `yarn-cluster`.
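
In other words, a job that used to be submitted with the old master values can be written with the explicit flags instead; the class name and jar path below are placeholders of my own:

# Deprecated form:
#   ./bin/spark-submit --master yarn-cluster --class com.example.MyApp /path/to/app.jar
# Preferred form:
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  /path/to/app.jar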

In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
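
As a rough illustration, the defaults file (conf/spark-defaults.conf) could carry entries such as the following; the property names are real Spark settings, but the values are example choices, not Spark's shipped defaults. spark-submit falls back to them whenever the matching flag is omitted:

# conf/spark-defaults.conf (example values only)
spark.master              yarn
spark.submit.deployMode   client
spark.executor.memory     4g
spark.serializer          org.apache.spark.serializer.KryoSerializer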

So you may omit --master from the spark-submit command, and --deploy-mode likewise has a default value; these defaults are read from conf/spark-defaults.conf.

More advanced settings:

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. Spark uses the following URL schemes to allow different strategies for disseminating jars:

  • file: - Absolute paths and file:/ URIs are served by the driver's HTTP file server, and every executor pulls the file from the driver's HTTP server.
  • hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
  • local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
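
A sketch of mixing these schemes in one --jars list; the jar names, the namenode address, and the /opt paths are made up for illustration:

# dep1.jar is fetched from HDFS, dep2.jar is served by the driver's HTTP file
# server, and dep3.jar is assumed to already exist at that path on every worker
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --jars hdfs://namenode:8020/libs/dep1.jar,file:///opt/libs/dep2.jar,local:/opt/installed/dep3.jar \
  /path/to/examples.jar \
  1000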

Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically, and with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property. Users may also include any other dependencies by supplying a comma-delimited list of Maven coordinates with --packages. All transitive dependencies will be handled when using this command. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag --repositories. These commands can be used with pyspark, spark-shell, and spark-submit to include Spark Packages. For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.
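
For instance, pulling a dependency by its Maven coordinates while also shipping local Python files could look like the sketch below; the coordinates, repository URL, and file names are placeholders rather than anything from this document:

# Resolve the named package (and its transitive dependencies) at submit time,
# search one extra repository, and distribute local Python code to the executors
./bin/spark-submit \
  --packages com.example:example-lib_2.11:1.0.0 \
  --repositories https://repo.example.com/maven2 \
  --py-files deps.zip,helper.py \
  my_app.py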
