SparkSession - The Entry Point to Spark SQL - walkwalkwalk

SparkSession - The Entry Point to Spark SQL

Translated from: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-SparkSession.html

Overview

SparkSession is the entry point to Spark SQL. It is the first object you create when writing a Spark SQL application with Dataset or DataFrame.

Note: In Spark 2.0, SparkSession merged SQLContext and HiveContext.

You create a SparkSession instance through SparkSession.builder and stop it with the stop method.

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
  .appName("My Spark Application")  // optional and will be autogenerated if not specified
  .master("local[*]")               // avoid hardcoding the deployment environment
  .enableHiveSupport()              // self-explanatory, isn't it?
  .config("spark.sql.warehouse.dir", "target/spark-warehouse")
  .getOrCreate()
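The stop method mentioned above is easy to forget when a job fails midway. The following is a minimal sketch, written as top-level statements in the spark-shell style, that wraps the work in try/finally so the session is always stopped. The application name and the range(100) workload are placeholders chosen for illustration, not part of the original article.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical application name; local[*] is used only so the sketch runs on one machine.
val spark: SparkSession = SparkSession.builder()
  .appName("lifecycle-sketch")
  .master("local[*]")
  .getOrCreate()

val rowCount: Long =
  try {
    spark.range(100).count()   // placeholder for any work that needs the session
  } finally {
    spark.stop()               // stops the session and its underlying SparkContext
  }

// After stop, the SparkContext reports that it has been stopped.
val stopped: Boolean = spark.sparkContext.isStopped
println(s"rows counted: $rowCount, context stopped: $stopped")
```

Putting stop in the finally branch keeps cluster resources from lingering when the work inside the try block throws.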

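A single Spark application can use multiple SparkSessions, which lets you keep groups of relational entities isolated from one another, one group per SparkSession (see the catalog attribute).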

scala> spark.catalog.listTables.show
+------------------+--------+-----------+---------+-----------+
|              name|database|description|tableType|isTemporary|
+------------------+--------+-----------+---------+-----------+
|my_permanent_table| default|       null|  MANAGED|      false|
|              strs|    null|       null|TEMPORARY|       true|
+------------------+--------+-----------+---------+-----------+
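To make the isolation concrete, here is a small sketch, assuming a local session without Hive support. It registers a temporary view in one session and checks the catalog of a second session created with newSession. The view name nums and the helper tableNames are made up for this example.

```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .appName("session-isolation-sketch")   // hypothetical name
  .master("local[*]")
  .getOrCreate()
val other: SparkSession = spark.newSession()

// Register a temporary view in the first session only ("nums" is a made-up name).
spark.range(3).createOrReplaceTempView("nums")

// Hypothetical helper: the names of all tables and views a session's catalog lists.
def tableNames(session: SparkSession): Set[String] =
  session.catalog.listTables().collect().map(_.name).toSet

val inFirst: Boolean = tableNames(spark).contains("nums")
val inOther: Boolean = tableNames(other).contains("nums")
println(s"first session sees nums: $inFirst, second session sees nums: $inOther")

spark.stop()
```

Temporary views belong to the session that created them, so the second session's catalog does not list nums.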

Internally, a SparkSession holds a SparkContext, a SharedState, and a SessionState. The following table outlines what each of them does:

Name	Type	Description
sparkContext	SparkContext	The main entry point to Spark functionality. Use it to create RDDs, accumulators, and broadcast variables on the cluster.
existingSharedState	Option[SharedState]	An internal class that holds the state shared across sessions
parentSessionState	Option[SessionState]	The state of the parent session, to be copied

The following table lists the methods of SparkSession, including those for creating Datasets, DataFrames, streaming queries, and so on.

Method	Description
builder	"Opens" a builder to get or create a SparkSession instance
version	Returns the current version of Spark.
implicits	Use import spark.implicits._ to import the implicit conversions and create Datasets from (almost arbitrary) Scala objects.
emptyDataset[T]	Creates an empty Dataset[T].
range	Creates a Dataset[Long].
sql	Executes a SQL query (and returns a DataFrame).
udf	Access to user-defined functions (UDFs).
table	Creates a DataFrame from a table.
catalog	Access to the catalog of the entities of structured queries
read	Access to DataFrameReader to read a DataFrame from external files and storage systems.
conf	Access to the current runtime configuration.
readStream	Access to DataStreamReader to read streaming datasets.
streams	Access to StreamingQueryManager to manage structured streaming queries.
newSession	Creates a new SparkSession.
stop	Stops the SparkSession.
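A few of the methods in the table can be exercised together. The sketch below is illustrative only: the view name numbers, the UDF name plusOne, and the shuffle-partition value are assumptions picked for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .appName("session-methods-sketch")   // hypothetical name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._               // provides the encoder needed by emptyDataset[String]

// range: a Dataset of the ids 0, 1, 2, 3, 4 in a column named "id".
val numbers = spark.range(5)
// emptyDataset: a typed Dataset with no rows.
val emptyCount: Long = spark.emptyDataset[String].count()

// sql and table: query a temporary view ("numbers" is a made-up name).
numbers.createOrReplaceTempView("numbers")
val total: Long = spark.sql("SELECT sum(id) FROM numbers").head().getLong(0)
val viaTable: Long = spark.table("numbers").count()

// udf: register a function and call it from SQL ("plusOne" is a made-up name).
spark.udf.register("plusOne", (x: Long) => x + 1)
val maxPlusOne: Long = spark.sql("SELECT max(plusOne(id)) FROM numbers").head().getLong(0)

// conf: set and read back a runtime configuration value.
spark.conf.set("spark.sql.shuffle.partitions", "4")
val partitions: String = spark.conf.get("spark.sql.shuffle.partitions")

println(s"Spark ${spark.version}: total=$total, rows=$viaTable, maxPlusOne=$maxPlusOne, partitions=$partitions")
spark.stop()
```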

Builder

Builder is the builder class for SparkSession. Through Builder you can set various configuration options.

Builder offers the following methods:

Method	Description
getOrCreate	Gets an existing SparkSession or creates a new one
enableHiveSupport	Enables Hive support
appName	Sets the application name
config	Sets a configuration option

import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
  .appName("My Spark Application")  // optional and will be autogenerated if not specified
  .master("local[*]")               // avoid hardcoding the deployment environment
  .enableHiveSupport()              // self-explanatory, isn't it?
  .getOrCreate()
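The config and getOrCreate methods can be seen at work in a short sketch. It assumes a local master and omits Hive support so that it runs without Hive classes on the classpath. The application name and the partition count are arbitrary values for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .appName("builder-config-sketch")               // hypothetical name
  .master("local[2]")
  .config("spark.sql.shuffle.partitions", "8")    // arbitrary value for the example
  .getOrCreate()

// The option passed to config is visible through the runtime configuration.
val partitions: String = spark.conf.get("spark.sql.shuffle.partitions")

// A second getOrCreate call hands back the existing session instead of building a new one.
val again: SparkSession = SparkSession.builder().getOrCreate()
val sameInstance: Boolean = again eq spark

println(s"shuffle partitions: $partitions, same session returned: $sameInstance")
spark.stop()
```

Because getOrCreate reuses an existing session, options set on a later builder apply to that session rather than producing a fresh one.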

SharedState

SharedState is an internal class of SparkSession that holds the state shared among the active sessions. The following table describes its attributes.

Name	Type	Description
cacheManager	CacheManager	A support class for SQLContext that automatically caches query results so that queries can reuse them during execution
externalCatalog	ExternalCatalog	Stores the catalog of external systems
globalTempViewManager	GlobalTempViewManager	A thread-safe class that manages global temp views and provides atomic operations such as create, update, and remove
jarClassLoader	NonClosableMutableURLClassLoader	Loads the jars added by the user
listener	SQLListener	A listener class
sparkContext	SparkContext	The core entry point of Spark
warehousePath	String	The warehouse location, set through spark.sql.warehouse.dir or through hive.metastore.warehouse.dir in hive-site.xml. The Spark setting overrides the Hive one.

SharedState takes a SparkContext as its constructor argument. If hive-site.xml is found on the CLASSPATH, SharedState adds it to the Hadoop configuration of that SparkContext.
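SharedState itself is internal, but its effect can be observed from the public API. The sketch below assumes a local session with the default global temp database name, global_temp. It shows two sessions using the same SparkContext and a global temp view created in one session being readable from the other. The view name shared_nums is made up for this example.

```scala
import org.apache.spark.sql.SparkSession

val first: SparkSession = SparkSession.builder()
  .appName("shared-state-sketch")   // hypothetical name
  .master("local[*]")
  .getOrCreate()
val second: SparkSession = first.newSession()

// Both sessions are backed by the same SparkContext.
val sameContext: Boolean = first.sparkContext eq second.sparkContext

// A global temp view created through the first session ("shared_nums" is a made-up name)...
first.range(3).createOrReplaceGlobalTempView("shared_nums")

// ...can be queried from the second session through the global_temp database.
val seenFromSecond: Long =
  second.sql("SELECT count(*) FROM global_temp.shared_nums").head().getLong(0)

println(s"same SparkContext: $sameContext, rows seen from second session: $seenFromSecond")
first.stop()
```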

Setting

log4j.logger.org.apache.spark.sql.internal.SharedState=INFO

shows the corresponding log messages.
