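Study notes for the Developer Academy course "Big Data Real-Time Computing Architecture: Spark Quick Start: Runtime_Program Scheduling_1". The notes follow the course closely so that readers can pick up the material quickly.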
Course URL:
https://developer.aliyun.com/learning/course/100/detail/1650
Runtime_Program Scheduling_1
Internally, each RDD is characterized by five main properties:
- A list of partitions
- A function for computing each split
- A list of dependencies on other RDDs
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
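The five properties listed above can be inspected on a live RDD. The following is a minimal sketch, assuming Spark core is on the classpath and a local master is acceptable; the object name, app name, and sample pairs are hypothetical and only serve the illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object RddPropertiesSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: local master with 2 threads, for a standalone run.
    val conf = new SparkConf().setMaster("local[2]").setAppName("rdd-properties-sketch")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical key-value data, hash-partitioned into 2 partitions.
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
        .partitionBy(new HashPartitioner(2))
      val doubled = pairs.mapValues(_ * 2)

      // 1. A list of partitions.
      println(s"number of partitions: ${doubled.partitions.length}")
      // 2. The function for computing each split is RDD.compute(split, context).
      //    Spark calls it inside tasks, so it is not invoked directly here.
      // 3. A list of dependencies on other RDDs.
      println(s"dependencies: ${doubled.dependencies.map(_.getClass.getSimpleName).mkString(", ")}")
      // 4. Optionally, a Partitioner (mapValues keeps the parent's partitioner).
      println(s"partitioner: ${doubled.partitioner}")
      // 5. Optionally, preferred locations for each split.
      doubled.partitions.foreach { p =>
        println(s"preferred locations of partition ${p.index}: ${doubled.preferredLocations(p)}")
      }
    } finally sc.stop()
  }
}
```

For an RDD read from HDFS, the preferred locations would normally name the hosts holding each block; for in-memory sample data like this there is no such locality information to report.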
Spark runtime
Flow overview
Distributed file system (File system) -- loads the dataset
Transformations are executed lazily -- they are operations on RDDs
An Action triggers execution
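The difference between a lazy transformation and an action can be observed directly. The sketch below is an illustration under assumptions (Spark 2.0 or later for longAccumulator, a local master, hypothetical object and accumulator names): it counts how many times the function passed to map has run, before and after an action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls seen before the action, map calls seen after the action).
  def run(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("map calls")
    // Transformation: only recorded in the lineage, nothing is computed yet.
    val doubled = sc.parallelize(1 to 5, 2).map { x => calls.add(1L); x * 2 }
    val before: Long = calls.value
    // Action: triggers the job, so the map function now runs on every element.
    doubled.count()
    val after: Long = calls.value
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master with 2 threads, for a standalone run.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch"))
    try {
      val (before, after) = run(sc)
      println(s"map calls before action: $before, after action: $after")
    } finally sc.stop()
  }
}
```

Accumulators updated inside transformations can be applied more than once if a task is re-executed, so the counter is a teaching device here rather than a reliable metric.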
Code example
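// sc is assumed to be the SparkContext
val lines = sc.textFile("hdfs://...")
// The file is loaded and becomes an RDD
val errors = lines.filter(_.startsWith("ERROR"))
// Transformation
errors.persist()
// Caches the RDD
val mysql_errors = errors.filter(_.contains("MySQL")).count()
// Action, triggers execution
val http_errors = errors.filter(_.contains("Http")).count()
// Action, triggers execution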
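To see what persist() buys in the example above, the sketch below runs the same two counts with and without caching and records how often the ERROR filter is evaluated. It is a rough illustration under assumptions: Spark 2.0 or later, a local master, and hypothetical sample log lines standing in for the HDFS file; the object and helper names are invented for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PersistSketch {
  // Hypothetical log lines standing in for the HDFS file.
  val sampleLines: Seq[String] = Seq(
    "ERROR MySQL connection refused",
    "INFO service started",
    "ERROR Http 500 returned",
    "ERROR MySQL query timeout")

  // Returns (MySQL error count, Http error count, times the ERROR filter ran).
  def run(sc: SparkContext, cache: Boolean): (Long, Long, Long) = {
    val filterCalls = sc.longAccumulator("ERROR filter calls")
    val lines = sc.parallelize(sampleLines, 2)
    val errors = lines.filter { l => filterCalls.add(1L); l.startsWith("ERROR") }
    if (cache) errors.persist()
    // Two actions over the same RDD.
    val mysqlErrors = errors.filter(_.contains("MySQL")).count()
    val httpErrors = errors.filter(_.contains("Http")).count()
    val callsSeen: Long = filterCalls.value
    (mysqlErrors, httpErrors, callsSeen)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master with 2 threads, for a standalone run.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))
    try {
      println(s"with persist:    ${run(sc, cache = true)}")
      println(s"without persist: ${run(sc, cache = false)}")
    } finally sc.stop()
  }
}
```

With caching, the second action reads the already filtered partitions from memory, so the filter runs once per input line; without it, each action recomputes errors from lines. On a normal local run with no task retries this shows up as 4 filter calls versus 8.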