https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
http://www.slideshare.net/databricks/a-deep-dive-into-structured-streaming
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
You can express your streaming computation the same way you would express a batch computation on static data.
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and write ahead logs.
In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
You can use the DataFrame interface to run SQL just as you would on a static data source; these queries run on the same optimized Spark SQL engine as batch jobs.
Exactly-once fault tolerance is guaranteed through checkpointing and write-ahead logs.
The only real change is that the DStream abstraction is swapped for a DataFrame, i.e. a table.
This enables structured operations, and the code is essentially the same as for batch data; as you can see below, the difference is small.
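A minimal Scala sketch of that symmetry (the path /data/events is a placeholder): the streaming read mirrors the batch read, with read simply becoming readStream.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("StructuredStreamingDemo")
  .getOrCreate()

// Batch: a bounded DataFrame from static JSON files.
val staticDf = spark.read.json("/data/events")

// Streaming: the same code shape, but the DataFrame is an unbounded table
// to which new rows are continuously appended. (Streaming file sources need
// an explicit schema, so we reuse the one inferred from the static data.)
val streamingDf = spark.readStream
  .schema(staticDf.schema)
  .json("/data/events")
```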
The whole process looks like this:
Note that the output mode here is complete: because the query aggregates, every trigger has to re-emit the statistics accumulated until now.
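A sketch of such a running-count pipeline, along the lines of the word-count example in the programming guide (host and port are placeholders):

```scala
import org.apache.spark.sql.functions.{col, explode, split}

// Lines arriving on a socket form an unbounded input table.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// A normal batch-style aggregation, applied to the unbounded table.
val wordCounts = lines
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()

// Because the query aggregates, complete mode re-emits the whole updated
// result table (the counts "until now") on every trigger.
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```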
The output modes are as follows:
The "output" is defined as what gets written out to the external storage. The output can be defined in different modes:
Complete mode - the entire updated result table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
Append mode - only the new rows appended in the result table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the result table are not expected to change.
Update mode - only the rows that were updated in the result table since the last trigger will be written to the external storage (not available yet in Spark 2.0). Note that this is different from the Complete mode in that this mode does not output the rows that are not changed.
Complete mode was already shown in the example above.
Append mode emits only the increment on each trigger, which suits queries without aggregation, as sketched below.
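For contrast, a sketch of append mode on a query without aggregation (the level column is hypothetical): each input row maps to at most one output row, so rows already in the result table never change and emitting only the increment is safe.

```scala
import org.apache.spark.sql.functions.col

// A plain filter: existing rows in the result table never change afterwards.
val errors = streamingDf.filter(col("level") === "ERROR")

val appendQuery = errors.writeStream
  .outputMode("append")   // write only the rows added since the last trigger
  .format("console")
  .start()
```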
Window Operations on Event Time
Spark regards event time as naturally supported: it is just another column in the DataFrame, and you simply group by it.
Late data can also be handled, because the results are output incrementally.
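A sketch of a windowed count, assuming a hypothetical timestamp column eventTime and key column word in the stream; the window is just one more grouping expression. (Later releases, from Spark 2.1 on, add withWatermark to bound how long state for late rows is kept.)

```scala
import org.apache.spark.sql.functions.{col, window}

// Event time is an ordinary column; a sliding window over it is simply
// another expression in the groupBy.
val windowedCounts = streamingDf
  .groupBy(
    window(col("eventTime"), "10 minutes", "5 minutes"),
    col("word"))
  .count()
```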
Delivering end-to-end exactly-once semantics was one of the key goals behind the design of Structured Streaming.
To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure end-to-end exactly-once semantics under any failure.
In short, this relies on the source being replayable by offset and the sink being idempotent; then it is enough to record the offsets in the write-ahead log and the state in the checkpoint to achieve exactly-once, because under the hood each trigger is really a batch job.
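A sketch of wiring that up (the checkpoint path is a placeholder): the checkpoint directory holds the offset write-ahead log plus the aggregation state, so a restarted query resumes from the exact trigger where it stopped.

```scala
// The checkpoint directory stores the write-ahead log of source offsets
// and the aggregation state; on restart, the unfinished trigger is replayed
// and the idempotent sink ensures each result is written once.
val recoverableQuery = wordCounts.writeStream
  .outputMode("complete")
  .option("checkpointLocation", "/checkpoints/wordcount")
  .format("console")
  .start()
```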