https://cwiki.apache.org/confluence/display/flink/flink+improvement+proposals
when a task fails during execution, flink currently resets the entire execution graph and triggers complete re-execution from the last completed checkpoint. this is more expensive than just re-executing the failed tasks.
if a single task fails, the entire job currently has to be stopped and recovered, which is clearly too heavy-handed;
proposal
a simple scheme: if a task fails, restart the whole pipeline connected to it. but if every node is connected to that task, this still restarts the entire job
that scheme is too naive, though. can the restart scope be reduced as much as possible?
if we want to restart only the failed task and its downstream tasks, without restarting the sources, the only option is caching
each node caches the intermediate results it sends out; when a task on some node dies, the upstream node simply re-sends its cached intermediate result, which avoids restarting from the sources
how to cache the intermediate results, in memory, on disk, or somewhere else, is just a matter of which variant to pick
caching intermediate result
this type of data stream caches all elements since the latest checkpoint, possibly spilling them to disk, if the data exceeds the memory capacity.
when a downstream operator restarts from that checkpoint, it can simply re-read that data stream without requiring the producing operator to restart. applicable to both batch (bounded) and streaming (unbounded) operations. when no checkpoints are used (batch), it needs to cache all data.
memory-only caching intermediate result
similar to the caching intermediate result, but discards sent data once the memory buffering capacity is exceeded. acts as a “best effort” helper for recovery, which will bound recovery when checkpoints are frequent enough for the in-between data to fit in memory. on the other hand, it comes essentially for free: it simply uses memory that would otherwise sit idle anyway.
blocking intermediate result
this is applicable only to bounded intermediate results (batch jobs). it means that the consuming operator starts only after the entire bounded result has been produced. this bounds the cancellations/restarts downstream in batch jobs.
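a minimal sketch of the spill-to-disk variant, assuming records are handed over as serialized byte arrays; the memory-only variant would simply drop records instead of spilling. this illustrates the caching idea only and is not flink's actual implementation:

import java.io.*;
import java.nio.file.*;
import java.util.*;

// hypothetical cache for one intermediate result partition: everything sent since the
// last checkpoint is kept in memory and spilled to disk beyond a budget, so a restarted
// consumer can replay the data without the producer being restarted.
public class SpillableResultCache {

    private final long memoryBudget;                       // bytes to keep in memory
    private final List<byte[]> inMemory = new ArrayList<>();
    private final Path spillFile;
    private long memoryUsed = 0;
    private int spilledRecords = 0;
    private DataOutputStream spillOut;                     // opened lazily on first spill

    public SpillableResultCache(long memoryBudget) throws IOException {
        this.memoryBudget = memoryBudget;
        this.spillFile = Files.createTempFile("intermediate-result", ".spill");
    }

    // producer side: remember every record that is sent downstream
    public void add(byte[] record) throws IOException {
        if (memoryUsed + record.length <= memoryBudget) {
            inMemory.add(record);
            memoryUsed += record.length;
        } else {                                           // the memory-only variant would give up here
            if (spillOut == null) {
                spillOut = new DataOutputStream(
                        new BufferedOutputStream(Files.newOutputStream(spillFile)));
            }
            spillOut.writeInt(record.length);
            spillOut.write(record);
            spilledRecords++;
        }
    }

    // recovery side: replay everything cached since the last checkpoint
    public List<byte[]> replay() throws IOException {
        List<byte[]> all = new ArrayList<>(inMemory);
        if (spillOut != null) {
            spillOut.flush();
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(Files.newInputStream(spillFile)))) {
                for (int i = 0; i < spilledRecords; i++) {
                    byte[] rec = new byte[in.readInt()];
                    in.readFully(rec);
                    all.add(rec);
                }
            }
        }
        return all;
    }

    // once the next checkpoint completes, the cached data can be dropped
    public void clear() throws IOException {
        inMemory.clear();
        memoryUsed = 0;
        spilledRecords = 0;
        if (spillOut != null) {
            spillOut.close();
            spillOut = null;
        }
        Files.deleteIfExists(spillFile);                   // recreated on the next spill
    }
}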
right now, in flink a windowfunction does not get a lot of information when a window fires.
the signature of windowfunction is this:
public interface WindowFunction<IN, OUT, KEY, W extends Window> extends Function, Serializable {
    void apply(KEY key, W window, Iterable<IN> input, Collector<OUT> out);
}
i.e., the user code only has access to the key for which the window fired, the window for which it fired, and the data of the window itself. in the future, we might like to extend the information available to the user function. we initially propose this as additional information:
why/when did the window fire. did it fire on time, i.e. when the watermark passed the end of the window. did it fire early because of a speculative early trigger or did it fire on late-arriving data.
how many times did we fire before for the current window. this would probably be an increasing index, such that each firing for a window can be uniquely identified.
in short, the information currently exposed to window functions is too limited; more should be provided, e.g. why and when the window fired.
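a hedged sketch of what such an extended interface could look like; the names below (RicherWindowFunction, FireInfo, FireReason, firingIndex) are made up for illustration and are not the actual api that came out of the flip:

import java.io.Serializable;

import org.apache.flink.api.common.functions.Function;
import org.apache.flink.streaming.api.windowing.windows.Window;
import org.apache.flink.util.Collector;

// hypothetical extension of WindowFunction that also tells the user code why the
// window fired and how many firings have happened for this window so far
public interface RicherWindowFunction<IN, OUT, KEY, W extends Window> extends Function, Serializable {

    void apply(KEY key, W window, FireInfo fireInfo, Iterable<IN> input, Collector<OUT> out);

    // why the window fired: on time (watermark passed the end of the window),
    // early (speculative trigger), or on late-arriving data
    enum FireReason { ON_TIME, EARLY, LATE }

    final class FireInfo implements Serializable {
        private final FireReason reason;
        private final long firingIndex;   // 0 for the first firing, increasing afterwards

        public FireInfo(FireReason reason, long firingIndex) {
            this.reason = reason;
            this.firingIndex = firingIndex;
        }

        public FireReason getReason() { return reason; }

        public long getFiringIndex() { return firingIndex; }
    }
}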
right now, the abilities of the window evictor are limited:
the evictor is called only before the windowfunction. (there can be use cases where the elements have to be evicted after the windowfunction is applied)
elements are evicted only from the beginning of the window. (there can be cases where we need to allow eviction of elements from anywhere within the window, as per the eviction logic the user wishes to implement)
today the evictor runs only before the windowfunction; can it also run after the windowfunction?
and the current interface only evicts from the beginning of the window; can elements be evicted from arbitrary positions as well?
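a hedged sketch of an evictor with both hooks; the shape roughly follows the direction of the flip, but the names and signatures here are illustrative rather than the final api:

import java.io.Serializable;
import java.util.Iterator;

import org.apache.flink.streaming.api.windowing.windows.Window;

// hypothetical evictor with a hook before and after the window function; elements can be
// removed from anywhere in the window by calling Iterator.remove() on the passed iterable
public interface TwoSidedEvictor<T, W extends Window> extends Serializable {

    void evictBefore(Iterable<T> elements, int size, W window);

    void evictAfter(Iterable<T> elements, int size, W window);
}

// example: after the window function ran, drop every element larger than a threshold,
// regardless of its position inside the window
class ThresholdEvictor<W extends Window> implements TwoSidedEvictor<Long, W> {

    private final long threshold;

    ThresholdEvictor(long threshold) {
        this.threshold = threshold;
    }

    @Override
    public void evictBefore(Iterable<Long> elements, int size, W window) {
        // nothing to do before the window function in this example
    }

    @Override
    public void evictAfter(Iterable<Long> elements, int size, W window) {
        for (Iterator<Long> it = elements.iterator(); it.hasNext(); ) {
            if (it.next() > threshold) {
                it.remove();
            }
        }
    }
}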
problem:
we experience some unexpected increase of data sent over the network for broadcasts with increasing number of slots per task manager.
the goal is to reduce the redundant data sent during broadcasts
currently, the broadcast data is effectively sent once per receiving slot;
the desired behavior is that
the data is sent to each taskmanager only once
the core idea is to split responsibilities out of the jobmanager
and introduce two new components, the resourcemanager and the dispatcher
the resourcemanager (introduced in flink 1.1) is the cluster-manager-specific component. there is a generic base class, and specific implementations for:
yarn
mesos
standalone-multi-job (standalone mode)
self-contained-single-job (docker/kubernetes)
obviously, supporting a different resource management platform then only requires a different resourcemanager implementation
the relationship between jobmanager, taskmanager and resourcemanager thus becomes the following:
the taskmanager registers with the resourcemanager and periodically reports the status of its slots
the jobmanager requests slots from the resourcemanager; the resourcemanager picks a taskmanager and tells it to offer slots to that jobmanager
the taskmanager then contacts the jobmanager directly to offer the slots
the jobmanager in turn keeps a slot pool that holds the slots it has acquired
the slotpool is a modification of what is currently the instancemanager.
this way, even if the resourcemanager dies, the jobmanager can keep using the slots it has already acquired
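a rough sketch of that slot negotiation, with made-up gateway interfaces just to show who talks to whom (these are not the actual flink rpc interfaces):

// messages exchanged in the sketch
class SlotRequest {
    final String jobManagerId;
    final String allocationId;

    SlotRequest(String jobManagerId, String allocationId) {
        this.jobManagerId = jobManagerId;
        this.allocationId = allocationId;
    }
}

class SlotOffer {
    final String taskManagerId;
    final String allocationId;
    final int slotIndex;

    SlotOffer(String taskManagerId, String allocationId, int slotIndex) {
        this.taskManagerId = taskManagerId;
        this.allocationId = allocationId;
        this.slotIndex = slotIndex;
    }
}

// taskmanager -> resourcemanager: registration and periodic slot reports;
// jobmanager -> resourcemanager: slot requests
interface ResourceManagerGateway {
    void registerTaskManager(String taskManagerId, int numSlots);
    void reportSlotStatus(String taskManagerId, int freeSlots);
    void requestSlot(SlotRequest request);
}

// resourcemanager -> taskmanager: tell the taskmanager to offer a slot to a jobmanager
interface TaskManagerGateway {
    void allocateSlotFor(SlotRequest request);
}

// taskmanager -> jobmanager: the slot is offered directly and ends up in the slot pool,
// which keeps working even if the resourcemanager later fails
interface JobManagerGateway {
    void offerSlot(SlotOffer offer);
}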
the new design includes the concept of a dispatcher. the dispatcher accepts job submissions from clients and starts the jobs on their behalf on a cluster manager.
in the future, the dispatcher will also help with the following aspects:
the dispatcher is a cross-job service that can run a long-lived web dashboard
future versions of the dispatcher should receive only http calls and thus can act as a bridge in firewalled clusters
the dispatcher never executes code and can thus be viewed as a trusted process. it can run with higher privileges (superuser credentials) and spawn jobs on behalf of other users (acquiring their authentication tokens). building on that, the dispatcher can manage user authentications
the benefits of separating the dispatcher from the jobmanager:
first, the dispatcher outlives individual jobs and runs the long-lived web dashboard; if a cluster or jobmanager dies, a job can simply be spawned on another one
second, the client talks to the dispatcher over http, which passes through firewalls easily
third, the dispatcher can act as a kind of proxy, e.g. for handling authentication
the concrete architecture for each cluster manager then looks as follows:
yarn,
compared to the state in flink 1.1, the new flink-on-yarn architecture offers the following benefits:
the client directly starts the job in yarn, rather than bootstrapping a cluster and after that submitting the job to that cluster. the client can hence disconnect immediately after the job was submitted
all user code libraries and config files are directly in the application classpath, rather than in the dynamic user code class loader
containers are requested as needed and will be released when not used any more
the “as needed” allocation of containers allows for different profiles of containers (cpu / memory) to be used for different operators
this architecture hosts the whole flink runtime inside yarn
benefits:
you no longer need to bring up a flink cluster first and then submit the job to it; you just submit the job directly. the yarn resourcemanager starts an application master that contains the flink resourcemanager and jobmanager; when the flink resourcemanager needs resources, it asks the yarn resourcemanager, which creates containers that run the taskmanagers
mesos,
this architecture is similar to the yarn one,
mesos-specific fault tolerance aspects
resourcemanager and jobmanager run inside a regular mesos container. the dispatcher is responsible for monitoring and restarting those containers in case they fail. the dispatcher itself must be made highly available by a mesos service like marathon
standalone,
the standalone setup should keep compatibility with current standalone setups.
the role of the long running jobmanager is now a “local dispatcher” process that spawns jobmanagers with jobs internally. the resourcemanager lives across jobs and handles taskmanager registration.
for highly-available setups, there are multiple dispatcher processes competing for leadership, similar to how the jobmanagers currently do.
more detailed steps are spelled out in the flip itself.
with the introduction of the metric system it is now time to make it easily accessible to users. as the webinterface is the first stop for users for any details about flink, it seems appropriate to expose the gathered metrics there as well.
the changes can be roughly broken down into 4 steps:
create a data-structure on the job-/taskmanager containing a metrics snapshot
transfer this snapshot to the webinterface back-end
store the snapshot in the webruntimemonitor in an easily accessible way
expose the stored metrics to the webinterface via rest api
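a small sketch of what the snapshot data structure from step 1 could look like; the class and field names are illustrative only:

import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// illustrative snapshot of all metrics reported by one job-/taskmanager at a point in time;
// the webruntimemonitor would keep the latest snapshot per reporting component and expose
// it through its rest api
public class MetricSnapshot implements Serializable {

    private final String componentId;          // e.g. a taskmanager id or "jobmanager"
    private final long timestamp;              // when the snapshot was taken
    private final Map<String, Number> gaugesAndCounters = new HashMap<>();

    public MetricSnapshot(String componentId, long timestamp) {
        this.componentId = componentId;
        this.timestamp = timestamp;
    }

    public void put(String metricName, Number value) {
        gaugesAndCounters.put(metricName, value);
    }

    public String getComponentId() { return componentId; }

    public long getTimestamp() { return timestamp; }

    public Map<String, Number> getMetrics() { return gaugesAndCounters; }
}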
the problem to solve: how to handle state during dynamic scaling
without state, dynamic scaling only needs to redistribute the traffic across the new operator parallelism
with state, however, scaling up requires splitting the state, and scaling down requires merging it
this is considerably trickier
in flink, state falls into three categories: operator state, function state, and key-value states
the basic idea here is finer-grained checkpoints;
previously the granularity was the task, so state could only be restored task by task; if one task is split into two, the task-level checkpoint would also have to be split
instead, finer-grained checkpoint units are stored independently of tasks, so they can be reassigned independently of tasks
for key-value state, for example, the concept of key groups is introduced, and a key group becomes the unit of checkpointing
in order to efficiently distribute key-value states across the cluster, they should be grouped into key groups. each key group represents a subset of the key space and is checkpointed as an independent unit. the key groups can then be re-assigned to different tasks if the dop changes.
when the operator parallelism changes, key groups just need to be reassigned to the new operator instances as whole units, and the corresponding checkpoint data restored on those instances, as shown in the figure
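a simplified sketch of how key-group assignment can work (flink's eventual implementation follows the same principle, with an extra hash for better distribution): keys map to a fixed number of key groups given by the maximum parallelism, and each subtask owns a contiguous range of key groups.

// maps keys to key groups and key groups to operator subtasks; only whole key groups
// move when the parallelism changes, so their checkpointed state moves with them
public final class KeyGroupAssignment {

    private KeyGroupAssignment() {}

    // which key group a key belongs to; depends only on the max parallelism,
    // so it stays stable across rescaling
    public static int assignToKeyGroup(Object key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // which subtask owns a given key group at the current parallelism:
    // the key groups are split into `parallelism` contiguous ranges
    public static int operatorIndexForKeyGroup(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;
        int keyGroup = assignToKeyGroup("user-42", maxParallelism);
        // same key group, possibly different owner before and after rescaling from 2 to 3 subtasks
        System.out.println("key group " + keyGroup
                + " -> subtask " + operatorIndexForKeyGroup(keyGroup, maxParallelism, 2)
                + " at parallelism 2, subtask " + operatorIndexForKeyGroup(keyGroup, maxParallelism, 3)
                + " at parallelism 3");
    }
}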
next, how is this solved for non-partitioned operator and function state?
take a kafkasource with 4 partitions and a source parallelism of 2, for example
after scaling down we get the situation on the left of the figure below: only task s1 remains, it loads only its own checkpoint, and nobody takes over s2's previous checkpoint
in principle, though, we want to end up with the situation on the right
after scaling up, the case on the left of the figure below appears as well: s1 and s2 load their old checkpoints, even though partitions 3 and 4 are no longer assigned to s2
the idea is the same: make the checkpoint granularity finer and independent of the tasks,
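a sketch of that idea for non-partitioned state: checkpoint it as a list of small, self-contained units (e.g. one (partition, offset) pair per kafka partition) instead of one blob per task, so the units can be re-distributed over any new parallelism on restore. the round-robin split below is just one possible redistribution scheme:

import java.util.ArrayList;
import java.util.List;

// redistributes checkpointed state units over a new parallelism; with one unit per kafka
// partition, scaling 2 -> 1 gives the single source all four partitions, and scaling
// 2 -> 4 gives each source exactly one partition
public final class ListStateRedistribution {

    private ListStateRedistribution() {}

    public static <T> List<List<T>> redistribute(List<T> allUnits, int newParallelism) {
        List<List<T>> perSubtask = new ArrayList<>();
        for (int i = 0; i < newParallelism; i++) {
            perSubtask.add(new ArrayList<>());
        }
        for (int i = 0; i < allUnits.size(); i++) {
            perSubtask.get(i % newParallelism).add(allUnits.get(i));   // round-robin
        }
        return perSubtask;
    }

    public static void main(String[] args) {
        List<String> partitionOffsets = List.of("p0:1200", "p1:980", "p2:1310", "p3:1045");
        System.out.println(redistribute(partitionOffsets, 1));   // one source gets all partitions
        System.out.println(redistribute(partitionOffsets, 4));   // each source gets one partition
    }
}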
the trigger mechanisms supported today are not flexible enough, and late elements can only be dropped; a more flexible and sensible dsl is needed for describing trigger policies
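purely as an illustration of what such a dsl could express (this is not the dsl proposed in the flip), one could imagine a fluent builder that combines an end-of-window firing with optional early firings and late firings instead of dropping late elements:

import java.time.Duration;

// illustrative fluent builder for trigger policies: fire at the end of the window,
// optionally fire early while waiting, and fire again (instead of dropping) for late elements
public final class TriggerPolicy {

    public enum LateFiring { DROP, FIRE_PER_ELEMENT }

    private Duration earlyInterval;                      // null means no early firings
    private LateFiring lateFiring = LateFiring.DROP;
    private Duration allowedLateness = Duration.ZERO;

    public static TriggerPolicy afterEndOfWindow() {
        return new TriggerPolicy();
    }

    public TriggerPolicy withEarlyFiringsEvery(Duration interval) {
        this.earlyInterval = interval;
        return this;
    }

    public TriggerPolicy withLateFiringsPerElement(Duration allowedLateness) {
        this.lateFiring = LateFiring.FIRE_PER_ELEMENT;
        this.allowedLateness = allowedLateness;
        return this;
    }

    @Override
    public String toString() {
        return "fire at end of window"
                + (earlyInterval != null ? ", early firing every " + earlyInterval : "")
                + (lateFiring == LateFiring.FIRE_PER_ELEMENT
                        ? ", re-fire per late element up to " + allowedLateness
                        : ", drop late elements");
    }

    public static void main(String[] args) {
        TriggerPolicy policy = TriggerPolicy.afterEndOfWindow()
                .withEarlyFiringsEvery(Duration.ofSeconds(30))
                .withLateFiringsPerElement(Duration.ofMinutes(10));
        System.out.println(policy);
    }
}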
currently checkpoints and savepoints are handled in slightly different ways with respect to storing and restoring them. the main differences are that savepoints 1) are manually triggered, 2) persist checkpoint meta data, and 3) are not automatically discarded.
with this flip, i propose to unify checkpoints and savepoints by allowing savepoints to be triggered automatically.
the table api is a declarative api to define queries on static and streaming tables. so far, only projection, selection, and union are supported operations on streaming tables. this flip proposes to add support for different types of aggregations on top of streaming tables. in particular, we seek to support:
group-window aggregates, i.e., aggregates which are computed for a group of elements. a (time or row-count) window is required to bound the infinite input stream into a finite group.
row-window aggregates, i.e., aggregates which are computed for each row, based on a window (range) of preceding and succeeding rows.
each type of aggregate shall be supported on keyed/grouped or non-keyed/grouped data streams for streaming tables as well as batch tables.
since time-windowed aggregates will be the first operations that require the definition of time, this flip also discusses how the table api handles time characteristics, timestamps, and watermarks.
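to make the semantics concrete without committing to exact table api syntax, here is a plain-java illustration of a group-window aggregate: the sum of amounts per user over 10-minute tumbling event-time windows (the row type and field names are made up for the example):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// plain-java illustration of what a tumbling group-window aggregate computes:
// rows are grouped by (user, window start) and the amounts are summed per group
public final class GroupWindowAggregateExample {

    static final long WINDOW_SIZE_MS = 10 * 60 * 1000;   // 10-minute tumbling windows

    record Row(String user, double amount, long eventTimeMs) {}

    public static Map<String, Double> sumPerUserAndWindow(List<Row> rows) {
        Map<String, Double> result = new HashMap<>();
        for (Row row : rows) {
            long windowStart = row.eventTimeMs() - (row.eventTimeMs() % WINDOW_SIZE_MS);
            String groupKey = row.user() + "@" + windowStart;
            result.merge(groupKey, row.amount(), Double::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("alice", 10.0, 0),
                new Row("alice", 5.0, 5 * 60 * 1000),     // same window as the first row
                new Row("alice", 7.0, 12 * 60 * 1000),    // next window
                new Row("bob", 3.0, 1 * 60 * 1000));
        // e.g. {alice@0=15.0, alice@600000=7.0, bob@0=3.0} (map order may vary)
        System.out.println(sumPerUserAndWindow(rows));
    }
}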
i/o access is, in most cases, a time-consuming process, making the tps of a single operator much lower than for in-memory computing, particularly for streaming jobs, where low latency is a big concern for users. starting multiple threads may be an option to handle this problem, but the drawbacks are obvious: the programming model for end users becomes more complicated, as they have to implement a threading model in the operator. furthermore, they have to pay attention to coordinating with checkpointing.
a stream eventually has to touch external storage, which creates an i/o bottleneck; writing to a database, for example, can stall the whole stream
how do we solve this? multi-threading is an option, but it makes the programming model more complex; frankly, it is not elegant
so the solution is essentially the reactor model, the classic way of dealing with i/o waits
asyncfunction: async i/o will be triggered in asyncfunction.
asyncwaitoperator: a streamoperator which will invoke asyncfunction.
asynccollector: for each input streaming record, an asynccollector will be created and passed into user's callback to get the async i/o result.
asynccollectorbuffer: a buffer to keep all asynccollectors.
emitter thread: a worker thread in asynccollectorbuffer, signaled when some asynccollectors have finished their async i/o, which then emits the results to the following operators.
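a sketch of how a user-defined asyncfunction could look, following the interfaces listed above; the database client is a hypothetical stand-in for any non-blocking client, and the exact signatures and packages may differ between flink versions:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.async.AsyncFunction;
import org.apache.flink.streaming.api.functions.async.collector.AsyncCollector;

// async i/o is triggered inside asyncInvoke; the operator thread is never blocked,
// the callback hands the result to the asynccollector when the query completes
public class AsyncDatabaseLookup implements AsyncFunction<String, Tuple2<String, String>> {

    // hypothetical non-blocking client, stands in for any async database/http client
    public interface DatabaseClient extends java.io.Serializable {
        CompletableFuture<String> query(String key);
    }

    private final DatabaseClient client;

    public AsyncDatabaseLookup(DatabaseClient client) {
        this.client = client;
    }

    @Override
    public void asyncInvoke(String key, AsyncCollector<Tuple2<String, String>> collector) {
        client.query(key).whenComplete((value, error) -> {
            if (error != null) {
                collector.collect(error);                                    // forward the failure
            } else {
                collector.collect(Collections.singletonList(Tuple2.of(key, value)));
            }
        });
    }
}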
a normal operator calls its function and then sends the data out through the collector
the asyncwaitoperator, however, cannot get the result right away, so the asynccollector is passed into the user's callback, and the data is emitted only when the function triggers the callback
this raises a problem: the emit order then depends on how fast each async call completes, which is fine if the output order does not matter
but to mimic synchronous behavior, the emit order should in principle match the arrival order
that requires a buffer that caches all the asynccollectors, namely the asynccollectorbuffer
when a callback runs, the corresponding asynccollector is filled with the result and marked as ready to emit
whether it is actually emitted, though, depends on a separate emitter
the emitter checks whether this asynccollector can really be emitted, e.g. whether everything before it has already been emitted; otherwise it has to wait
watermarks also need special care: a watermark may only be emitted after all the records before it have been emitted successfully; only then is consistency guaranteed
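a compact sketch of that ordered-emission logic; the class below is illustrative and much simpler than the real asynccollectorbuffer (no checkpointing, and watermarks are handled only implicitly through the queue order):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// buffers async results in arrival order and emits an entry only once everything that
// arrived before it has completed, so the output order matches the input order
public class OrderedAsyncBuffer<T> {

    public static final class Entry<V> {
        private V result;
        private boolean done;
    }

    private final Deque<Entry<T>> queue = new ArrayDeque<>();
    private final Consumer<T> downstream;   // forwards finished results to the next operator

    public OrderedAsyncBuffer(Consumer<T> downstream) {
        this.downstream = downstream;
    }

    // called when a new input record arrives; the returned entry is completed by the callback
    public synchronized Entry<T> register() {
        Entry<T> entry = new Entry<>();
        queue.addLast(entry);
        return entry;
    }

    // called from the async callback once the i/o result is available
    public synchronized void complete(Entry<T> entry, T result) {
        entry.result = result;
        entry.done = true;
        // drain from the head as long as results are ready; a watermark would be queued
        // like any other entry and therefore only leaves after all earlier records
        while (!queue.isEmpty() && queue.peekFirst().done) {
            downstream.accept(queue.pollFirst().result);
        }
    }
}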