
HBase Study Notes (Part 2): HBase Architecture

HBase Architectural Components

HBase also uses a master-slave architecture, made up of three parts: HRegionServer, HBase Master, and ZooKeeper.

The RegionServers handle data reads and writes and interact directly with clients; operations on regions are handled by the HMaster; ZooKeeper keeps track of which nodes are alive.

Underneath, HBase stores its data in HDFS files, so the HDFS NameNode and DataNodes are involved as well. RegionServers are collocated with HDFS DataNodes, so a RegionServer can keep its data on the DataNode of its own node. The NameNode maintains the metadata for every physical data block.

To start with, a rough idea of what each component does is enough; each one is covered in detail below.

Physically, HBase is composed of three types of servers in a master slave type of architecture. Region servers serve data for reads and writes. When accessing data, clients communicate with HBase RegionServers directly. Region assignment, DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper maintains a live cluster state.

The Hadoop DataNode stores the data that the Region Server is managing. All HBase data is stored in HDFS files. Region Servers are collocated with the HDFS DataNodes, which enable data locality (putting the data close to where it is needed) for the data served by the RegionServers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.

The NameNode maintains metadata information for all the physical data blocks that comprise the files.
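
Since clients talk to ZooKeeper to locate data and then to the RegionServers directly, opening a connection only requires the ZooKeeper quorum. A minimal sketch with the HBase Java client, assuming hypothetical quorum hosts zk1, zk2, zk3 and a table named my_table:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client only needs the ZooKeeper quorum; it discovers the
        // META table and the RegionServers from there.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // hypothetical hosts
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            // Reads and writes go directly to the RegionServers, not the master.
        }
    }
}
```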

Regions

As mentioned in the previous post, an HBase table is split into chunks, each called a Region. Each Region holds the rows from its startKey to its endKey. These Regions are distributed across the nodes of the cluster for storage; each such node is called a RegionServer, and these nodes serve the data for reads and writes. One RegionServer can handle roughly 1,000 regions.

HBase Tables are divided horizontally by row key range into “Regions.” A region contains all rows in the table between the region’s start key and end key. Regions are assigned to the nodes in the cluster, called “Region Servers,” and these serve data for reads and writes. A region server can serve about 1,000 regions.
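
Because a region is nothing more than a row-key range, regions can be seen directly when creating a table with pre-split keys. A sketch using the Java Admin API; the table name, column family, and split points are made up for illustration:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void createPreSplitTable(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            TableDescriptor desc = TableDescriptorBuilder
                    .newBuilder(TableName.valueOf("users"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
                    .build();
            // Three split points produce four regions by start/end key:
            // (-inf,"g"), ["g","n"), ["n","t"), ["t",+inf)
            byte[][] splitKeys = {
                    Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```

A row then routes purely by key: "alice" lands in the first region, "nancy" in the third.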

HBase HMaster

The HMaster coordinates the regions. Concretely, it:

  1. Manages the RegionServers and balances load across them.
  2. Manages and assigns Regions, e.g. assigning the new Regions created by a Region split, and migrating a failed RegionServer's Regions to other RegionServers.
  3. Monitors the state of every RegionServer in the cluster (via heartbeats and by watching state in ZooKeeper).
  4. Handles schema change requests (creating, deleting, and altering table definitions).
Region assignment, DDL (create, delete tables) operations are handled by the HBase Master.

A master is responsible for:

  • Coordinating the region servers
    - Assigning regions on startup, re-assigning regions for recovery or load balancing
    - Monitoring all RegionServer instances in the cluster (listens for notifications from ZooKeeper)
  • Admin functions
    - Interface for creating, deleting, updating tables
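
All of these admin functions are exposed to applications through the client's Admin interface, and each call below is handled by the HMaster rather than by a RegionServer. A sketch (the table and family names are invented):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class DdlExample {
    public static void ddl(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            TableName name = TableName.valueOf("events");
            // Create a table with one column family.
            admin.createTable(TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build());
            // Update the schema: add a second column family.
            admin.addColumnFamily(name, ColumnFamilyDescriptorBuilder.of("m"));
            // Delete: a table must be disabled before it can be dropped.
            admin.disableTable(name);
            admin.deleteTable(name);
        }
    }
}
```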

ZooKeeper: The Coordinator

ZooKeeper acts as the coordinator and keeps track of node liveness: which nodes are running and which have died. Every node periodically sends it a heartbeat, so ZooKeeper always knows how each node is doing, and it raises a notification whenever a node fails.
HBase uses ZooKeeper as a distributed coordination service to maintain server state in the cluster. Zookeeper maintains which servers are alive and available, and provides server failure notification. Zookeeper uses consensus to guarantee common shared state. Note that there should be three or five machines for consensus.

How the Components Work Together

So how do these components work together?

ZooKeeper acts as the coordinator for the state information shared across the system. The RegionServers and the active HMaster each establish a session with ZooKeeper, and ZooKeeper maintains an ephemeral node for each live session via heartbeats (the English original below puts this more precisely than any paraphrase).

Each RegionServer creates an ephemeral node, and the HMaster watches these nodes to discover available RegionServers and to detect which of them have died. The HMasters compete to create an ephemeral node of their own; ZooKeeper decides the winner and guarantees that only one HMaster is active at any time. The active HMaster sends heartbeats to ZooKeeper, while the HMasters that lost the election stay inactive, each itching to replace the one active HMaster, so they keep listening for the ZooKeeper notification that it has died. It strikes me that this system resembles human society: for anyone in a high position, countless people below are waiting for him to stumble so they can take his place.

If a RegionServer or the active HMaster stops sending heartbeats to ZooKeeper, its session is closed and every ephemeral node belonging to the failed node is deleted. Listeners watching those nodes are notified of the deletions. Because the active HMaster watches the RegionServers' state, it starts recovery as soon as one of them fails. And if it is the HMaster itself that died, the fix is simple: the inactive HMasters are notified immediately and start competing for the job.

To summarize the process:

1. After the HMaster and RegionServers connect to ZooKeeper, each creates an ephemeral node and keeps it alive via the heartbeat mechanism. If an ephemeral node expires, the HMaster is notified and reacts accordingly.
2. The HMaster watches the ephemeral nodes in ZooKeeper to detect RegionServers joining or going down.
3. The first HMaster to connect to ZooKeeper creates the ephemeral node that marks the active HMaster; HMasters that join later watch that node. If the active HMaster dies, the node disappears, the other HMasters are notified, and one of them turns itself into the active HMaster. Before becoming active, each of them creates its own ephemeral node under /hbase/backup-masters/.
Zookeeper is used to coordinate shared state information for members of distributed systems. Region servers and the active HMaster connect with a session to ZooKeeper. The ZooKeeper maintains ephemeral nodes for active sessions via heartbeats.

Each Region Server creates an ephemeral node. The HMaster monitors these nodes to discover available region servers, and it also monitors these nodes for server failures. HMasters vie to create an ephemeral node. Zookeeper determines the first one and uses it to make sure that only one master is active. The active HMaster sends heartbeats to Zookeeper, and the inactive HMaster listens for notifications of the active HMaster failure.

If a region server or the active HMaster fails to send a heartbeat, the session is expired and the corresponding ephemeral node is deleted. Listeners for updates will be notified of the deleted nodes. The active HMaster listens for region servers, and will recover region servers on failure. The Inactive HMaster listens for active HMaster failure, and if an active HMaster fails, the inactive HMaster becomes active.
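
The ephemeral nodes involved are visible with a plain ZooKeeper client. With the default zookeeper.znode.parent of /hbase, live RegionServers register children under /hbase/rs and the active master holds /hbase/master. A sketch, assuming a quorum reachable at localhost:2181, that lists the live servers and sets a watch the way a backup master would:

```java
import java.util.List;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkWatchExample {
    public static void main(String[] args) throws Exception {
        Watcher watcher = (WatchedEvent event) -> {
            // Fires when a watched node changes, e.g. when the active
            // master's session expires and /hbase/master is deleted.
            System.out.println("event: " + event.getType() + " " + event.getPath());
        };
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, watcher);
        // Each live RegionServer has an ephemeral child node here.
        List<String> regionServers = zk.getChildren("/hbase/rs", true);
        System.out.println("live region servers: " + regionServers);
        // Watch the active master's ephemeral node, as a backup master does.
        zk.exists("/hbase/master", true);
        zk.close();
    }
}
```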

HBase First Read or Write

HBase has a special catalog table called the META table, which records which node in the cluster hosts each region (recall what a region is: one of the chunks a whole table is split into). The location of this table is stored in ZooKeeper; in other words, the META table itself lives on some RegionServer, but only ZooKeeper knows which one.

When a client wants to read or write data, the unavoidable question is: which node holds the data I want, or which node should the write go to? The answer comes from the META table.

Step one: the client asks ZooKeeper which RegionServer holds the META table.

Step two: the client goes to that RegionServer and looks the region up by the key it wants to access. Having found it, the client caches this information together with the META table's location, so the next time it needs the same data it can skip the earlier steps and consult its cache; if the cache misses, it re-queries the META table using the cached META location.

Step three is simple: step two already told us which RegionServer holds the data for the key, so the client just goes there.

There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.
This is what happens the first time a client reads or writes to HBase:
1. The client gets the Region server that hosts the META table from ZooKeeper.
2. The client will query the .META. server to get the region server corresponding to the row key it wants to access. The client caches this information along with the META table location.
3. It will get the Row from the corresponding Region Server.
For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it will re-query and update the cache.
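
All of that lookup and caching happens inside the client library; application code simply issues the Get. A sketch (table, family, and qualifier names are invented):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FirstReadExample {
    public static byte[] read(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("users"))) {
            // Behind this call the client (1) asks ZooKeeper for the META
            // location, (2) queries META for the region server owning the
            // row key, caches both, then (3) reads from that region server.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            return result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        }
    }
}
```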

HBase Meta Table

So what exactly is this META table? It holds a list of the locations of all Regions in the system, organized like a b-tree, with keys and values as follows:
• This META table is an HBase table that keeps a list of all regions in the system.
• The .META. table is like a b-tree.
• The .META. table structure is as follows:
  - Key: region start key, region id
  - Values: RegionServer
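
You can inspect the catalog yourself. In current HBase releases the .META. table is exposed as hbase:meta, and scanning it prints one row per region, keyed by table name, region start key, and region id. A sketch:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MetaScanExample {
    public static void printRegions(Connection connection) throws Exception {
        try (Table meta = connection.getTable(TableName.META_TABLE_NAME);
             ResultScanner scanner = meta.getScanner(new Scan())) {
            for (Result row : scanner) {
                // Row key format: <table>,<region start key>,<region id>
                System.out.println(Bytes.toString(row.getRow()));
            }
        }
    }
}
```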

Region Server Components

Now let's look at what a RegionServer is made of: the WAL, the BlockCache, the MemStore, and HFiles.

WAL: its full name is Write Ahead Log, and it is simply a log file stored on the distributed file system. It records data that has not yet been persisted to disk and is used for recovery. In other words, whatever you want to write must first be registered in this log. The reasoning is simple: without a log, how would the database recover its data if something went wrong mid-write, say a power failure? With the log written first, it does not matter that the data never reached disk: the log shows exactly which operations were performed, comparing the log against what is currently in the database shows where the failure happened, and replaying the log from there step by step restores the data.

BlockCache: the read cache. It holds frequently read data and uses the LRU algorithm, meaning that when the cache is full, the least recently used data is evicted.

MemStore: the write cache. It holds data that has not yet been written to disk, kept in sorted order. There is one MemStore per column family.

HFile: stores the rows as KeyValue pairs on disk.

A Region Server runs on an HDFS data node and has the following components:
• WAL: Write Ahead Log is a file on the distributed file system. The WAL is used to store new data that hasn’t yet been persisted to permanent storage; it is used for recovery in the case of failure.
• BlockCache: is the read cache. It stores frequently read data in memory. Least Recently Used data is evicted when full.
• MemStore: is the write cache. It stores new data which has not yet been written to disk. It is sorted before writing to disk. There is one MemStore per column family per region.
• Hfiles store the rows as sorted KeyValues on disk.
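
How large these components may grow is ordinary configuration. The sketch below sets a few of the standard keys programmatically; in practice they belong in hbase-site.xml, and the values shown are illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RegionServerTuning {
    public static Configuration tuned() {
        Configuration conf = HBaseConfiguration.create();
        // BlockCache: fraction of the heap given to the read cache.
        conf.setFloat("hfile.block.cache.size", 0.4f);
        // MemStore: flush a region's memstore once it reaches this size.
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        // Upper bound on the heap fraction used by all memstores combined.
        conf.setFloat("hbase.regionserver.global.memstore.size", 0.4f);
        return conf;
    }
}
```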

HBase Write Steps

When a client issues a Put request, it first resolves the RowKey through the META table to find the RegionServer the data must ultimately go to.

Next, the Put is written first into that RegionServer's WAL file.

Once the log write succeeds, the RegionServer locates the target Region using the TableName and RowKey in the Put, then finds the MemStore for the given column family and writes the data into that MemStore.

Finally, once the data is written, the RegionServer sends the client an acknowledgement that the write is done.

HBase Write Steps (1)

When the client issues a Put request, the first step is to write the data to the write-ahead log, the WAL:

- Edits are appended to the end of the WAL file that is stored on disk.

- The WAL is used to recover not-yet-persisted data in case a server crashes.

HBase Write Steps (2)

Once the data is written to the WAL, it is placed in the MemStore. Then, the put request acknowledgement returns to the client.
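
Seen from the application, the WAL-then-MemStore sequence is hidden behind a single put call; the client can only influence how the WAL step is performed, via the durability setting. A sketch (names invented):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteExample {
    public static void write(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("users"))) {
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("alice"));
            // SYNC_WAL (the default) means the call returns only after the
            // edit is in the WAL and the MemStore, matching the steps above.
            put.setDurability(Durability.SYNC_WAL);
            table.put(put);
        }
    }
}
```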

HBase MemStore

In the previous step the data went into the MemStore, which is a write buffer. What does that mean? Writes first land in this buffer, where each entry is a KeyValue structure, kept sorted by key.
The MemStore stores updates in memory as sorted KeyValues, the same as it would be stored in an HFile. There is one MemStore per column family. The updates are sorted per column family.
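
HBase's actual MemStore keeps its cells in a concurrent skip list so that entries stay sorted while writers insert concurrently. The toy sketch below, which is not HBase code, shows the essential property with a ConcurrentSkipListMap: whatever order rows arrive in, iteration is in key order:

```java
import java.util.concurrent.ConcurrentSkipListMap;

public class ToyMemStore {
    // Sorted, thread-safe map: iteration always yields keys in order,
    // which is what makes the eventual flush a sequential write.
    private final ConcurrentSkipListMap<String, String> cells =
            new ConcurrentSkipListMap<>();

    public void put(String rowKey, String value) {
        cells.put(rowKey, value); // lands in sorted position immediately
    }

    public void dump() {
        cells.forEach((k, v) -> System.out.println(k + " -> " + v));
    }

    public static void main(String[] args) {
        ToyMemStore m = new ToyMemStore();
        m.put("row-9", "c");
        m.put("row-1", "a");
        m.put("row-5", "b");
        m.dump(); // prints row-1, row-5, row-9 in key order
    }
}
```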

HBase Region Flush

The buffer has a bounded size; when it fills up, its contents are flushed to a new HFile for permanent storage. In HBase, each column family can have multiple HFiles, and the data in an HFile has the same format as in the buffer: one Key mapped to one Value.

Note that when one MemStore fills up, all the MemStores of the region flush their data to HFiles. This is why the official advice is not to have too many column families in a table: each column family has its own MemStore, and too many of them leads to frequent buffer flushes and other performance problems.

When the buffer is flushed into an HFile, a sequence number is saved along with it. What is that for? It lets the system know how much of the data this HFile has persisted. The sequence number is stored in the HFile as a meta field, so every flush effectively stamps the HFile with a marker.

When the MemStore accumulates enough data, the entire sorted set is written to a new HFile in HDFS. HBase uses multiple HFiles per column family, which contain the actual cells, or KeyValue instances. These files are created over time as KeyValue edits sorted in the MemStores are flushed as files to disk.

Note that this is one reason why there is a limit to the number of column families in HBase. There is one MemStore per CF; when one is full, they all flush. It also saves the last written sequence number so the system knows what was persisted so far.

The highest sequence number is stored as a meta field in each HFile, to reflect where persisting has ended and where to continue. On region startup, the sequence number is read, and the highest is used as the sequence number for new edits.
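
A flush can also be requested explicitly, which makes the behavior above easy to observe: after the call, the table's MemStores have been written out as new HFiles. A sketch:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class FlushExample {
    public static void flush(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            // Forces every MemStore of every region of the table to be
            // written to new HFiles, together with its sequence number.
            admin.flush(TableName.valueOf("users"));
        }
    }
}
```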

HBase HFile

The data in the buffer is sorted by key, so flushing to an HFile is just a matter of writing the records out one by one in order. A sequential write like this is very fast because it avoids moving the disk head.
Data is stored in an HFile which contains sorted key/values. When the MemStore accumulates enough data, the entire sorted KeyValue set is written to a new HFile in HDFS. This is a sequential write. It is very fast, as it avoids moving the disk drive head.

HBase HFile Structure

The structure of an HFile is comparatively complex, because query performance matters: the worst case would be scanning an entire file only to discover that the data you want is not in it. So some care goes into how the file is organized. How can you tell whether the data is in a file without scanning all of it? The answer that comes to mind is an index, and that is exactly the idea HFile uses: a multi-level index, similar in form to a b-tree. One has to marvel at how heavily databases lean on the b-tree!
An HFile contains a multi-layered index which allows HBase to seek to the data without having to read the whole file. The multi-level index is like a b+tree:
• Key value pairs are stored in increasing order
• Indexes point by row key to the key value data in 64KB “blocks”
• Each block has its own leaf-index
• The last key of each block is put in the intermediate index
• The root index points to the intermediate index
The trailer, located at the end of the file, points to the meta blocks and is written at the end of persisting the data to the file. The trailer also has information like bloom filters and time range info. Bloom filters help to skip files that do not contain a certain row key. The time range info is useful for skipping the file if it is not in the time range the read is looking for.
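
To make the index idea concrete, here is a toy two-level lookup in the same spirit; this illustrates the b-tree-style search, not HBase's real HFile reader. Each index entry records the last key of a block, so a binary search identifies the single block that could contain the target:

```java
import java.util.Arrays;

public class ToyBlockIndex {
    // lastKeys[i] is the last row key stored in block i (sorted ascending).
    static int findBlock(String[] lastKeys, String target) {
        int pos = Arrays.binarySearch(lastKeys, target);
        // Exact hit: the key is the last entry of that block.
        // Miss: binarySearch returns (-(insertion point) - 1); the insertion
        // point is the first block whose last key is greater than the target.
        return pos >= 0 ? pos : -pos - 1;
    }

    public static void main(String[] args) {
        String[] lastKeys = {"f", "m", "t", "z"}; // four 64KB blocks
        System.out.println(findBlock(lastKeys, "h")); // 1: the block ("f".."m"]
        System.out.println(findBlock(lastKeys, "a")); // 0: the first block
    }
}
```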

HFile Index

When an HFile is opened, its index is loaded into the BlockCache. Remember what we said the BlockCache is? The read cache.
The index, which we just discussed, is loaded when the HFile is opened and kept in memory. This allows lookups to be performed with a single disk seek.

HBase Read Merge

Now that we have covered HBase's storage structure, reads raise a question: where might a row's data live? It may already be persisted in an HFile; it may be in the write buffer, the MemStore, not yet flushed; and frequently read data may sit in the read cache, the BlockCache. So in what order does a read look for the data? The steps are:

First, where should the read cache rank? It exists precisely for efficient reads, so the BlockCache is unquestionably the first priority; do not forget it is managed with LRU.

Next, HFiles or the write cache? There can be many HFiles, so they are clearly the least efficient option, while each column family has only one MemStore, which is necessarily much faster than the HFiles; so the MemStore is second priority: if the read cache misses, look in the MemStore.

Finally, if neither of those two steps finds the data, there is no choice left but to look in the HFiles.

We have seen that the KeyValue cells corresponding to one row can be in multiple places, row cells already persisted are in Hfiles, recently updated cells are in the MemStore, and recently read cells are in the Block cache. So when you read a row, how does the system get the corresponding cells to return? A Read merges Key Values from the block cache, MemStore, and HFiles in the following steps:
1. First, the scanner looks for the Row cells in the Block cache - the read cache. Recently Read Key Values are cached here, and Least Recently Used are evicted when memory is needed.
2. Next, the scanner looks in the MemStore, the write cache in memory containing the most recent writes.
3. If the scanner does not find all of the row cells in the MemStore and Block Cache, then HBase will use the Block Cache indexes and bloom filters to load HFiles into memory, which may contain the target row cells.
As discussed earlier, there may be many HFiles per MemStore, which means for a read, multiple files may have to be examined, which can affect the performance. This is called read amplification.
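
The time-range pruning mentioned in step 3 is something a client can exploit directly: narrowing a read's time range lets the server skip any HFile whose recorded time range does not overlap it. A sketch with illustrative epoch-millisecond bounds:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeReadExample {
    public static Result readRecent(Connection connection) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("events"))) {
            Get get = new Get(Bytes.toBytes("row-42"));
            // Only cells written in this window are wanted, so HFiles whose
            // time range lies entirely outside it can be skipped on read.
            get.setTimeRange(1700000000000L, 1700086400000L);
            return table.get(get);
        }
    }
}
```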

HBase Minor Compaction

As noted earlier, writes go to the buffer first, and when the buffer fills, its contents are flushed out to HFiles for permanent storage. If writing continues like this, the HFiles keep multiplying and become awkward to manage, which is where the compaction operation comes in: the word means pressing and packing things together.

HBase has two kinds of compaction: Minor Compaction and Major Compaction.

A Minor Compaction merges several small HFiles into larger ones. It clearly reduces the number of HFiles, and along the way it does not process Cells that are already Deleted or Expired. The result of one Minor Compaction is fewer, larger HFiles.

HBase will automatically pick some smaller HFiles and rewrite them into fewer bigger Hfiles. This process is called minor compaction. Minor compaction reduces the number of storage files by rewriting smaller files into fewer but larger ones, performing a merge sort.

HBase Major Compaction

A Major Compaction merges all the HFiles belonging to a Region into one HFile, that is, it merges the multiple files of each column family into a single file. During this process, Cells marked as Deleted are removed, Expired Cells are discarded, and Cells exceeding the maximum number of versions are discarded. The downside is that this merge is very time-consuming.
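
Both kinds of compaction can also be requested through the Admin API; the thresholds that trigger them automatically (for example hbase.hstore.compactionThreshold and hbase.hregion.majorcompaction) are ordinary configuration keys. A sketch:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class CompactionExample {
    public static void compact(Connection connection) throws Exception {
        try (Admin admin = connection.getAdmin()) {
            TableName name = TableName.valueOf("users");
            // Minor compaction: merge some smaller HFiles; deleted and
            // expired cells are not removed yet.
            admin.compact(name);
            // Major compaction: rewrite each store down to one HFile and
            // drop deleted, expired, and over-versioned cells. Expensive.
            admin.majorCompact(name);
        }
    }
}
```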
