0x00 前言

位圖索引被廣泛用于資料庫和搜尋引擎中，通過利用位級并行，它們可以顯著加快查詢速度。但是，位圖索引會占用大量的記憶體，是以我們會更喜歡壓縮位圖索引。 Roaring Bitmaps 就是一種十分優秀的壓縮位圖索引，後文統稱 RBM。

壓縮位圖索引有很多種，比如基于 RLE（Run-Length

Encoding，運作長度編碼）的WAH (Word Aligned Hybrid Compression Scheme) 和 Concise (Compressed ‘n’ Composable Integer Set)。相比較前者， RBM 能提供更優秀的壓縮性能和更快的查詢效率。

0x01 用途

RBM 的用途和 Bitmap 很差不多（比如說索引），隻是說從性能、空間使用率各方面更優秀了。目前 RBM 已經在很多成熟的開源大資料平台中使用，簡單列幾個作為參考：

Apache Lucene and derivative systems such as Solr and Elasticsearch,
Metamarkets’ Druid,
Apache Spark,
Apache Hive,
eBay’s Apache Kylin,
……

總之 RBM 很優秀，大家都在用，學一學可能自己寫代碼用不到，但是對于了解這些常用的開源大資料系統沒有壞處。

0x02 原理

一、英文版

原理的話先直接上一段論文的原文，兩三段基本把整個 RBM 的設計思想給講清楚了。不想看英文了可以直接跳過看後面的中文總結。

We partition the range of 32-bit indexes ([0; n)) into chunks of 216 integers sharing the same 16 most significant digits. We use specialized containers to store their 16 least significant bits.

When a chunk contains no more than 4096 integers, we use a sorted array of packed 16-bit integers. When there are more than 4096 integers, we use a 216-bit bitmap. Thus, we have two types of containers: an array container for sparse chunks and a bitmap container for dense chunks. The 4096 threshold insures that at the level of the containers, each integer uses no more than 16 bits: we either use 216 bits for more than 4096 integers, using less than 16 bits/integer, or else we use exactly 16 bits/integer.

The containers are stored in a dynamic array with the shared 16 most-significant bits: this serves as a first-level index. The array keeps the containers sorted by the 16 most-significant bits.We expect this first-level index to be typically small: when n = 1 000 000, it contains at most 16 entries. Thus it should often remain in the CPU cache. The containers themselves should never use much more than 8 kB.

二、主要思想

RBM 的主要思想并不複雜，簡單來講，有如下三條：

我們将 32-bit 的範圍 ([0, n)) 劃分為 2^16 個桶，每一個桶有一個 Container 來存放一個數值的低16位；
在存儲和查詢數值的時候，我們将一個數值 k 劃分為高 16 位 (k % 2^16) 和低 16 位 (k mod 2^16) ，取高 16 位找到對應的桶，然後在低 16 位存放在相應的 Container 中；
容器的話， RBM 使用兩種容器結構： Array Container 和 Bitmap Container。Array Container 存放稀疏的資料，Bitmap Container 存放稠密的資料。即，若一個 Container 裡面的 Integer 數量小于 4096，就用 Short 類型的有序數組來存儲值。若大于 4096，就用 Bitmap 來存儲值。

如下圖，就是官網給出的一個例子，三個容器分别代表了三個資料集：

the list of the first 1000 multiples of 62
all integers [216, 216 + 100)
all even numbers in [2216, 3216)

0x03 舉個栗子

看完前面的還不知道在說什麼？沒關系，舉個栗子說明就好了。現在我們要将 821697800 這個 32 bit 的整數插入 RBM 中，整個算法流程是這樣的：

821697800 對應的 16 進制數為 30FA1D08，其中高 16 位為 30FA，低16位為 1D08。
我們先用二分查找從一級索引（即 Container Array）中找到數值為 30FA 的容器（如果該容器不存在，則建立一個），從圖中我們可以看到，該容器是一個 Bitmap 容器。
找到了相應的容器後，看一下低 16 位的數值 1D08，它相當于是 7432，是以在 Bitmap 中找到相應的位置，将其置為 1 即可。

是不是很簡單？然後換一個數值插入，比如說 191037，它的 16 進制的數值是 0002EA3D ，插入流程和前面的例子一樣，不同的就在于，高 16 位對應的容器是一個 Array Container，我們仍然用二分查找找到相應的位置再插入即可。

0x04 原理補充

RBM 的基本原理就這些，基于這種設計原理會有一些額外的操作要提一下。

請注意上文提到的一句話：

若一個 Container 裡面的 Integer 數量小于 4096，就用 Short 類型的有序數組來存儲值。若大于 4096，就用 Bitmap 來存儲值。

先解釋一下為什麼這裡用的 4096 這個門檻值？因為一個 Integer 的低 16 位是 2Byte，是以對應到 Arrary Container 中的話就是 2Byte * 4096 = 8KB；同樣，對于 Bitmap Container 來講，2^16 個 bit 也相當于是 8KB。

然後，基于前面提到的兩種 Container，在兩個 Container 之間的 Union (bitwise OR) 或者 Intersection (bitwise AND) 操作又會出現下面三種場景：

Bitmap vs Bitmap
Bitmap vs Array
Array vs Array

RBM 提供了相應的算法來高效地實作這些操作，比如下圖是 Bitmap vs Bitmap，這裡暫不再深入讨論，感興趣的可以看一下論文原文。

0xFF 總結

好了，RBM 的大緻原理就這些，不深入也沒有簡單代碼的實作。僅能做一個入門的參考。

本文是參考論文《Better bitmap performance with Roaring bitmaps》，該論文中提到的是 Bitmap 和 Array 兩種容器，算是包含了 RBM 的主要思想。然後，在另一篇論文《Consistently faster and smaller compressed bitmaps with Roaring》中會對 RBM 有更深入的探讨，并引入了一種新的容器： Run，感興趣的童鞋可以深入看一看。

不深入而淺出 Roaring Bitmaps 的基本原理0x00 前言0x01 用途0x02 原理0x03 舉個栗子0x04 原理補充0xFF 總結