用Redis快速實作BloomFilter！

背景

最近工作上有個類似需求是：現有約3億條資料詞典存在于一個csv檔案A中，作為資料源。對于使用者輸入的任意單詞M，需要快速的在A中比對M單詞是否存在。（A檔案約3G大小左右，總行數三億）

拿到這個需求，你的第一想法怎麼做呢？

正常思路可能是：

将csv檔案A導入某關系型資料庫。

sql查詢按M比對。

上面的方式有個明顯的缺點是：慢！

3億多行的資料，即便是建好索引進行檢索，比對到也得話不少時間（筆者沒親自試過，感興趣的朋友可以自行測試測試，理論上快不起來的）。

目前能在時間複雜度和空間複雜度上達到最佳的方案，恐怕就是Bloom Filter了，維基位址：Bloom Filter

此處給不太了解Bloom Filter的讀者看，熟悉的朋友直接看下一節。

本文場景Bloom Filter 使用思路解釋：

假設申請了一段bit位大數組（即數組中的元素隻能是一個bit位，1或0，預設元素值都為0）

将csv檔案A中的每個單詞，經過多個hash函數進行hash運算之後得到在大數組中對應的多個下标位置

将步驟2中得到的多個下标位置的bit位都置為1.

對于使用者輸入的任意單詞M，按照2的步驟得到多個下标位置，其對應大數組中的值全部為1則存在，否則不存在。

方案選型

實作Bloom Filter的方法很多，有各種語言版本的，這裡為了真切感受一下算法的魅力，筆者這裡決定用java 代碼徒手撸了！

另一方面，考慮到分布式應用的需要，顯然在單機記憶體上建構Bloom Filter存儲是不太合适的。這裡選擇redis。

redis有以下為操作，可以用于實作bloomfilter：

redis> SETBIT bit 10086 1

(integer) 0

redis> GETBIT bit 10086

(integer) 1

redis> GETBIT bit 100 # bit 預設被初始化為 0

實作細節

實作bloom filter的關鍵是hash函數，一般為了降低誤報率、減少hash碰撞的影響，會選擇多個hash函數。

那麼，怎麼寫一個hash函數呢？

不要方，我們要的hash是 input: String --> output: int , jdk裡面的String類不是恰好也有一個hashCode 方法嗎？翻出來看一看！

/**

* Returns a hash code for this string. The hash code for a
 * {@code String} object is computed as
 * <blockquote><pre>
 * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
 * </pre></blockquote>
 * using {@code int} arithmetic, where {@code s[i]} is the
 * <i>i</i>th character of the string, {@code n} is the length of
 * the string, and {@code ^} indicates exponentiation.
 * (The hash value of the empty string is zero.)
 *
 * @return  a hash code value for this object.
 */
public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;
        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

看到這一行h = 31 * h + val[i];，貌似原理其實也很簡單，每個

遊戲賣号

字元對應的ascii碼，經過一個公式計算依次加起來。這裡有個系數31, 稍微變一下，不就可以有多個hash函數了嗎。

以下是稍加修改後的hash函數：

//總的bitmap大小  64M
private static final int cap = 1 << 29;
/*
 * 不同哈希函數的種子，一般取質數
 * seeds數組共有8個值，則代表采用8種不同的哈希函數
 */
private int[] seeds = new int[]{3, 5, 7, 11, 13, 31, 37, 61};
private int hash(String value, int seed) {
    int result = 0;
    int length = value.length();
    for (int i = 0; i < length; i++) {
        result = seed * result + value.charAt(i);
    }
    return (cap - 1) & result;
}

剩下的事情便很簡單了，對每個詞典A中的單詞，依次調seeds 中對應的hash函數（這裡一共是8個），用redis的setbit操作，将下标值置為1.

redis代碼 (這裡用pipeline 包裝了下。)

@Service

public class RedisService {

@Autowired
private StringRedisTemplate template;
public void multiSetBit(String name, boolean value, long... offsets) {
    template.executePipelined((RedisCallback<Object>) connection -> {
        for (long offset : offsets) {
            connection.setBit(name.getBytes(), offset, value);
        }
        return null;
    });
}
public List<Boolean> multiGetBit(String name, long... offsets) {
    List<Object> results = template.executePipelined((RedisCallback<Object>) connection -> {
        for (long offset : offsets) {
            connection.getBit(name.getBytes(), offset);
        }
        return null;
    });
    List<Boolean> list = new ArrayList<>();
    results.forEach(obj -> {
        list.add((Boolean) obj);
    });
    return list;
}

}

最後，代碼串起來大概長這個樣子：

FileInputStream inputStream = new FileInputStream("/XXXX.csv");

BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream));

HashSet totalSet=new HashSet<>();

String word=null;

while((word = bufferedReader.readLine()) != null){

for (int seed : seeds) {
    int hash = hash(word, seed);
    totalSet.add((long) hash);
}
long[] offsets = new long[totalSet.size()];
int i=0;
for(Long l:totalSet){
    offsets[i++]=l;
}
redisService.multiSetBit("BLOOM_FILTER_WORDS_DICTIONARY", true, offsets);

查的時候也類似：

String word = "XXXX"; //實際輸入

long[] offsets = new long[seeds.length];

for (int i = 0; i < seeds.length; i++) {

int hash = hash(mobile, seeds[i]);
offsets[i] = hash;

List results = redisService.multiGetBit("BLOOM_FILTER_WORDS_DICTIONARY", offsets);

//判斷是否都為true （則存在)

boolean isExisted=true;

for(Boolean result:results){

if(!result){
    isExisted=false;
    break;
}

注意事項

setbit的offset是用大小限制的，在0到 232（最大使用512M記憶體）之間，即0~4294967296之前，超過這個數會自動将offset轉化為0，是以使用的時候一定要注意。

用Redis快速實作BloomFilter！

繼續閱讀

27 Best Free Eclipse Plug-ins for Java Developer to be ProductiveCode Quality PluginsText Editor PluginsDependency ManagementVersion Control Integration PluginsFramework Development Continuous Integration Related PluginsOther Utility Plugins

Java String.format方法的簡單使用

neo4j之cypher使用文檔

Dijkstra--簡易版（最短路徑）

GitHub連夜封殺！這份阿裡 10W 字内部 Java 字面試手冊到底有多強？

spark/scala關于【資源檔案】加載方法概述外部檔案加載方案測試資源檔案打包入jar包中小結

mybatis_入門程式Mybatis入門

AOP程式設計_Android優雅權限架構(1)概念基礎，2021金三銀四前言正文大綱正文

sqlServer根據經緯查距離

Effective Java 8:通用程式設計

OOM三種類型

工廠模式-三種類型

【遞歸】高效率求2的n次幂

win10本地scala和spark安裝安裝scala安裝spark

scala (3) Function 和 Method

hdu7108哈希