1 什麼是Trie樹
1.1 Trie樹的概念
Trie樹,即字典樹,又稱單詞查找樹或鍵樹,是一種樹形結構,典型應用是用于統計,排序和儲存大量的字元串(但不僅限于字元串),是以經常被搜尋引擎系統用于文本詞頻統計。它的優點是:利用字元串的公共字首來減少查詢時間,最大限度地減少無謂的字元串比較,查詢效率比哈希樹高。
Trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. It is one of those data-structures that can be easily implemented.
1.2 Trie樹優點
最大限度地減少無謂的字元串比較,查詢效率比較高。核心思想是空間換時間,利用字元串的公共字首來降低查詢時間的開銷以達到提高效率的目的。
- 插入、查找的時間複雜度均為O(N),其中N為字元串長度。
- 空間複雜度是26^n級别的,非常龐大(可采用雙數組實作改善)。
1.3 Trie樹的三個基本性質
- 根節點不包含字元,除根節點外每一個節點都隻包含一個字元
- 從根節點到某一節點,路徑上經過的字元連接配接起來,為該節點對應的字元串
- 每個節點的所有子節點包含的字元都不相同
2 Trie樹資料結構
以字元串”hi”和”經海路”的資料為例:
Java的資料結構定義:
@Data
public class TrieTreeNode {
private Character data;
private Map<Character, TrieTreeNode> children;
private boolean isEnd;
// 字首,備援資訊,可選
private String prefix;
// 字尾,備援資訊,可選
private String suffix;
}
複制代碼
如果隻是處理26個英文字元,data可以通過children數組是否為空來判斷。如果處理程式,預設children為空來判斷是否為最後一個節點,則isEnd字段可以省略。
字首prefix和suffix可以在資料處理的時候,友善拿到目前節點字首和字尾資訊,如果不需要可以去除。
3 Trie樹在髒話過濾中的應用
3.1 髒話關鍵詞Keyword定義
public class KeyWord implements Serializable {
/**
* 關鍵詞内容
*/
private String word;
//其他省略
}
複制代碼
3.2 關鍵詞查詢器
public class KWSeeker {
/**
* 所有的關鍵詞
*/
private Set<KeyWord> sensitiveWords;
/**
* 關鍵詞樹
*/
private Map<String, Map> wordsTree = new ConcurrentHashMap<String, Map>();
/**
* 最短的關鍵詞長度。用于對短于這個長度的文本不處理的判斷,以節省一定的效率
*/
private int wordLeastLen = 0;
//其他處理方法,省略
}
複制代碼
3.3 字元串構造一棵樹
/**
* 将指定的詞構造到一棵樹中。
*
* @param tree 構造出來的樹
* @param word 指定的詞
* @param KeyWord 對應的詞
* @return
*/
public static Map<String, Map> makeTreeByWord(Map<String, Map> tree, String word,
KeyWord KeyWord) {
if (StringUtils.isEmpty(word)) {
tree.putAll(EndTagUtil.buind(KeyWord));
return tree;
}
String next = word.substring(0, 1);
Map<String, Map> nextTree = tree.get(next);
if (nextTree == null) {
nextTree = new HashMap<String, Map>();
}
// 遞歸構造樹結構
tree.put(next, makeTreeByWord(nextTree, word.substring(1), KeyWord));
return tree;
}
複制代碼
對關鍵詞字元串,逐個字元進行建構。
3.4 詞庫樹的生成
/**
* 構造、生成詞庫樹。并傳回所有敏感詞中最短的詞的長度。
*
* @param sensitiveWords 詞庫
* @param wordsTree 聚合詞庫的樹
* @return 傳回所有敏感詞中最短的詞的長度。
*/
public int generalTree(Set<KeyWord> sensitiveWords, Map<String, Map> wordsTree) {
if (sensitiveWords == null || sensitiveWords.isEmpty() || wordsTree == null) {
return 0;
}
wordsTreeTmp.clear();
int len = 0;
for (KeyWord kw : sensitiveWords) {
if (len == 0) {
len = kw.getWordLength();
} else if (kw.getWordLength() < len) {
len = kw.getWordLength();
}
AnalysisUtil
.makeTreeByWord(wordsTreeTmp, StringUtils.trimToEmpty(kw.getWord()), kw);
}
wordsTree.clear();
wordsTree.putAll(wordsTreeTmp);
return len;
}
複制代碼
3.5 關鍵詞提取
/**
* 将文本中的關鍵詞提取出來。
*/
public List<SensitiveWordResult> process(Map<String, Map> wordsTree, String text,
AbstractFragment fragment, int minLen) {
// 詞的前面一個字
String pre = null;
// 詞比對的開始位置
int startPosition = 0;
// 傳回結果
List<SensitiveWordResult> rs = new ArrayList<SensitiveWordResult>();
while (true) {
try {
if (wordsTree == null || wordsTree.isEmpty() || StringUtils.isEmpty(text)) {
return rs;
}
if (text.length() < minLen) {
return rs;
}
String chr = text.substring(0, 1);
text = text.substring(1);
Map<String, Map> nextWord = wordsTree.get(chr);
// 沒有對應的下一個字,表示這不是關鍵詞的開頭,進行下一個循環
if (nextWord == null) {
pre = chr;
continue;
}
List<KeyWord> keywords = Lists.newArrayList();
KeyWord kw = AnalysisUtil.getSensitiveWord(chr, pre, nextWord, text, keywords);
if (keywords == null || keywords.size() == 0) {
// 沒有比對到完整關鍵字,下一個循環
pre = chr;
continue;
}
for (KeyWord tmp : keywords) {
// 同一個word多次出現記錄在一起
SensitiveWordResult result = new SensitiveWordResult(startPosition, tmp.getWord());
int index = rs.indexOf(result);
if (index > -1) {
rs.get(index).addPosition(startPosition, tmp.getWord());
} else {
rs.add(result);
}
}
// 從text中去除目前已經比對的内容,進行下一個循環比對
// 這行注釋了,避免"中國人",導緻"國人"。搜尋不出來,逐個字元周遊
// text = text.substring(kw.getWordLength() - 1);
pre = kw.getWord().substring(kw.getWordLength() - 1, kw.getWordLength());
continue;
} finally {
if (pre != null) {
startPosition = startPosition + pre.length();
}
}
}
}
/**
* 查詢文本開頭的詞是否在詞庫樹中,如果在,則傳回對應的詞,如果不在,則傳回null。return 傳回找到的最長關鍵詞
*
* @param append 追加的詞
* @param pre 詞的前一個字,如果為空,則表示前面沒有内容
* @param nextWordsTree 下一層樹
* @param text 剩餘的文本内容
* @param keywords 傳回的keywords,可能多個
* @return 傳回找到的最長關鍵詞
*/
public static KeyWord getSensitiveWord(String append, String pre,
Map<String, Map> nextWordsTree, String text, List<KeyWord> keywords) {
if (nextWordsTree == null || nextWordsTree.isEmpty()) {
return null;
}
Map<String, Object> endTag = nextWordsTree.get(EndTagUtil.TREE_END_TAG);
// 原始文本已到末尾
if (StringUtils.isEmpty(text)) {
// 如果有結束符,則表示比對成功,沒有,則傳回null
if (endTag != null) {
keywords.add(checkPattern(getKeyWord(append, endTag), pre, null));
return checkPattern(getKeyWord(append, endTag), pre, null);
} else {
return null;
}
}
String next = text.substring(0, 1);
String suffix = text.substring(0, 1);
Map<String, Map> nextTree = nextWordsTree.get(next);
// 沒有遇到endTag,繼續比對
if (endTag == null) {
if (nextTree != null && nextTree.size() > 0) {
// 沒有結束标志,則表示關鍵詞沒有結束,繼續往下走。
return getSensitiveWord(append + next, pre, nextTree, text.substring(1), keywords);
}
// 如果沒有下一個比對的字,表示比對結束!
return null;
} else { // endTag , 添加關鍵字
KeyWord tmp = getKeyWord(append, endTag);
keywords.add(checkPattern(tmp, pre, suffix));
}
// 有下一個比對的詞則繼續比對,一直取到最大的比對關鍵字
KeyWord tmp = null;
if (nextTree != null && nextTree.size() > 0) {
// 如果大于0,則表示還有更長的詞,繼續往下找
tmp = getSensitiveWord(append + next, pre, nextTree, text.substring(1), keywords);
if (tmp == null) {
// 沒有更長的詞,則就傳回這個詞。在傳回之前,先判斷它是模糊的,還是精确的
tmp = getKeyWord(append, endTag);
}
return checkPattern(tmp, pre, suffix);
}
// 沒有往下的詞了,傳回該關鍵詞。
return checkPattern(getKeyWord(append, endTag), pre, suffix);
}
複制代碼
思路是對某個字元串text,逐個字元ch,擷取ch對應的詞庫樹的children,然後擷取比對到的單個或多個結果,将比對到的關鍵詞在text中的開始和結束下标進行記錄,如後續需要html标記,或者字元替換可直接使用。如果未能在詞庫樹中找到對應的ch的children,或者詞庫樹的children未能比對到去除ch的子字元串,則繼續循環。具體可再詳細讀一下代碼。
4 Radix Tree的應用
4.1 RAX - Redis Tree
Redis實作了不定長壓縮字首的radix tree,用在叢集模式下存儲slot對應的的所有key資訊。
/* Representation of a radix tree as implemented in this file, that contains
* the strings "foo", "foobar" and "footer" after the insertion of each
* word. When the node represents a key inside the radix tree, we write it
* between [], otherwise it is written between ().
*
* This is the vanilla representation:
*
* (f) ""
* \
* (o) "f"
* \
* (o) "fo"
* \
* [t b] "foo"
* / \
* "foot" (e) (a) "foob"
* / \
* "foote" (r) (r) "fooba"
* / \
* "footer" [] [] "foobar"
*
* However, this implementation implements a very common optimization where
* successive nodes having a single child are "compressed" into the node
* itself as a string of characters, each representing a next-level child,
* and only the link to the node representing the last character node is
* provided inside the representation. So the above representation is turned
* into:
*
* ["foo"] ""
* |
* [t b] "foo"
* / \
* "foot" ("er") ("ar") "foob"
* / \
* "footer" [] [] "foobar"
*
* However this optimization makes the implementation a bit more complex.
* For instance if a key "first" is added in the above radix tree, a
* "node splitting" operation is needed, since the "foo" prefix is no longer
* composed of nodes having a single child one after the other. This is the
* above tree and the resulting node splitting after this event happens:
*
*
* (f) ""
* /
* (i o) "f"
* / \
* "firs" ("rst") (o) "fo"
* / \
* "first" [] [t b] "foo"
* / \
* "foot" ("er") ("ar") "foob"
* / \
* "footer" [] [] "foobar"
*
* Similarly after deletion, if a new chain of nodes having a single child
* is created (the chain must also not include nodes that represent keys),
* it must be compressed back into a single node.
*
*/
#define RAX_NODE_MAX_SIZE ((1<<29)-1)
typedef struct raxNode {
uint32_t iskey:1; /* Does this node contain a key? */
uint32_t isnull:1; /* Associated value is NULL (don't store it). */
uint32_t iscompr:1; /* Node is compressed. */
uint32_t size:29; /* Number of children, or compressed string len. */
unsigned char data[];
} raxNode;
typedef struct rax {
raxNode *head;
uint64_t numele;
uint64_t numnodes;
} rax;
typedef struct raxStack {
void **stack; /* Points to static_items or an heap allocated array. */
size_t items, maxitems; /* Number of items contained and total space. */
void *static_items[RAX_STACK_STATIC_ITEMS];
int oom; /* True if pushing into this stack failed for OOM at some point. */
} raxStack;
複制代碼
如Redis源碼中的注釋所寫,RAX進行了一些優化,并不會将一個字元串直接按照每個字元進行樹的建構,而是在Insert有沖突時節點分割處理,在Delete時如果子節點和父節點都隻有一個,則需要進行合并操作。
對于RAX有興趣的同學,可以看一下rax.h、rax.c的相關代碼。
4.2 Linux核心
Linux radix樹最廣泛的用途是用于記憶體管理,結構address_space通過radix樹跟蹤綁定到位址映射上的核心頁,該radix樹允許記憶體管理代碼快速查找辨別為dirty或writeback的頁。Linux radix樹的API函數在lib/radix-tree.c中實作。
Linux基數樹(radix tree)是将指針與long整數鍵值相關聯的機制,它存儲有效率,并且可快速查詢,用于指針與整數值的映射(如:IDR機制)、記憶體管理等。
struct radix_tree_node {
unsigned int path;
unsigned int count;
union {
struct {
struct radix_tree_node *parent;
void *private_data;
};
struct rcu_head rcu_head;
};
/* For tree user */
struct list_head private_list;
void __rcu *slots[RADIX_TREE_MAP_SIZE];
unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};
複制代碼
關于Linux核心使用Radix Tree的具體代碼,有興趣的同學可以繼續深入。
5 總結
Trie樹在單詞搜尋、統計、排序等領域有大量的應用。文章從基礎概念到具體的髒話過濾的應用、Redis的RAX和Linux核心的Radix Tree對Trie樹做了介紹。資料結構和算法是程式高性能的基礎,本文抛磚引玉,希望大家對Trie樹有所了解,并在未來開發過程實踐和應用Trie樹解決中類似情景的問題。