String Automata: Directed Acyclic Word Graph

2021-10-08 17:51:37

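trie -- suffix tree -- suffix automaton. Some typical use cases:

An AJAX search box that responds as the user types and shows a list of candidates.

Counting keyword occurrences in a search engine.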

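Suffix tree: every path from the root to a leaf spells out one suffix of the string.

From this simple description alone we can, at least conceptually, solve the following problems: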

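P: Decide whether a string o occurs in a string S.

A: If o occurs in S, then o must be a prefix of some suffix of S. Build the suffix tree of S and search for o the same way you would look up a word in a trie.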
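The lookup can be sketched with an uncompressed suffix trie, which is a simplification of a real suffix tree: it takes O(n^2) space instead of O(n), but the search is the same walk from the root. The names `TrieNode`, `suffix_trie_build`, `suffix_trie_contains` and `suffix_trie_free` are made up for this illustration, and the alphabet is assumed to be 'a'..'z'.

```c
#include <assert.h>
#include <stdlib.h>

/* One node of an uncompressed suffix trie over the letters 'a'..'z'. */
typedef struct TrieNode {
    struct TrieNode *next[26];
} TrieNode;

/* Insert every suffix of s. O(n^2) time and space, acceptable for a sketch. */
TrieNode *suffix_trie_build(const char *s)
{
    TrieNode *root = calloc(1, sizeof *root);
    assert(root != NULL); /* sketch: abort on allocation failure */
    for (size_t i = 0; s[i] != '\0'; ++i) {
        TrieNode *n = root;
        for (size_t j = i; s[j] != '\0'; ++j) {
            assert(s[j] >= 'a' && s[j] <= 'z');
            int c = s[j] - 'a';
            if (n->next[c] == NULL) {
                n->next[c] = calloc(1, sizeof *n);
                assert(n->next[c] != NULL);
            }
            n = n->next[c];
        }
    }
    return root;
}

/* o occurs in S exactly when o is a prefix of some suffix of S,
   i.e. when o can be spelled out by walking down from the root. */
int suffix_trie_contains(const TrieNode *root, const char *o)
{
    const TrieNode *n = root;
    for (; *o != '\0'; ++o) {
        if (*o < 'a' || *o > 'z') return 0;
        n = n->next[*o - 'a'];
        if (n == NULL) return 0;
    }
    return 1;
}

/* Release the whole trie. */
void suffix_trie_free(TrieNode *n)
{
    if (n == NULL) return;
    for (int c = 0; c < 26; ++c) suffix_trie_free(n->next[c]);
    free(n);
}
```

For S = "banana", `suffix_trie_contains(root, "nan")` succeeds because "nan" is a prefix of the suffix "nana", while "nab" fails at the third letter.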

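P: Count how many times a given string T occurs in a string S.

A: If T occurs twice in S, then S has two suffixes that start with T. Locate the node for T; the number of leaves below it is the number of occurrences.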

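P: Find the longest repeated substring of a string S.

A: Same idea. A substring is repeated when at least two leaves hang below its node, so find the deepest non-leaf node T (deepest measured in string length).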
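Both answers can be sketched on the same simplified structure. Instead of counting leaves, this uncompressed suffix trie stores in every node how many suffixes pass through it, which is the same number. All names here (`CountNode`, `count_trie_build`, `count_occurrences`, `longest_repeated_substring`, ...) are hypothetical, the alphabet is assumed to be 'a'..'z', and the space cost is O(n^2), so treat it as an illustration rather than a real suffix tree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct CountNode {
    struct CountNode *next[26];
    int count; /* number of suffixes of S that pass through this node */
} CountNode;

CountNode *count_trie_build(const char *s)
{
    CountNode *root = calloc(1, sizeof *root);
    assert(root != NULL); /* sketch: abort on allocation failure */
    for (size_t i = 0; s[i] != '\0'; ++i) {
        CountNode *n = root;
        for (size_t j = i; s[j] != '\0'; ++j) {
            assert(s[j] >= 'a' && s[j] <= 'z');
            int c = s[j] - 'a';
            if (n->next[c] == NULL) {
                n->next[c] = calloc(1, sizeof *n);
                assert(n->next[c] != NULL);
            }
            n = n->next[c];
            n->count++;
        }
    }
    return root;
}

void count_trie_free(CountNode *n)
{
    if (n == NULL) return;
    for (int c = 0; c < 26; ++c) count_trie_free(n->next[c]);
    free(n);
}

/* Occurrences of a non-empty t in S = number of suffixes that start with t. */
int count_occurrences(const CountNode *root, const char *t)
{
    const CountNode *n = root;
    for (; *t != '\0'; ++t) {
        if (*t < 'a' || *t > 'z') return 0;
        n = n->next[*t - 'a'];
        if (n == NULL) return 0;
    }
    return n->count;
}

/* Depth-first search restricted to nodes shared by at least two suffixes.
   cur holds the path to n; best receives the deepest path seen so far. */
void deepest_repeat(const CountNode *n, char *cur, size_t depth,
                    char *best, size_t *best_len)
{
    if (depth > *best_len) {
        memcpy(best, cur, depth);
        best[depth] = '\0';
        *best_len = depth;
    }
    for (int c = 0; c < 26; ++c) {
        if (n->next[c] != NULL && n->next[c]->count >= 2) {
            cur[depth] = (char)('a' + c);
            deepest_repeat(n->next[c], cur, depth + 1, best, best_len);
        }
    }
}

/* Writes the longest repeated substring of s into out, which must hold
   strlen(s) + 1 bytes, and returns its length (0 if nothing repeats). */
size_t longest_repeated_substring(const char *s, char *out)
{
    size_t best_len = 0;
    char *cur = malloc(strlen(s) + 1);
    assert(cur != NULL);
    CountNode *root = count_trie_build(s);
    out[0] = '\0';
    deepest_repeat(root, cur, 0, out, &best_len);
    count_trie_free(root);
    free(cur);
    return best_len;
}
```

Overlapping occurrences are counted: in "banana" the string "ana" starts at two positions, so its node is shared by two suffixes and it is also the longest repeated substring.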

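P: Find the longest common substring of two strings S1 and S2.

A: A generalized suffix tree stores all suffixes of _several_ strings. Append a distinct terminator to each string, insert S1# and S2$ into one generalized suffix tree, then proceed as above, but only consider nodes that have leaves from both strings below them.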

(A longest substring common to s1 and s2 will be the path-label of an internal node with the greatest string depth in the suffix tree which has leaves labelled with suffixes from both the strings.)
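A rough sketch of that idea, again on an uncompressed trie rather than a true generalized suffix tree: instead of the `#` and `$` terminators, each node carries a bit mask recording which of the two strings has a suffix passing through it, and the answer is the deepest node with both bits set. `GNode`, `gst_add_suffixes`, `gst_deepest_common` and `longest_common_substring` are names invented for this example; input is assumed to be lowercase 'a'..'z'.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct GNode {
    struct GNode *next[26];
    unsigned mask; /* bit 0: reached by a suffix of s1; bit 1: by a suffix of s2 */
} GNode;

/* Insert all suffixes of s and tag every visited node with bit. */
void gst_add_suffixes(GNode *root, const char *s, unsigned bit)
{
    for (size_t i = 0; s[i] != '\0'; ++i) {
        GNode *n = root;
        for (size_t j = i; s[j] != '\0'; ++j) {
            assert(s[j] >= 'a' && s[j] <= 'z');
            int c = s[j] - 'a';
            if (n->next[c] == NULL) {
                n->next[c] = calloc(1, sizeof *n);
                assert(n->next[c] != NULL); /* sketch: abort on failure */
            }
            n = n->next[c];
            n->mask |= bit;
        }
    }
}

void gst_free(GNode *n)
{
    if (n == NULL) return;
    for (int c = 0; c < 26; ++c) gst_free(n->next[c]);
    free(n);
}

/* Depth-first search over nodes reached from both strings (mask == 3). */
void gst_deepest_common(const GNode *n, char *cur, size_t depth,
                        char *best, size_t *best_len)
{
    if (depth > *best_len) {
        memcpy(best, cur, depth);
        best[depth] = '\0';
        *best_len = depth;
    }
    for (int c = 0; c < 26; ++c) {
        if (n->next[c] != NULL && n->next[c]->mask == 3u) {
            cur[depth] = (char)('a' + c);
            gst_deepest_common(n->next[c], cur, depth + 1, best, best_len);
        }
    }
}

/* Writes a longest common substring of s1 and s2 into out, which must hold
   strlen(s1) + 1 bytes, and returns its length (0 if none). */
size_t longest_common_substring(const char *s1, const char *s2, char *out)
{
    size_t best_len = 0;
    char *cur = malloc(strlen(s1) + 1);
    GNode *root = calloc(1, sizeof *root);
    assert(cur != NULL && root != NULL);
    gst_add_suffixes(root, s1, 1u);
    gst_add_suffixes(root, s2, 2u);
    out[0] = '\0';
    gst_deepest_common(root, cur, 0, out, &best_len);
    gst_free(root);
    free(cur);
    return best_len;
}
```

With s1 = "banana" and s2 = "ananas" the deepest node tagged by both strings spells "anana". When several common substrings share the maximum length, this sketch returns the alphabetically first one.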

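Suffix automaton: an auxiliary index structure that recognizes all substrings of a text.

The code below is a direct translation of Algorithm A in [1]: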

/*Directed Acyclic Word Graph

*/
#include <stdlib.h>
#include <string.h>

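typedef struct State{
  /*first[c]: primary edge on letter 'a'+c, second[c]: secondary edge on the
    same letter. A state has at most one outgoing edge per letter.*/
  struct State *first[26], *second[26];
  struct State *suffix; /*suffix pointer; unused (NULL) for the source*/
}State;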

State *sink, *source;

State *new_state(void)
{
  State *s = malloc(sizeof *s);
  if(s){
    memset(s, 0, sizeof *s);
  }
  return s; 
}

/*state:
 parent -- [x] with xa = tail(wa)
 child  -- [tail(wa)]
 new child -- [tail(wa)]_{wa}
*/
State *split(State *parent, int a)
{
  int i;
  /*current state, child, new child*/
  State *cs, *c = parent->second[a], *nc = new_state(); //S1
  /*S2: the secondary edge parent -> child becomes a primary edge parent -> new child*/
  parent->first[a] = nc;
  parent->second[a] = NULL;
  for(i = 0; i < 26; ++i){
    /*S3: every outgoing edge of child, primary or secondary,
      is copied to new child as a secondary edge*/
    nc->second[i] = c->first[i] ? c->first[i] : c->second[i];
  }
  nc->suffix = c->suffix; //S4
  c->suffix = nc; //S5

  for(cs = parent; cs != source; ){//S6,7
    cs = cs->suffix; //S7.a
    /*every edge into child carries the label a, so only that slot is checked*/
    if(cs->second[a] == c)cs->second[a] = nc; //S7.b
    else break; //S7.c
  }
  return nc; //S8
}

/*state:
 new sink -- [wa] 
*/
void update(int a)
{
  /*suffix state, current state, new sink*/
  State *ss = NULL, *cs = sink, *ns = new_state(); //U1,2 
  sink->first[a] = ns;

  while(cs != source && ss == NULL){//U3
    cs = cs->suffix; //U3.a 
    if(!cs->first[a] && !cs->second[a]){
      cs->second[a] = ns; //U3.b.1
    }else if(cs->first[a]){
      ss = cs->first[a]; //U3.b.2 
    }else if(cs->second[a]){
      ss = split(cs, a); //U3.b.3
    }
  }

  if(ss == NULL){ss = source;} //U4
  ns->suffix = ss; sink = ns; //U5
}

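/*Builds the DAWG of w into the globals source and sink.
  Returns 0 on success, -1 if the source state cannot be allocated
  or w contains a character outside 'a'..'z'.*/
int build_dawg(const char *w)
{
  sink = source = new_state();
  if(!source)return -1;
  for(; *w; ++w){
    if(*w < 'a' || *w > 'z')return -1;
    update(*w-'a');
  }
  return 0;
}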

I am still working on understanding the algorithm, and this code has not been tested.
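One way to try the construction out is to add a lookup that follows edges from the source, since the DAWG should accept exactly the substrings of the text. The sketch below restates the builder compactly, following the steps of Algorithm A as I read them, so that it compiles on its own. `dawg_contains` is a hypothetical helper that is not part of [1], and allocation failures simply abort through `assert`.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct State {
    struct State *first[26], *second[26]; /* primary / secondary edges */
    struct State *suffix;                 /* suffix pointer */
} State;

State *source, *sink;

State *new_state(void)
{
    State *s = calloc(1, sizeof *s);
    assert(s != NULL); /* sketch: abort on allocation failure */
    return s;
}

State *split(State *parent, int a)
{
    State *c = parent->second[a], *nc = new_state(), *cs;
    parent->first[a] = nc;    /* secondary edge becomes primary */
    parent->second[a] = NULL;
    for (int i = 0; i < 26; ++i) /* copy all edges of c as secondary edges */
        nc->second[i] = c->first[i] ? c->first[i] : c->second[i];
    nc->suffix = c->suffix;
    c->suffix = nc;
    for (cs = parent; cs != source; ) {
        cs = cs->suffix;
        if (cs->second[a] == c) cs->second[a] = nc;
        else break;
    }
    return nc;
}

void update(int a)
{
    State *ss = NULL, *cs = sink, *ns = new_state();
    sink->first[a] = ns;
    while (cs != source && ss == NULL) {
        cs = cs->suffix;
        if (cs->first[a]) ss = cs->first[a];
        else if (cs->second[a]) ss = split(cs, a);
        else cs->second[a] = ns;
    }
    ns->suffix = ss ? ss : source;
    sink = ns;
}

void build_dawg(const char *w)
{
    source = sink = new_state();
    for (; *w; ++w) {
        assert(*w >= 'a' && *w <= 'z');
        update(*w - 'a');
    }
}

/* Returns 1 if p labels a path starting at the source, which should be
   the case exactly when p is a substring of the indexed text. */
int dawg_contains(const char *p)
{
    const State *s = source;
    for (; *p; ++p) {
        if (*p < 'a' || *p > 'z') return 0;
        int a = *p - 'a';
        s = s->first[a] ? s->first[a] : s->second[a];
        if (s == NULL) return 0;
    }
    return 1;
}
```

After `build_dawg("abcbc")`, tracing the steps by hand gives eight states; "bcb" and "cbc" are accepted while "cbcb" and "bb" are rejected. Comparing `dawg_contains` against `strstr` for every short string over a small alphabet would be a reasonable next check.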

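[1] A. Blumer, J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, J. Seiferas, "The smallest automaton recognizing the subwords of a text", Theoretical Computer Science 40 (1985).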

https://cbse.soe.ucsc.edu/sites/default/files/smallest_automaton1985.pdf
