Java library for extracting keywords from input text

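Here is a possible solution using Apache Lucene. I did not use the latest version but version 3.6.2, since it is the one I know best. Besides /lucene-core-x.x.x.jar, do not forget to add /contrib/analyzers/common/lucene-analyzers-x.x.x.jar from the downloaded archive to your project: it contains the language-specific analyzers (in your case, the English one in particular).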

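Note that this will _only_ find the frequencies of the input text's words, grouped by their respective stems. Comparing these frequencies with English-language statistics is a separate step, to be done afterwards.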

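Data model

One keyword per stem. Different words may share the same stem, hence the `terms` set. The keyword's frequency is incremented every time a term is added, even if that term has already been found (a set automatically removes duplicates).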

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class Keyword implements Comparable<Keyword> {

        private final String stem;
        // distinct original words that share this stem
        private final Set<String> terms = new HashSet<>();
        // number of occurrences, incremented on every add() call
        private int frequency = 0;

        public Keyword(String stem) {
            this.stem = stem;
        }

        public void add(String term) {
            terms.add(term);
            frequency++;
        }

        @Override
        public int compareTo(Keyword o) {
            // descending order of frequency
            return Integer.compare(o.frequency, frequency);
        }

        @Override
        public boolean equals(Object obj) {
            // two keywords are equal when they have the same stem
            if (this == obj) {
                return true;
            } else if (!(obj instanceof Keyword)) {
                return false;
            } else {
                return stem.equals(((Keyword) obj).stem);
            }
        }

        @Override
        public int hashCode() {
            return Arrays.hashCode(new Object[] { stem });
        }

        public String getStem() {
            return stem;
        }

        public Set<String> getTerms() {
            return terms;
        }

        public int getFrequency() {
            return frequency;
        }

    }
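To make the difference between the frequency and the `terms` set concrete, here is a minimal sketch in plain Java (no Lucene needed). The class and method names `FrequencyVsTermsDemo` and `count` are made up for this illustration; they only mimic what `Keyword.add` does for a single stem.

```java
import java.util.HashSet;
import java.util.Set;

class FrequencyVsTermsDemo {

    /** Counts every occurrence but stores each distinct word once; returns {frequency, distinct terms}. */
    static int[] count(String... words) {
        Set<String> terms = new HashSet<>();
        int frequency = 0;
        for (String word : words) {
            terms.add(word);   // no effect if the word is already in the set
            frequency++;       // incremented on every occurrence
        }
        return new int[] { frequency, terms.size() };
    }

    public static void main(String[] args) {
        int[] result = count("compiled", "compiler", "compiled", "compilers", "compiler");
        System.out.println("frequency = " + result[0]);       // frequency = 5
        System.out.println("distinct terms = " + result[1]);  // distinct terms = 3
    }
}
```

Five occurrences spread over three distinct words give a frequency of 5 and a set of 3 terms.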

Utilities

To stem a word:

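    // imports as located in Lucene 3.6.x (packages differ in later versions)
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public static String stem(String term) throws IOException {
        TokenStream tokenStream = null;
        try {
            // tokenize
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term));
            // stem
            tokenStream = new PorterStemFilter(tokenStream);

            // add each token to a set, so that duplicates are removed
            Set<String> stems = new HashSet<>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                stems.add(token.toString());
            }

            // if no stem or 2+ stems have been found, return null
            if (stems.size() != 1) {
                return null;
            }
            String stem = stems.iterator().next();
            // if the stem contains anything other than letters, digits or dashes, return null
            if (!stem.matches("[a-zA-Z0-9-]+")) {
                return null;
            }

            return stem;

        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }
    }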

To search in a collection (this will be used by the list of potential keywords):

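    import java.util.Collection;

    // returns the element equal to example if the collection has one;
    // otherwise adds example to the collection and returns it
    public static <T> T find(Collection<T> collection, T example) {
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        collection.add(example);
        return example;
    }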
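The find-or-add behaviour of this helper is easy to miss, so here is a small self-contained sketch of it. The wrapper class `FindDemo` is a name invented for this example, and the helper is repeated only so that the snippet compiles on its own.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

class FindDemo {

    // Same find-or-add helper as above, repeated so the sketch is self-contained.
    static <T> T find(Collection<T> collection, T example) {
        for (T element : collection) {
            if (element.equals(example)) {
                return element;
            }
        }
        collection.add(example);
        return example;
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        // new String(...) forces distinct instances, to show which one is returned
        String first = find(seen, new String("java"));   // not present yet: added and returned
        String second = find(seen, new String("java"));  // equal element exists: the stored one is returned
        System.out.println(first == second);  // true
        System.out.println(seen.size());      // 1
    }
}
```

Each lookup is a linear scan, which is fine for short texts. For large inputs, a `HashMap` keyed by stem would be one possible alternative; that is a suggestion, not part of the original approach.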

Core

Here is the main method, which takes the input text:

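    // imports as located in Lucene 3.6.x (packages differ in later versions)
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.Collections;
    import java.util.LinkedList;
    import java.util.List;
    import org.apache.lucene.analysis.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.standard.ClassicFilter;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.util.Version;

    public static List<Keyword> guessFromString(String input) throws IOException {
        TokenStream tokenStream = null;
        try {
            // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific"):
            // the "0" inserted after each dash is removed again below
            input = input.replaceAll("-+", "-0");
            // replace any punctuation char but apostrophes and dashes by a space
            input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
            // strip the most common English contraction suffixes ('t, 'd, 's, 'm, 've, 're, 'll)
            input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");

            // tokenize input
            tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input));
            // to lowercase
            tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream);
            // remove dots from acronyms (and "'s" but already done manually above)
            tokenStream = new ClassicFilter(tokenStream);
            // convert any char to ASCII
            tokenStream = new ASCIIFoldingFilter(tokenStream);
            // remove english stop words
            tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet());

            List<Keyword> keywords = new LinkedList<>();
            CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class);
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                String term = token.toString();
                // stem each term
                String stem = stem(term);
                if (stem != null) {
                    // create the keyword or get the existing one if any
                    Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-")));
                    // add its corresponding initial token
                    keyword.add(term.replaceAll("-0", "-"));
                }
            }

            // reverse sort by frequency
            Collections.sort(keywords);

            return keywords;

        } finally {
            if (tokenStream != null) {
                tokenStream.close();
            }
        }
    }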
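The three `replaceAll` calls at the top of this method can be tried on their own, without Lucene. The sketch below copies them into a helper; the names `PreprocessDemo` and `preprocess` are invented for this illustration. The "-0" trick relies on the documented behaviour of `ClassicTokenizer`, which splits words at hyphens unless the token contains a number.

```java
class PreprocessDemo {

    /** Applies the same three regex replacements as the start of guessFromString. */
    static String preprocess(String input) {
        // protect dashed words: "non-specific" becomes "non-0specific"
        input = input.replaceAll("-+", "-0");
        // replace any punctuation char but apostrophes and dashes by a space
        input = input.replaceAll("[\\p{Punct}&&[^'-]]+", " ");
        // strip common English contraction suffixes such as 's, 't, 'll
        input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\\b", "");
        return input;
    }

    public static void main(String[] args) {
        String cleaned = preprocess("It's a non-specific compiler.");
        System.out.println("[" + cleaned + "]");  // [It a non-0specific compiler ]
        // after tokenizing and stemming, the marker is removed again
        System.out.println("non-0specific".replaceAll("-0", "-"));  // non-specific
    }
}
```

The trailing space in the first result comes from the final period being replaced by a space; the tokenizer ignores it afterwards.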

Using the guessFromString method on the introduction of the Java Wikipedia article, here are the 10 most frequent keywords (i.e. stems) that were found:

    java         x12    [java]
    compil       x5     [compiled, compiler, compilers]
    sun          x5     [sun]
    develop      x4     [developed, developers]
    languag      x3     [languages, language]
    implement    x3     [implementation, implementations]
    applic       x3     [application, applications]
    run          x3     [run]
    origin       x3     [originally, original]
    gnu          x3     [gnu]

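Iterate over the output list to find out the original words found for each stem, by getting the `terms` sets (shown between brackets `[...]` in the example above).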
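As a rough sketch of that iteration, the helper below builds one line in a "stem xN [terms]" shape similar to the listing above. `KeywordLineDemo` and `formatLine` are hypothetical names; the helper takes the three values that `getStem()`, `getFrequency()` and `getTerms()` would return, so that it runs without Lucene.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

class KeywordLineDemo {

    /** Formats one result line as "stem xN [term1, term2]". */
    static String formatLine(String stem, int frequency, Set<String> terms) {
        // a Set's toString() already renders as "[a, b, c]"
        return stem + " x" + frequency + " " + terms;
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order so the printed line is deterministic;
        // the HashSet used by Keyword gives no ordering guarantee
        Set<String> terms = new LinkedHashSet<>(Arrays.asList("developed", "developers"));
        System.out.println(formatLine("develop", 4, terms));  // develop x4 [developed, developers]
    }
}
```

With the real classes, the loop would be along the lines of `for (Keyword k : guessFromString(text)) System.out.println(formatLine(k.getStem(), k.getFrequency(), k.getTerms()));`.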

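What's next

Compare the ratio of each stem's frequency to the sum of all frequencies with the corresponding ratios from English-language statistics. Keep me in the loop if you manage it: I could be quite interested too :)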
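One way this comparison could look is sketched below. It is only an assumption about how to carry out that step: the names `RelativeFrequencyDemo`, `relativeFrequencies` and `overRepresentation` are invented, and the reference shares in `main` are placeholder numbers, not real English statistics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RelativeFrequencyDemo {

    /** Share of each stem in the text: its frequency divided by the sum of all frequencies. */
    static Map<String, Double> relativeFrequencies(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        Map<String, Double> shares = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            shares.put(entry.getKey(), total == 0 ? 0.0 : (double) entry.getValue() / total);
        }
        return shares;
    }

    /** How many times more frequent a stem is in the text than in the reference statistics. */
    static double overRepresentation(double textShare, double referenceShare) {
        if (referenceShare <= 0) {
            throw new IllegalArgumentException("referenceShare must be positive");
        }
        return textShare / referenceShare;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("java", 12);
        counts.put("run", 4);

        // Placeholder values for illustration only, NOT real English statistics.
        Map<String, Double> reference = new LinkedHashMap<>();
        reference.put("java", 0.001);
        reference.put("run", 0.01);

        for (Map.Entry<String, Double> entry : relativeFrequencies(counts).entrySet()) {
            double ratio = overRepresentation(entry.getValue(), reference.get(entry.getKey()));
            System.out.println(entry.getKey() + ": " + ratio + " times the reference share");
        }
    }
}
```

Stems with a high ratio are over-represented in the text compared with the reference and are therefore better keyword candidates than stems that are merely common in English.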

2020-09-23