NLTK article summarization code example

2023-04-23 16:13:58

import sys
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer


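# Extract an article summary
# Once we have the no_of_nouns and no_of_ners counts for each sentence, we can use them to build more sophisticated rules.
# For example, a typical news report opens with the key details of the topic, and its last sentence sums up the whole story.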
# Read the article; the with-block closes the file once it has been read
with open('nyt.txt', 'r') as f:
    news_contents = f.read()
result = []
# Sentence tokenization
for sent_no, sentence in enumerate(nltk.sent_tokenize(news_contents)):
    tokens = nltk.word_tokenize(sentence)  # word tokenization, done once and reused below
    no_tokens_of = len(tokens)
    tagged = nltk.pos_tag(tokens)  # part-of-speech tagging
    # Count the nouns; only the singular tags NN and NNP are counted here
    no_of_nouns = len([word for word, pos in tagged if pos in ['NN', 'NNP']])
    ners = nltk.ne_chunk(tagged, binary=False)  # named-entity recognition
    # Named entities come back as subtrees with a label(); plain tokens are tuples
    no_of_ners = len([chunk for chunk in ners if hasattr(chunk, 'label')])
    # Share of the sentence's tokens that are nouns or named entities
    score = (no_of_ners + no_of_nouns) / float(no_tokens_of)
    result.append((sent_no, no_tokens_of, no_of_ners, no_of_nouns, score, sentence))


# Print the sentences from the highest score to the lowest
for sent in sorted(result, key=lambda x: x[4], reverse=True):
    print(sent[5])

print(result)
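
The positional rule mentioned earlier (a news report opens with the key details and closes with a summary) could be layered on top of these scores. The following is a minimal sketch, assuming the scores are available as (sent_no, score, sentence) tuples in document order. The function name `apply_position_boost` and both boost factors are hypothetical and were chosen only for illustration.

```python
def apply_position_boost(scored, lead_boost=1.5, last_boost=1.2):
    """Return new (sent_no, score, sentence) tuples with positional boosts.

    scored: list of (sent_no, score, sentence) tuples.
    lead_boost / last_boost: illustrative multipliers (assumptions, not tuned).
    """
    if not scored:
        return []
    last_no = max(sent_no for sent_no, _, _ in scored)
    boosted = []
    for sent_no, score, sentence in scored:
        if sent_no == 0:
            score *= lead_boost   # the opening sentence carries the key details
        elif sent_no == last_no:
            score *= last_boost   # the closing sentence tends to summarize
        boosted.append((sent_no, score, sentence))
    return boosted


# Made-up demo scores, not taken from any real article
demo = [
    (0, 0.40, "Lead sentence."),
    (1, 0.50, "Middle sentence."),
    (2, 0.45, "Closing sentence."),
]
ranked = sorted(apply_position_boost(demo), key=lambda item: item[1], reverse=True)
for sent_no, score, sentence in ranked:
    print(sent_no, round(score, 2), sentence)
```

With the tuples in `result` above, the same function could be fed `[(r[0], r[4], r[5]) for r in result]`.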


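# Another idea behind this kind of summarization is that important sentences usually contain important words, and across the whole corpus the most discriminative words tend to be the important ones.
# Sentences that contain highly discriminative words are therefore important as well. A very simple measure is to compute the TF-IDF (term frequency-inverse document frequency) score of each word,
# then take the average score of each sentence, normalized by the number of important words it contains. This average can then serve as the criterion for selecting summary sentences.
# TF-IDF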
# from sklearn.feature_extraction.text import TfidfVectorizer
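results = []
sentences = nltk.sent_tokenize(news_contents)
# min_df=1 keeps every term, just like the original min_df=0, which newer scikit-learn versions may reject
vectize = TfidfVectorizer(norm='l2', min_df=1, use_idf=True, smooth_idf=False, sublinear_tf=True)
sklearn_binary = vectize.fit_transform(sentences)  # one row of TF-IDF weights per sentence
# print(vectize.get_feature_names_out())
print(sklearn_binary.toarray())
for i in sklearn_binary.toarray():
    nonzero_count = len(i.nonzero()[0])
    # Average TF-IDF weight over the terms present in the sentence; 0.0 if no term was kept
    results.append(i.sum() / float(nonzero_count) if nonzero_count else 0.0)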
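
The averages in `results` line up one-to-one with `sentences`, so one way to turn them into a summary is to keep the few highest-scoring sentences and show them in their original order. This is a minimal sketch under that assumption. The helper `top_sentences` and the cutoff `n` are hypothetical, and the demo values are made up.

```python
def top_sentences(sentences, scores, n=3):
    """Pick the n highest-scoring sentences and return them in document order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Indices sorted from the highest score to the lowest
    ranked = sorted(range(len(sentences)), key=lambda idx: scores[idx], reverse=True)
    # Restore document order so the summary reads naturally
    keep = sorted(ranked[:n])
    return [sentences[idx] for idx in keep]


# Made-up demo data standing in for `sentences` and `results`
demo_sentences = ["First.", "Second.", "Third.", "Fourth."]
demo_scores = [0.1, 0.9, 0.3, 0.8]
summary = top_sentences(demo_sentences, demo_scores, n=2)
print(summary)
```

In the script above, the call would be `top_sentences(sentences, results, n=3)`.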


