A Python wrapper script for file-level sentence splitting and word segmentation

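When using nlpir for word segmentation, two questions need to be settled:

1. How to split the text into sentences and paragraphs

2. How to represent the segmentation result

The sentence-splitting scripts I found online all had problems, so I ended up writing my own. It is fairly simple, but covering every case still takes careful thought. The result is annotated as an XML file with a three-level structure: Article, Paragraph, Sentence, with the segmented words stored under each sentence. The code is commented, so read it for the details. The script has been through several rounds of testing and should handle most ordinary text files. If you run into a problem, feedback is welcome.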
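The splitting rule itself can be tried in isolation, without nlpir. The following is a minimal Python 3 sketch of that rule, not the script's own code: `split_sentences` is a hypothetical helper name, and it assumes the same three terminators (。！？) that the script keeps in `cutlist`.

```python
# Python 3 sketch; split_sentences is a hypothetical helper, not part of the script.
CUTLIST = '。！？'  # the same sentence terminators the script keeps in cutlist


def split_sentences(text, cutlist=CUTLIST):
    """Cut text after every terminator, keeping the terminator with its sentence."""
    sentences = []
    buf = ''
    for ch in text:
        buf += ch
        if ch in cutlist:
            sentences.append(buf)
            buf = ''
    # Trailing text without a terminator is kept as a final sentence
    if buf.strip():
        sentences.append(buf)
    return sentences


if __name__ == '__main__':
    for s in split_sentences('今天天气很好。我们去公园吧！你来吗'):
        print(s)
```

Keeping the terminator attached to its sentence means the original text can be rebuilt by joining the pieces.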

The parsed output looks like this:

<?xml version="1.0" encoding="utf-8"?>

</Sentence>

</Sentence>
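To make the layout concrete, here is a small Python 3 sketch that builds one paragraph with one sentence in the same shape the script produces. The element and attribute names (`Article`, `Paragraph`, `Sentence`, `word`, `id`, `context`, `pos`) are taken from the script below; the sample sentence, the words and the part-of-speech tags are made up for illustration and are not real nlpir output.

```python
from xml.dom import minidom


def build_sample():
    """Build a one-sentence document in the Article/Paragraph/Sentence/word layout."""
    doc = minidom.Document()
    article = doc.createElement('Article')
    doc.appendChild(article)

    paragraph = doc.createElement('Paragraph')
    paragraph.setAttribute('id', '0')
    article.appendChild(paragraph)

    sentence = doc.createElement('Sentence')
    sentence.setAttribute('id', '0')
    sentence.setAttribute('context', '我爱北京。')
    paragraph.appendChild(sentence)

    # Hypothetical (word, pos) pairs; in the script these come from the segmenter
    for text, pos in [('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('。', 'w')]:
        word = doc.createElement('word')
        word.setAttribute('context', text)
        word.setAttribute('pos', pos)
        sentence.appendChild(word)
    return doc


if __name__ == '__main__':
    print(build_sample().toprettyxml(indent=' '))
```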

The code is as follows:

# -*- coding: utf8 -*-
__author__ = 'luoshaowei<[email protected]>'
import nlpir
import os
from xml.dom import minidom

cutlist = '。！？'.decode('utf8')
# Add the root node
def AddRoot(doc):
	doc.appendChild(doc.createComment('分词结果'.decode('utf8')))
	article=doc.createElement('Article')
	doc.appendChild(article)
# Add a paragraph node
def AddParagraph(doc,id):
	parentnode=doc.documentElement
	node=doc.createElement('Paragraph')
	node.setAttribute('id',str(id))
	parentnode.appendChild(node)
# Write a sentence and its segmented words into a node
def AddSentence(doc,parentnode,id,context,dividelist):
	snode=doc.createElement('Sentence')
	snode.setAttribute('id',str(id))
	snode.setAttribute('context',context)
	for word in dividelist:
		wordnode=doc.createElement('word')
		wordnode.setAttribute('context',word[0].decode('utf-8'))
		wordnode.setAttribute('pos',word[1])
		snode.appendChild(wordnode)

	parentnode.appendChild(snode)
# Look up the paragraph node with the given id
def GetParageaphbyid(doc,id):
	pnode=''
	for node in doc.getElementsByTagName('Paragraph'):
		if node.getAttribute('id')==str(id):
			pnode=node
			break
	return pnode
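# Check whether a line ends a paragraph: it must end with a sentence terminator followed by a newline
def IsParagraphEnd(line):
	# The length check guards against an IndexError on lines shorter than two characters
	return len(line)>=2 and line[-1]=='\n' and FindToken(cutlist,line[-2])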


# Check whether a character is a sentence terminator; return True if so, otherwise False
def FindToken(cutlist, char):
    return char in cutlist

# Split one file into sentences, given the source and destination files
def divide_sentence(sourcefile,destfile):
	fps = open(sourcefile)
	fpd = open(destfile, 'w')
	xmldoc=minidom.Document()
	AddRoot(xmldoc)
	paragraphid=0
	sentenceid=0
	linenum=0
	sentencelist = []
	tempsentence = ''
	isparaend=False
	isarticleend=False
	try:
		lines=fps.readlines()
		for line in lines:
			linenum+=1
			if(linenum==len(lines)):
				isarticleend=True
			line=line.decode('gbk')
			# Skip blank lines
			if(len(line)<=1):
				continue
			# Check whether this line ends a paragraph or the article;
			# if so, add a paragraph node and set the paragraph-end flag
			if(IsParagraphEnd(line) or isarticleend):
				AddParagraph(xmldoc,paragraphid)
				isparaend=True
			# Strip the trailing newline and any leading whitespace from the line
			line=line.strip('\n')
			line=line.lstrip()			
			for word in line:
				tempsentence=tempsentence+word
				# On a sentence terminator, move the buffered sentence into the sentence list
				if (FindToken(cutlist, word)):
					sentencelist.append(tempsentence)
					tempsentence = ''
			# On the last line of the file, flush a non-empty sentence buffer into the sentence list
			if(isarticleend and tempsentence!=''):
				sentencelist.append(tempsentence)
			# If the sentence list is non-empty and a paragraph or the file has ended, fill the paragraph node
			if(sentencelist!=[] and (isparaend or isarticleend)):
				paranode=GetParageaphbyid(xmldoc,paragraphid)
				for sen in sentencelist:
					wordlist=nlpir.seg(sen.encode('utf-8'))
					AddSentence(xmldoc,paranode,sentenceid,sen,wordlist)
					sentenceid+=1
				sentencelist = []
				isparaend=False
				paragraphid+=1
			
	finally:
		xmldoc.writexml(fpd, addindent=' ', newl='\n', encoding='utf-8')
		fps.close()
		fpd.close()
	return 0

sourcefile = 'E:\\Project\\Python\\Ictclas_test\\test.txt'
destfile = 'E:\\Project\\Python\\Ictclas_test\\test.html'

divide_sentence(sourcefile,destfile)
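Once the file has been written, the result can be read back with the same `minidom` module. The following Python 3 sketch is only an illustration: `read_words` is a hypothetical helper, and `SAMPLE` is a hand-written document in the script's layout with made-up words and tags, not real output.

```python
from xml.dom import minidom

# Hand-written sample in the script's layout; words and tags are made up
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<Article><Paragraph id="0">'
    '<Sentence id="0" context="我爱北京。">'
    '<word context="我" pos="r"/><word context="爱" pos="v"/>'
    '<word context="北京" pos="ns"/><word context="。" pos="w"/>'
    '</Sentence></Paragraph></Article>'
)


def read_words(xml_bytes):
    """Return a list of (sentence text, [(word, pos), ...]) from the XML."""
    doc = minidom.parseString(xml_bytes)
    result = []
    for sentence in doc.getElementsByTagName('Sentence'):
        pairs = [(w.getAttribute('context'), w.getAttribute('pos'))
                 for w in sentence.getElementsByTagName('word')]
        result.append((sentence.getAttribute('context'), pairs))
    return result


if __name__ == '__main__':
    for text, pairs in read_words(SAMPLE.encode('utf-8')):
        print(text, pairs)
```

For a file on disk, `minidom.parse(path)` could be used in place of `parseString`.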
