TF/IDF的理解和实现

TF/IDF是一种用于资讯检索与文本挖掘的常用加权技术。它是一种统计方法，用于评估某一个字词对于一份文件集或一个语料库的某一个文件的重要程度。字词的重要性会随着它在某份文件出现的频率的增加而增加，又会随着在语料库中包含它的文件的增多而减少。下面是对TF和IDF的详细理解：

TF是词频（Term Frequency）的简称，表示一个term与某个document的相关性。计算方式为：某个term在该文档出现的总次数/该文档所有term的总数。

IDF是反文档频率（Inverse Document Frequency）的简称，表示一个term占某个document主题权重的大小。主要是通过包含了该term的文档数量与总文档的数量来进行比较的。计算方式为：log（D/Dt）,D是Document的总数量，Dt是包含了该term的文档数量。

使用TF*IDF可以计算某个关键字在某篇文档中的重要性，根据重要性可以识别这篇文章的主要含义，就像是计算机读懂了该篇文章。本文要实现的一个功能就是计算某一目录下的文本文件中的字词的TF*IDF的值，也即对某一目录下的文本文件进行关键词索引的过程。相关思路如下：

1）扫描该目录下的所有文本文件，并将每一个文本文件的内容加入String对象的列表中

2）对String对象的列表的每个String对象进行分词（采用的是IKAnalyzer分词技术），将所分得的词加入一个String数组中

3）对每个String数组的词进行统计，从而计算出每个文档中的所分得词的TF值

4）再对每个文档中所分得词中寻找包含该词的文档数，并利用公式log（D/Dt）

5) 最后得到每个文档中所分得词的TF*IDF值

相关代码如下所示：

public class TermCountTest {
	//保存某一目录下的所有文件的数量
	private static List<String> fileList=new ArrayList<String>();
	private static HashMap<String,HashMap<String,Float>> tfOfFile=new HashMap<String,HashMap<String,Float>>();
	public static List<String> readDirs(String filePath) throws FileNotFoundException,IOException
	{
		File file=new File(filePath);
		if(!file.isDirectory())
		{
			System.out.println("请输入【文件夹名】!");
			System.out.println("filePath:"+file.getAbsolutePath());
		}else
		{
			String[] myFile=file.list();
			for(String filepath:myFile)
			{
				File readFile=new File(filePath+"\\"+filepath);
				if(!readFile.isDirectory())
				{
					fileList.add(readFile.getAbsolutePath());
				}else
					readDirs(readFile.getAbsolutePath());
			}
		}
		
		return fileList;
	}
	
	public static String readFiles(String file) throws FileNotFoundException,IOException
	{
		StringBuilder sb=new StringBuilder();
		BufferedReader br=new BufferedReader(new InputStreamReader(new FileInputStream(new File(file)),"gbk"));
		String line=br.readLine();
		while(line!=null)
		{
			sb.append(line).append("\r\n");
			line=br.readLine();
		}
		br.close();
		return sb.toString();
	}
	
	public static List<String> cutWord(String file) throws Exception
	{
		String text=readFiles(file);
		List<String> resultList=IKAnalyzerUnits.getAnalysisResult(text);
		return resultList;
	}
    	
	public static HashMap<String,Float> getTF(List<String> wordList)
	{
		HashMap<String,Float> tf=new HashMap<String,Float>();
		Iterator<String> myIterator=wordList.iterator();
		Float wordNum=(float) wordList.size();
		while(myIterator.hasNext())
		{
			String myWord=myIterator.next();
			Float wordFloat=tf.get(myWord);
			if(wordFloat==null)//说明不存在，则加入
			{
				tf.put(myWord,(float) (1.0/wordNum));
			}else{
				tf.put(myWord,(float)((1.0+wordFloat*wordNum)/wordNum));
			}	
		}
		return tf;
	}
	
	public static HashMap<String,HashMap<String,Float>> getAllTheTF(String dir) throws Exception
	{
		List<String> myFileList=readDirs(dir);
		for(String filepath:myFileList)
		{
			HashMap<String,Float> dict=getTF(cutWord(filepath));
			tfOfFile.put(filepath,dict);
		}
		return tfOfFile;
	}
	
	public static HashMap<String,Float> getIDF(String dir)
	{
		//公式IDF=log(|D|/|Dt|),|D|表示文档总数，|Dt|表示包含关键词t的文档数量
		HashMap<String,Float> idf=new HashMap<String,Float>();
		
		float Dt=0;
		float D=fileList.size();//文档总数
		
		for(int i=0;i<D;i++)
		{
			HashMap<String,Float> temp=tfOfFile.get(fileList.get(i));
			for(String word:temp.keySet())
			{
				Dt=1;
				for(int k=0;k<D;k++)
					{
						if(k!=i)
						{
							HashMap<String,Float> temp2=tfOfFile.get(fileList.get(k));
							if(temp2.keySet().contains(word))
								Dt++;
							continue;
						}
					}
				idf.put(word,(float) (Math.log(D/Dt)/Math.log(10)));
				}
			}
		   return idf;
	}
	
	public static HashMap<String,HashMap<String,Float>> getTfIdf(String dir) throws Exception
	{
		HashMap<String,HashMap<String,Float>> tf=getAllTheTF(dir);
		HashMap<String,Float> idf=getIDF(dir);
		for(String file:tf.keySet())
		{
			Map<String,Float> singleFile=tf.get(file);
			for(String word:singleFile.keySet())
				singleFile.put(word,singleFile.get(word)*idf.get(word));
		}
		return tf;
	}
	
	public static void main(String args[]) throws Exception
	{
		Map<String,HashMap<String,Float>> TfIdf=getTfIdf("d:/dir");
		Map<String,Float> rank=new TreeMap<String,Float>();
		System.out.println("请输入你要搜索文档的关键词：");
		Scanner myScanner=new Scanner(System.in);
		String keyword="";
		if(myScanner.hasNext())
		keyword=myScanner.next();
		myScanner.close();
		for(String file:TfIdf.keySet())
		{
			HashMap<String,Float> temp=TfIdf.get(file);
			if(temp.keySet().contains(keyword))
				rank.put(file, temp.get(keyword));
		}
		
		List arrayList=new ArrayList(rank.entrySet());
		Collections.sort(arrayList,new Comparator(){
			public int compare(Object o1,Object o2){
				Map.Entry obj1=(Map.Entry)o1;
				Map.Entry obj2=(Map.Entry)o2;
				return ((Float)obj2.getValue()).compareTo((Float)obj1.getValue());
			}
		});
		
		System.out.print(arrayList);
	}
}

TF/IDF的理解和实现

继续阅读

HDU 1067 HASH判重BFS

POJ-2046---Gap （bfs+hash)

hdu 1067(bfs+hash判重)Gap

hdu 5487 Difference of Languages BFS Difference of Languages

oracle 中不使用已有的索引解决办法

【Python】Qt国际化ts文件转excel文件（xml转excel）

对first_name创建唯一索引uniq_idx_firstname问题描述Sql语句

Unable to resolve dependency for ':app@debug/compileClasspath': Could not resolve com.android.suppo

Java网络编程-Socket编程初涉七（UDP协议，简易提供-搜索服务）

HDU 2533 N皇后问题（搜索）

几种常见的迭代器

记一次因MySQL编码问题导致的慢查询排查

Map集合中get不存在的key值，返回值问题

如何下载blob:https://www.bilibili.com/的视频

创业,不能兼职(69)---空间数据和属性数据

BZOJ3643 Phi的反函数（数论+搜索）