天天看点

高频词汇提取的Java实现

本文为原创,如需转载,请注明作者和出处,谢谢!

    面对浩瀚的信息海洋,找到想要的资源有时真的是不容易。在大量文字中搜索高频词汇是信息搜索和数据压缩的共通课题。

这次智慧擂台请大家在一个比较庞大的英文文本中找出M个数量最多的短语(由N个单词组成)。统一处理相同的文本文件,该文本只包含英文单词、空格和回行符,比较谁的程序效率最高。

  程序输入:M,N,文本文件路径(M不超过20,N不超过8)

程序输出:高频短语及其数量清单

擂台规则:提交符合以上要求的可执行程序,语言招式不限,点到为止;

我们将在统一的环境下对每位选手的作品进行公平的测试,

比较出综合用时最少的程序。

源程序

import java.io.*;

import java.util.*;

class tt

{

    public String phrase;

    public int count;

}

public class searchphrase

    private static LinkedHashMap phrase = new LinkedHashMap();

    static tt[] max_phrase;

    private static Vector SeparateString(String s)

    {

        Vector v = new Vector();

        String temp = "";

        for (int i = 0; i < s.length(); i++)

        {

            if (s.charAt(i) != ' ')

            {

                temp += s.charAt(i);

            }

            else

                if (temp != "")

                    v.add(temp);

                temp = "";

        }

        if (temp != "")

            v.add(temp);

        return v;

    }

    private static void swap(int pos, int count, String phrase)

        int i;

        if (max_phrase[pos - 1].count < count)

            for (i = pos - 1; i > 0; i--)

                if (max_phrase[i - 1].count > max_phrase[i].count)

                    break;

            max_phrase[pos].count = max_phrase[i].count;

            max_phrase[pos].phrase = max_phrase[i].phrase;

            max_phrase[i].count = count;

            max_phrase[i].phrase = phrase;

    private static void adjust_max(int count, String phrase)

        int i, j;

        if (count <= max_phrase[max_phrase.length - 1].count)

            return;

        for (i = max_phrase.length - 1; i >= 0; i--)

            if (max_phrase[i].phrase.equals(phrase))

                max_phrase[i].count = count;

                if (i > 0)

                {

                    swap(i, count, phrase);

                }

                return;

        max_phrase[max_phrase.length - 1].count = count;

        max_phrase[max_phrase.length - 1].phrase = phrase;

        if (i > 0)

            swap(max_phrase.length - 1, count, phrase);

    private static void js(Vector v, int n)

        String s;

        for (int i = 0; i < v.size() - n + 1; i++)

            s = "";

            for (int j = i; j < i + n; j++)

                s += v.get(j) + " ";

            int count = 1;

            if (phrase.containsKey(s.hashCode()))

                count = Integer.parseInt(phrase.get(s.hashCode()).toString());

                count++;

            phrase.put(s.hashCode(), count);

            adjust_max(count, s);

    public static void main(String[] args)

        try

            long t;

            int m, n;

            String path;

            m = Integer.parseInt(args[0]);

            n = Integer.parseInt(args[1]);

            path = args[2];

            max_phrase = new tt[m];

            for (int i = 0; i < m; i++)

                max_phrase[i] = new tt();

                max_phrase[i].count = 0;

                max_phrase[i].phrase = "";

            t = (new java.util.Date()).getTime();

            java.io.FileReader fr = new java.io.FileReader(path);

            java.io.BufferedReader br = new BufferedReader(fr);

            String s;

            Vector v = null;

            while ((s = br.readLine()) != null)

                v = SeparateString(s);

                js(v, n);

                System.out.println(max_phrase[i].phrase);

                System.out.println(max_phrase[i].count);

                System.out.println();

            t = (new java.util.Date()).getTime() - t;

            System.out.print(t);

            System.out.println(" ms");

        catch (Exception e)

            System.out.println(e.getMessage());

测试结果1:m = 20 n = 8 

under games played won drawn lost goals for

71

tabulated under games played won drawn lost goals

70

games played won drawn lost goals for against

May Xinhua Following are the results from the

69

played won drawn lost goals for against and

59

won drawn lost goals for against and points

Jan Xinhua Following are the results from the

48

Chinas economic efficiency indicators of the sector of

39

The industrial statistics include all stateowned enterprises and

industrial statistics include all stateowned enterprises and the

statistics include all stateowned enterprises and the nonstateowned

include all stateowned enterprises and the nonstateowned ones

all stateowned enterprises and the nonstateowned ones with

stateowned enterprises and the nonstateowned ones with annual

enterprises and the nonstateowned ones with annual sales

and the nonstateowned ones with annual sales income

Xinhua Chinas economic efficiency indicators of the sector

the nonstateowned ones with annual sales income over

nonstateowned ones with annual sales income over million

up percent over the same period last year

35

13594 ms

测试结果2  m = 10 n = 5

Xinhua Following are the results

295

May Xinhua Following are the

209

Following are the results from

183

are the results from the

176

April Xinhua Following are the

141

Jan Xinhua Following are the

122

billion yuan billion US dollars

120

won drawn lost goals for

88

played won drawn lost goals

Dec Xinhua Following are the

87

12437 ms

以上源程序是采用的是最简单的方法,谁有更好,效率更高的方法,请跟贴!!