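This semester I entered the Service Outsourcing Competition. The task is to analyze and process unstructured data, so I am recording the process here step by step.
First, using a Python crawler framework, I scraped Chinese text from web pages.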
![](https://img.laitimes.com/img/__Qf2AjLwojIjJCLyojI0JCLiAzNvwVZ2x2bzNXak9CX90TQNNkRrFlQKBTSvwFbslmZvwFMwQzLcVmepNHdu9mZvwFVywUNMZTY18CX052bm9CX9M2MYVDdykFbKJDTwYVbiVHNHpleO1GTulzRilWO5x0LcRHelR3LcJzLctmch1mclRXY39jNxkTN0cjMwEjNxMDM4EDMy8CX0Vmbu4GZzNmLn9Gbi1yZtl2Lc9CX6MHc0RHaiojIsJye.jpg)
However, I did not really know how to process Chinese text, so after a lot of trial and error I simply split the data with Java's substring:
package se;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStreamReader;

public class sdf {
    public static void main(String[] args) {
        try {
            String pathname = "info.txt";
            File filename = new File(pathname);
            InputStreamReader reader = new InputStreamReader(
                    new FileInputStream(filename));
            BufferedReader br = new BufferedReader(reader);
            File writename = new File("output1.txt"); // relative path; the file is created if it does not exist
            writename.createNewFile(); // create the new file
            BufferedWriter out = new BufferedWriter(new FileWriter(writename));

            // Copy the first line through unchanged, followed by a spacer line.
            String line = br.readLine();
            if (line != null) {
                out.write(line);
                out.write("\r\n");
                out.write(" \r\n");
            }

            // Read one line at a time and stop at end of file.
            while ((line = br.readLine()) != null) {
                // Find the index of the last ':' in the line.
                int b = 0;
                for (int i = 0; i < line.length(); i++) {
                    if (line.substring(i, i + 1).equalsIgnoreCase(":"))
                        b = i;
                }
                // Skip lines without a usable ':' (none found, at index 0, or at the very end).
                if (b == 0 || b == line.length() - 1) continue;

                // Text before the colon.
                System.out.print(line.substring(0, b));
                out.write(line.substring(0, b));
                // Pad with spaces so the values line up (no padding once the key exceeds 20 characters).
                for (int i = 0; i <= 20 - b; i++) {
                    System.out.print(" ");
                    out.write(" ");
                }
                // Text after the colon.
                System.out.print(line.substring(b + 1, line.length()));
                out.write(line.substring(b + 1, line.length()));
                out.write("\r\n");
                System.out.print("\n");
                out.flush();
            }
            br.close();
            out.close(); // remember to close the file at the end
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
The data is then split into columns: because the fields are separated by spaces, the output file can be imported into Excel.
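
As a rough sketch of the same splitting step, the loop above could be condensed with String.lastIndexOf. Everything named here (the class ColonSplitSketch, the helpers splitAtLastColon and toTabSeparated, and the sample line) is made up for illustration and is not part of the original program. It also swaps the space padding for a tab separator, which is a different technique from the one used above, on the assumption that a tab-delimited file is easier to open in Excel when a field itself contains spaces.

```java
import java.util.Optional;

public class ColonSplitSketch {

    // Hypothetical helper: split a line at its LAST ':' and return {key, value}.
    // Returns empty when there is no colon, or the colon is the first or last
    // character, which mirrors the lines the original loop skips.
    public static Optional<String[]> splitAtLastColon(String line) {
        int b = line.lastIndexOf(':');
        if (b <= 0 || b == line.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(new String[] { line.substring(0, b), line.substring(b + 1) });
    }

    // Hypothetical helper: join key and value with a tab instead of space padding.
    // Returns an empty string for lines that cannot be split.
    public static String toTabSeparated(String line) {
        return splitAtLastColon(line)
                .map(parts -> parts[0] + "\t" + parts[1])
                .orElse("");
    }

    public static void main(String[] args) {
        // Made-up sample line, not taken from the scraped data.
        System.out.println(toTabSeparated("name:Alice"));
    }
}
```

Splitting at the last colon keeps any earlier colons inside the key, which matches what the original loop does by overwriting b on every match.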