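Handling erroneous and inconsistent data (using the fuzzywuzzy functions)

Another challenge, following the ones from the past few days (simple data processing).

First, this is a record of what I am learning. Second, I am sharing it for others to refer to.

This post is about handling inconsistent data, where the same value appears in different forms or was entered incorrectly.

In brief: the data we work with sometimes contains inconsistent capitalization or misspelled words. To get a more accurate dataset, we need to clean up these anomalous values.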

import pandas as pd        # import the required modules
import fuzzywuzzy
from fuzzywuzzy import process, fuzz
import chardet

with open("d:/challeng data.csv","rb") as f:
	result = chardet.detect(f.read(100000))
print(result)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}        # the usual check of the file's encoding (covered in the previous post)

suicide_attacks = pd.read_csv("d:/challeng data.csv",encoding="Windows-1252")

Reading the file with the correct encoding is the first step.
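To illustrate why the encoding matters, here is a minimal sketch using only the standard library. The sample string is hypothetical and is not taken from the dataset; it only shows what happens when Windows-1252 bytes are decoded with the wrong codec.

```python
# Hypothetical sample text; any string with non-ASCII characters would do.
text = "café in Zürich"
raw = text.encode("cp1252")  # bytes as they would be stored in a Windows-1252 file

# Decoding with the wrong codec fails here (other codecs may silently garble the text instead).
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the codec that was actually used recovers the original text.
decoded = raw.decode("cp1252")
print(utf8_ok, decoded)
```

This is the same kind of failure pd.read_csv can hit when the encoding argument does not match the file, which is why the encoding is checked with chardet first.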

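cities = suicide_attacks['City'].unique()    # get the unique values of the City column
cities.sort()        # sort them so the contents are easier to inspect
print(cities)

'D. I Khan' 'D.G Khan' 'D.G Khan '        # part of the output: the names are similar, but differ in a letter, a space, or a trailing space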

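suicide_attacks['City'] = suicide_attacks['City'].str.lower()    # use lower to convert the city names to lowercase
suicide_attacks['City'] = suicide_attacks['City'].str.strip()    # use strip to remove leading and trailing whitespace
cities = suicide_attacks['City'].unique()    # recompute the unique city names after cleaning
cities.sort()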

array(['attock', 'bajaur agency', 'bannu', 'bhakkar', 'buner', 'chakwal',
       'chaman', 'charsadda', 'd. i khan', 'd.g khan', 'd.i khan',        # this line shows how those names have changed
       'dara adam khel', 'fateh jang', 'ghallanai, mohmand agency',
       'gujrat', 'hangu', 'haripur', 'hayatabad', 'islamabad',
       'jacobabad', 'karachi', 'karak', 'khanewal', 'khuzdar',
       'khyber agency', 'kohat', 'kuram agency', 'kurram agency',
       'lahore', 'lakki marwat', 'lasbela', 'lower dir', 'malakand',
       'mansehra', 'mardan', 'mohmand agency',
       'mosal kor, mohmand agency', 'multan', 'muzaffarabad',
       'north waziristan', 'nowshehra', 'orakzai agency', 'peshawar',
       'pishin', 'poonch', 'quetta', 'rawalpindi', 'sargodha',
       'sehwan town', 'shabqadar-charsadda', 'shangla', 'shikarpur',
       'sialkot', 'south waziristan', 'sudhanoti', 'sukkur', 'swabi',
       'swat', 'taftan', 'tangi, charsadda district', 'tank', 'taunsa',
       'tirah valley', 'totalai', 'upper dir', 'wagah', 'zhob'],
      dtype=object)
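The effect of the two cleaning steps can be seen on a tiny scale with plain Python. This is only a sketch: the list below is a hand-picked sample modeled on the names in the output above, not the real column.

```python
# Hand-picked sample modeled on the city names shown above (not the real data).
raw_names = ["D. I Khan", "D.G Khan", "D.G Khan ", "D.I Khan"]

# Lowercase and strip each name, then keep the distinct results in sorted order.
cleaned = sorted({name.lower().strip() for name in raw_names})
print(cleaned)
```

Lowercasing and stripping merge 'D.G Khan' and 'D.G Khan ' into a single value, but 'd. i khan' and 'd.i khan' still differ by an inner space. Exact string cleanup cannot merge those, which is where fuzzy matching comes in.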

Next, we run a fuzzy match for the name "d.i khan":

matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

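Arguments: (string to match, collection of candidate strings, limit = maximum number of results, scorer = scoring method)

This gives us the fuzzy matching results: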

[('d. i khan', 100),
 ('d.i khan', 100),
 ('d.g khan', 88),
 ('khanewal', 50),
 ('sudhanoti', 47),
 ('hangu', 46),
 ('kohat', 46),
 ('dara adam khel', 45),
 ('chaman', 43),
 ('mardan', 43)]

The result is a list of (name, similarity score) tuples, sorted from the most similar to the least similar.
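To get a feel for where these scores come from, here is a rough stand-in for the token_sort_ratio scorer built on difflib from the standard library. It is an approximation under my own assumptions (lowercase, replace punctuation with spaces, sort the tokens, compare), not fuzzywuzzy's actual code, and approx_token_sort_ratio is a hypothetical helper name. Scores may differ slightly from the library's, especially if it is using a different string-comparison backend.

```python
import re
from difflib import SequenceMatcher

def approx_token_sort_ratio(a, b):
    """Rough stand-in for fuzz.token_sort_ratio (hypothetical helper, not fuzzywuzzy's code)."""
    def normalize(s):
        # Lowercase, turn every run of non-alphanumeric characters into a space,
        # then sort the whitespace-separated tokens alphabetically.
        tokens = re.sub(r"[^a-z0-9]+", " ", s.lower()).split()
        return " ".join(sorted(tokens))
    # SequenceMatcher.ratio() is in [0, 1]; scale it to a 0-100 integer score.
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())

print(approx_token_sort_ratio("d.i khan", "d. i khan"))
print(approx_token_sort_ratio("d.i khan", "d.g khan"))
```

Because punctuation and spacing are normalized away before comparing, "d.i khan" and "d. i khan" come out as identical, while "d.g khan" differs by one letter and scores lower. That is consistent with the ranking shown above.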

Next, we write a helper function that fuzzy-matches these values and replaces them:

def replace_matches(df, column, string_to_match, min_ratio=90):
	strings = df[column].unique()    # unique values in the given column ('City' here)
	matches = fuzzywuzzy.process.extract(string_to_match, strings, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)    # top 10 fuzzy matches
	close_matches = [match[0] for match in matches if match[1] >= min_ratio]    # keep only names scoring at least min_ratio (90 by default)
	rows_with_matches = df[column].isin(close_matches)        # boolean mask of the rows that hold one of the close matches
	df.loc[rows_with_matches, column] = string_to_match        # use loc to select those rows in the column and overwrite them with string_to_match (modifies df in place)

replace_matches(df=suicide_attacks,column='City',string_to_match='d.i khan')

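After the function call above has run, the values in the original data that are similar to "d.i khan" have been replaced with "d.i khan".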
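The filtering and replacement steps inside the function can be traced by hand without pandas or fuzzywuzzy. In this sketch the scores are copied from the match list shown earlier, and city_column is a hypothetical stand-in for the real City column.

```python
# Scores copied from the first few entries of the match list shown earlier in this post.
matches = [("d. i khan", 100), ("d.i khan", 100), ("d.g khan", 88), ("khanewal", 50)]
min_ratio = 90

# Same filter as in replace_matches: keep names whose score reaches the threshold.
close_matches = [name for name, score in matches if score >= min_ratio]

# Hypothetical column values standing in for df['City'].
city_column = ["d. i khan", "d.g khan", "d.i khan", "khanewal"]
replaced = ["d.i khan" if city in close_matches else city for city in city_column]

print(close_matches)
print(replaced)
```

With the default threshold of 90, 'd.g khan' (score 88) is left untouched, which is what we want because it is a different name rather than a misspelling. Lowering min_ratio to 88 or below would merge it as well, so the threshold is worth checking against the match list before replacing anything.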

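This post only gives a brief introduction to fuzzy matching with the fuzzywuzzy module and does not cover its usage in detail. If you need more, you can look it up on Google or Baidu.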
