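Previous post: Web Scraping Practice: Data Cleaning with Pandas

This post works with job listings scraped from 51Job for the Dongguan area, using Java as the search keyword.
The data has the fields salary, company, time, job_name, and address.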

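Goal

The modest goal of this round of data wrangling is to pull out the salary data and convert it to one uniform format, so that later statistical analysis and visualization are easier.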

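Approach

First, let's see how messy the data is.

[Figure: sample of the raw salary data]

Besides the usual "X thousand per month" (千/月), there are also "X ten-thousand per month" (萬/月) and "X ten-thousand per year" (萬/年).

No "XX or above" (XX以上) entries showed up here, but they still need to be accounted for.

Given this format, the salary can be split into two columns at the '-' separator and then handled case by case. The part before the separator gives the lower bound of the salary, and the part after it gives the upper bound.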
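As a minimal sketch of the split described above (the helper name `split_range` and the sample strings are made up for illustration, not taken from the scraped data), the string can be cut at the first '-' while also covering the case where no '-' is present:

```python
def split_range(salary):
    """Split 'low-high<unit>' at the first '-' into (low, high_with_unit)."""
    low, sep, high = salary.partition('-')
    if not sep:
        # No '-' found, so the string carries no explicit range
        return salary, None
    return low, high

# Illustrative inputs in the three formats discussed above
samples = ['6-8千/月', '1-1.5萬/月', '10-20萬/年']
pairs = [split_range(s) for s in samples]
print(pairs)
```

The upper part still carries its unit at this point; stripping the unit and converting the numbers is a separate step.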

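Code

Extracting the lower bound

There are three cases to handle (strictly four, but the XX千/年 form does not appear in this data):

XX千/月, XX萬/月, XX萬/年

The approach:

Determine which case applies: XX千/月, XX萬/月, or XX萬/年
Find the position of '-'
Convert 萬/月 and 萬/年 values to thousands per month
Obtain the lower bound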
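The conversion step can be illustrated with a small helper. This is only a sketch: `to_k_per_month` is a hypothetical name, the unit strings follow the three forms listed above, and the target unit is assumed to be thousands per month (K).

```python
def to_k_per_month(amount, unit):
    """Convert an amount in the given unit to thousands per month."""
    if unit == '千/月':
        # Already in thousands per month
        return amount
    if unit == '萬/月':
        # 1萬 = 10千
        return amount * 10
    if unit == '萬/年':
        # 1萬 = 10千, spread over 12 months
        return amount * 10 / 12
    raise ValueError('unexpected unit: ' + unit)

print(to_k_per_month(6.0, '千/月'))   # 6.0
print(to_k_per_month(1.5, '萬/月'))   # 15.0
print(to_k_per_month(12.0, '萬/年'))  # 10.0
```

Multiplying by 10 and dividing by 12 is the same as dividing by 1.2, which is the shortcut used for the yearly case.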

Entries with no upper bound can be handled with an additional check.

The function is as follows:

# coding=utf-8
def cut_word(word):
    # Index of the '-' that separates the lower and upper bounds
    postion = word.find('-')
    if word.find('萬') == -1:
        # XX千/月: already in thousands per month
        bottomSalary = str(float(word[:postion]))
    elif word.find('年') == -1:
        # XX萬/月: 1萬 = 10千, so multiply by 10
        bottomSalary = str(float(word[:postion]) * 10)
    else:
        # XX萬/年: multiply by 10 and divide by 12 months, i.e. divide by 1.2
        bottomSalary = str(float(word[:postion]) / 1.2)
    return bottomSalary

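Extracting the upper bound

Extracting the upper bound follows the same idea as extracting the lower bound, with small changes to the code. There is one pitfall with Chinese text: when the string is a UTF-8 byte string (as in Python 2), each Chinese character takes 3 bytes, so a suffix such as '萬/年' is 7 bytes long, and slicing it off by length means subtracting 7, not 3. In Python 3, where str is indexed by character, the same suffix has length 3.
Here the two methods are merged into one function, and a parameter selects whether the lower or the upper bound is returned.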
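The difference between character length and byte length is easy to check directly. The snippet below is a small illustration for Python 3; the sample string is made up, and locating the unit character with `find` is shown as one way to avoid hard-coding either length.

```python
suffix = '萬/年'
char_len = len(suffix)                  # length in characters (Python 3 str)
byte_len = len(suffix.encode('utf-8'))  # length in bytes after UTF-8 encoding
print(char_len, byte_len)  # 3 7

# Slicing up to the unit character works without any fixed offset
word = '10-20萬/年'
upper = word[word.find('-') + 1:word.find('萬')]
print(upper)  # 20
```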

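Values such as 0.X also occur, so code like `bottomSalary = word[:(postion)] + '0.0'`, which appends text instead of multiplying the number, produces the following:

[Figure: incorrect example]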
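To make the failure concrete, here is a small sketch with made-up values. Appending '0.0' happens to work for a whole number such as '1', but for '0.8' it yields a string that is not a number at all, whereas converting to float first and multiplying by 10 gives the intended result.

```python
whole = '1' + '0.0'
print(whole)  # 10.0 (looks right, but only by coincidence)

low = '0.8'
concatenated = low + '0.0'
print(concatenated)  # 0.80.0

try:
    float(concatenated)
    parsed_ok = True
except ValueError:
    # '0.80.0' cannot be parsed as a number
    parsed_ok = False

converted = float(low) * 10
print(parsed_ok, converted)  # False 8.0
```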

def cut_word(word, method):
    # Index of the '-' that separates the lower and upper bounds
    postion = word.find('-')
    if method == 'bottom':
        if word.find('萬') == -1:
            # XX千/月: already in thousands per month
            bottomSalary = str(float(word[:postion]))
        elif word.find('年') == -1:
            # XX萬/月: 1萬 = 10千, so multiply by 10
            bottomSalary = str(float(word[:postion]) * 10)
        else:
            # XX萬/年: multiply by 10 and divide by 12 months, i.e. divide by 1.2
            bottomSalary = str(float(word[:postion]) / 1.2)
        return bottomSalary
    if method == 'top':
        # The upper bound ends where the unit character begins. Locating the
        # unit with find() avoids subtracting a fixed suffix length, which
        # differs between byte strings and Unicode strings.
        if word.find('萬') == -1:
            # XX千/月
            topSalary = str(float(word[(postion+1):word.find('千')]))
        elif word.find('年') == -1:
            # XX萬/月
            topSalary = str(float(word[(postion+1):word.find('萬')]) * 10)
        else:
            # XX萬/年
            topSalary = str(float(word[(postion+1):word.find('萬')]) / 1.2)
        return topSalary
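As an alternative sketch, not the approach used in this post, the same parsing can be written with a regular expression that captures both numbers and the unit in one pass. The name `parse_salary` and the pattern are assumptions made for illustration; strings that do not match the three known formats return None instead of raising.

```python
import re

# low-high, then 千 or 萬, then per month (月) or per year (年)
PATTERN = re.compile(r'^(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(千|萬)/(月|年)$')

def parse_salary(word):
    """Return (low, high) in thousands per month, or None if unparseable."""
    m = PATTERN.match(word)
    if m is None:
        return None
    low, high = float(m.group(1)), float(m.group(2))
    factor = 10 if m.group(3) == '萬' else 1   # 1萬 = 10千
    months = 12 if m.group(4) == '年' else 1   # yearly figures span 12 months
    return low * factor / months, high * factor / months

print(parse_salary('6-8千/月'))    # (6.0, 8.0)
print(parse_salary('1-1.5萬/月'))  # (10.0, 15.0)
print(parse_salary('12-24萬/年'))  # (10.0, 20.0)
```

Returning None for unmatched input makes unexpected formats, such as entries without an upper bound, easy to spot afterwards.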

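With the function written, verify the result.

This uses the apply method from pandas, which calls a custom function on every value in a column (here, salary). Extra keyword arguments such as method are passed through to the function.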

# Add the lower-bound and upper-bound columns
df_clean['bottomSalary'] = df_clean.salary.apply(cut_word, method='bottom')
df_clean['topSalary'] = df_clean.salary.apply(cut_word, method='top')
# Select the salary, bottomSalary, and topSalary columns
df_clean[['salary', 'bottomSalary', 'topSalary']]

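Displaying only the salary-related columns shows that the result matches expectations (the last two columns are in K, i.e. thousands per month).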
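The way apply forwards keyword arguments can be seen on a tiny frame. This is a sketch with made-up data; `pick_bound` and the column names `low_text` and `high_text` are hypothetical and only stand in for a function with an extra parameter.

```python
import pandas as pd

def pick_bound(word, which):
    """Hypothetical helper: return the text before or after the first '-'."""
    low, _, high = word.partition('-')
    return low if which == 'bottom' else high

df = pd.DataFrame({'salary': ['6-8千/月', '1-1.5萬/月']})
# The keyword argument 'which' is forwarded to pick_bound for every value
df['low_text'] = df.salary.apply(pick_bound, which='bottom')
df['high_text'] = df.salary.apply(pick_bound, which='top')
print(df)
```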

Computing the average salary

df_clean['bottomSalary'] = df_clean['bottomSalary'].astype('float')
df_clean['topSalary'] = df_clean['topSalary'].astype('float')
df_clean['avgSalary'] = df_clean.apply(lambda x : (x.bottomSalary + x.topSalary) / 2, axis = 1)
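The row-wise lambda works; as an alternative sketch, the same average can be computed on whole columns with mean(axis=1), which avoids calling a Python function per row. The small frame below uses made-up values and reuses the column names from this post.

```python
import pandas as pd

df = pd.DataFrame({
    'bottomSalary': ['6.0', '10.0'],
    'topSalary': ['8.0', '15.0'],
})
# Convert the string columns to floats in one step
df = df.astype({'bottomSalary': float, 'topSalary': float})
# Average the two bounds across each row
df['avgSalary'] = df[['bottomSalary', 'topSalary']].mean(axis=1)
print(df['avgSalary'].tolist())  # [7.0, 12.5]
```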

References

Zhihu: 用pandas進行資料分析實戰 (Hands-on data analysis with pandas)
https://zhuanlan.zhihu.com/p/27784143
