第5章 缺失資料
介紹
很少沒有任何缺失值的資料集。 有許多缺失資料的表示。 在資料庫中是NULL值,一些程式設計語言使用NA。缺失值可以是空字元串:''或者甚至是數值88或99等。Pandas顯示缺失值為NaN。
本章将涵蓋:
- 什麼是缺失值
- 如何建立缺失值
- 如何重新編碼并使用缺失值進行計算
可以從numpy中獲得NaN值,在Python中看到缺失值使用幾種方式顯示:NaN,NAN或nan,他們都是相等的。
NaN不等于0或空字元串''。
In [1]: from numpy import NaN, NAN, nan
In [2]: print(NaN == True, NaN == False, NaN == 0, NaN == '', sep='|')
False|False|False|False
In [3]: print(NaN == NaN, NaN == nan, NaN == NAN, nan == NAN, sep='|')
False|False|False|False
In [4]: import pandas as pd
In [5]: print(pd.isnull(NaN), pd.isnull(nan), pd.isnull(NAN), sep='|')
True|True|True
In [6]: print(pd.notnull(NaN), pd.notnull(99), pd.notnull("https://china-testing.github.io"), sep='|')
False|True|True
缺失值的來源
來自加載資料或資料處理
- 加載資料
當我們加載資料時,pandas會自動找到該缺少資料的單元格,并填充NaN值。在read_csv函數中,參數na_values, keep_default_na, na_filter用于處理缺失值。比如:na_values=[99]。na_filter設定為False,在讀大檔案時會提升性能。
5-1.py
import pandas as pd
visited_file = 'data/survey_visited.csv'
print(pd.read_csv(visited_file))
print(pd.read_csv(visited_file, keep_default_na=False))
print(pd.read_csv(visited_file, na_values=[''], keep_default_na=False))
執行結果
$ python3 5-1.py
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
- 合并資料
import pandas as pd
visited = pd.read_csv('data/survey_visited.csv')
survey = pd.read_csv('data/survey_survey.csv')
print(visited)
print(survey)
vs = visited.merge(survey, left_on='ident', right_on='taken')
print(vs)
$ python3 5-2.py
ident site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1939-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaN
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
4 734 pb rad 8.41
5 734 lake sal 0.05
6 734 pb temp -21.50
7 735 pb rad 7.22
8 735 NaN sal 0.06
9 735 NaN temp -26.00
10 751 pb rad 4.35
11 751 pb temp -18.50
12 751 lake sal 0.10
13 752 lake rad 2.19
14 752 lake sal 0.09
15 752 lake temp -16.00
16 752 roe sal 41.60
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25
ident site dated taken person quant reading
0 619 DR-1 1927-02-08 619 dyer rad 9.82
1 619 DR-1 1927-02-08 619 dyer sal 0.13
2 622 DR-1 1927-02-10 622 dyer rad 7.80
3 622 DR-1 1927-02-10 622 dyer sal 0.09
4 734 DR-3 1939-01-07 734 pb rad 8.41
5 734 DR-3 1939-01-07 734 lake sal 0.05
6 734 DR-3 1939-01-07 734 pb temp -21.50
7 735 DR-3 1930-01-12 735 pb rad 7.22
8 735 DR-3 1930-01-12 735 NaN sal 0.06
9 735 DR-3 1930-01-12 735 NaN temp -26.00
10 751 DR-3 1930-02-26 751 pb rad 4.35
11 751 DR-3 1930-02-26 751 pb temp -18.50
12 751 DR-3 1930-02-26 751 lake sal 0.10
13 752 DR-3 NaN 752 lake rad 2.19
14 752 DR-3 NaN 752 lake sal 0.09
15 752 DR-3 NaN 752 lake temp -16.00
16 752 DR-3 NaN 752 roe sal 41.60
17 837 MSK-4 1932-01-14 837 lake rad 1.46
18 837 MSK-4 1932-01-14 837 lake sal 0.21
19 837 MSK-4 1932-01-14 837 roe sal 22.50
20 844 DR-1 1932-03-22 844 roe rad 11.25
- 使用者輸入
import pandas as pd
from numpy import NaN, NAN, nan
num_legs = pd.Series({'goat': 4, 'amoeba': nan})
print(num_legs)
scientists = pd.DataFrame({'Name': ['Rosaline Franklin', 'William Gosset'],
'Occupation': ['Chemist', 'Statistician'],
'Born': ['1920-07-25', '1876-06-13'],
'Died': ['1958-04-16', '1937-10-16'],
'missing': [NaN, nan]})
print(scientists)
scientists['missing'] = nan
print(scientists)
$ python3 5-3.py
amoeba NaN
goat 4.0
dtype: float64
Born Died Name Occupation missing
0 1920-07-25 1958-04-16 Rosaline Franklin Chemist NaN
1 1876-06-13 1937-10-16 William Gosset Statistician NaN
Born Died Name Occupation missing
0 1920-07-25 1958-04-16 Rosaline Franklin Chemist NaN
1 1876-06-13 1937-10-16 William Gosset Statistician NaN
- 重新索引
5-4.py
import pandas as pd
from numpy import NaN, NAN, nan
gapminder = pd.read_csv('data/gapminder.tsv', sep='\t')
life_exp = gapminder.groupby(['year'])['lifeExp'].mean()
print(life_exp)
print(life_exp.reindex(range(2000, 2010)))
year
1952 49.057620
1957 51.507401
1962 53.609249
1967 55.678290
1972 57.647386
1977 59.570157
1982 61.533197
1987 63.212613
1992 64.160338
1997 65.014676
2002 65.694923
2007 67.007423
Name: lifeExp, dtype: float64
year
2000 NaN
2001 NaN
2002 65.694923
2003 NaN
2004 NaN
2005 NaN
2006 NaN
2007 67.007423
2008 NaN
2009 NaN
Name: lifeExp, dtype: float64
處理缺失資料
- 統計缺失資料
5-5.py
import pandas as pd
from numpy import NaN, NAN, nan
import numpy as np
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.head())
print(ebola.count())
num_rows = ebola.shape[0]
print("num_rows")
print(num_rows)
num_missing = num_rows - ebola.count()
print("num_missing:")
print(num_missing)
print(np.count_nonzero(ebola.isnull()))
print(np.count_nonzero(ebola['Cases_Guinea'].isnull()))
print(ebola.Cases_Guinea.value_counts(dropna=False).head())
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 NaN 8157.0 NaN
4 12/31/2014 284 2730.0 8115.0 9633.0
Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria \
0 1786.0 NaN 2977.0 NaN
1 1781.0 NaN 2943.0 NaN
2 1767.0 3496.0 2915.0 NaN
3 NaN 3496.0 NaN NaN
4 1739.0 3471.0 2827.0 NaN
Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
Date 122
Day 122
Cases_Guinea 93
Cases_Liberia 83
Cases_SierraLeone 87
Cases_Nigeria 38
Cases_Senegal 25
Cases_UnitedStates 18
Cases_Spain 16
Cases_Mali 12
Deaths_Guinea 92
Deaths_Liberia 81
Deaths_SierraLeone 87
Deaths_Nigeria 38
Deaths_Senegal 22
Deaths_UnitedStates 18
Deaths_Spain 16
Deaths_Mali 12
dtype: int64
num_rows
122
num_missing:
Date 0
Day 0
Cases_Guinea 29
Cases_Liberia 39
Cases_SierraLeone 35
Cases_Nigeria 84
Cases_Senegal 97
Cases_UnitedStates 104
Cases_Spain 106
Cases_Mali 110
Deaths_Guinea 30
Deaths_Liberia 41
Deaths_SierraLeone 35
Deaths_Nigeria 84
Deaths_Senegal 100
Deaths_UnitedStates 104
Deaths_Spain 106
Deaths_Mali 110
dtype: int64
1214
29
NaN 29
86.0 3
495.0 2
112.0 2
390.0 2
Name: Cases_Guinea, dtype: int64
5-6.py
import pandas as pd
from numpy import NaN, NAN, nan
import numpy as np
ebola = pd.read_csv('data/country_timeseries.csv')
print(ebola.iloc[0:10, 0:5])
print(ebola.fillna(0).iloc[0:10, 0:5])
# 前向填充
print(ebola.fillna(method='ffill').iloc[0:10, 0:5])
# 後向填充
print(ebola.fillna(method='bfill').iloc[0:10, 0:5])
print(ebola.interpolate().iloc[0:10, 0:5])
print(ebola.shape)
ebola_dropna = ebola.dropna()
print(ebola_dropna.shape)
print(ebola_dropna)
ebola['Cases_multiple'] = ebola['Cases_Guinea'] + ebola['Cases_Liberia'] + \
ebola['Cases_SierraLeone']
ebola_subset = ebola.loc[:, ['Cases_Guinea', 'Cases_Liberia',
'Cases_SierraLeone', 'Cases_multiple']]
print(ebola_subset.head(n=10))
print(ebola.Cases_Guinea.sum(skipna = True))
print(ebola.Cases_Guinea.sum(skipna = False))
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 NaN 8157.0 NaN
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 NaN 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 NaN 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 0.0 10030.0
1 1/4/2015 288 2775.0 0.0 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 0.0 8157.0 0.0
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 0.0 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 0.0 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2769.0 8157.0 9722.0
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 8018.0 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 7977.0 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 8166.0 10030.0
1 1/4/2015 288 2775.0 8166.0 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2730.0 8157.0 9633.0
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 7977.0 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 7862.0 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone
0 1/5/2015 289 2776.0 NaN 10030.0
1 1/4/2015 288 2775.0 NaN 9780.0
2 1/3/2015 287 2769.0 8166.0 9722.0
3 1/2/2015 286 2749.5 8157.0 9677.5
4 12/31/2014 284 2730.0 8115.0 9633.0
5 12/28/2014 281 2706.0 8018.0 9446.0
6 12/27/2014 280 2695.0 7997.5 9409.0
7 12/24/2014 277 2630.0 7977.0 9203.0
8 12/21/2014 273 2597.0 7919.5 9004.0
9 12/20/2014 272 2571.0 7862.0 8939.0
(122, 18)
(1, 18)
Date Day Cases_Guinea Cases_Liberia Cases_SierraLeone \
19 11/18/2014 241 2047.0 7082.0 6190.0
Cases_Nigeria Cases_Senegal Cases_UnitedStates Cases_Spain Cases_Mali \
19 20.0 1.0 4.0 1.0 6.0
Deaths_Guinea Deaths_Liberia Deaths_SierraLeone Deaths_Nigeria \
19 1214.0 2963.0 1267.0 8.0
Deaths_Senegal Deaths_UnitedStates Deaths_Spain Deaths_Mali
19 0.0 1.0 0.0 6.0
Cases_Guinea Cases_Liberia Cases_SierraLeone Cases_multiple
0 2776.0 NaN 10030.0 NaN
1 2775.0 NaN 9780.0 NaN
2 2769.0 8166.0 9722.0 20657.0
3 NaN 8157.0 NaN NaN
4 2730.0 8115.0 9633.0 20478.0
5 2706.0 8018.0 9446.0 20170.0
6 2695.0 NaN 9409.0 NaN
7 2630.0 7977.0 9203.0 19810.0
8 2597.0 NaN 9004.0 NaN
9 2571.0 7862.0 8939.0 19372.0
84729.0
nan
參考資料
- 技術支援qq群144081101 591302926 567351477 釘釘免費群21745728
- 本文最新版本位址
- 本文涉及的python測試開發庫 謝謝點贊!
- 本文相關海量書籍下載下傳
- 源碼下載下傳
- 本文英文版書籍下載下傳