Contents
- Reading the data
- Cleaning the data
- PeriodIndex: time periods
- Optimization
Data source: https://www.kaggle.com/uciml/pm25-data-for-five-chinese-cities
Reading the data
from matplotlib import font_manager
from matplotlib import pyplot as plt
import pandas as pd
file_path = "./BeijingPM2.5.csv"
df = pd.read_csv(file_path)
print(df.head())
print(df.info())
No year month day hour season PM_Dongsi PM_Dongsihuan \
0 1 2010 1 1 0 4 NaN NaN
1 2 2010 1 1 1 4 NaN NaN
2 3 2010 1 1 2 4 NaN NaN
3 4 2010 1 1 3 4 NaN NaN
4 5 2010 1 1 4 4 NaN NaN
PM_Nongzhanguan PM_US Post DEWP HUMI PRES TEMP cbwd Iws \
0 NaN NaN -21.0 43.0 1021.0 -11.0 NW 1.79
1 NaN NaN -21.0 47.0 1020.0 -12.0 NW 4.92
2 NaN NaN -21.0 43.0 1019.0 -11.0 NW 6.71
3 NaN NaN -21.0 55.0 1019.0 -14.0 NW 9.84
4 NaN NaN -20.0 51.0 1018.0 -12.0 NW 12.97
precipitation Iprec
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52584 entries, 0 to 52583
Data columns (total 18 columns):
No 52584 non-null int64
year 52584 non-null int64
month 52584 non-null int64
day 52584 non-null int64
hour 52584 non-null int64
season 52584 non-null int64
PM_Dongsi 25052 non-null float64
PM_Dongsihuan 20508 non-null float64
PM_Nongzhanguan 24931 non-null float64
PM_US Post 50387 non-null float64
DEWP 52579 non-null float64
HUMI 52245 non-null float64
PRES 52245 non-null float64
TEMP 52579 non-null float64
cbwd 52579 non-null object
Iws 52579 non-null float64
precipitation 52100 non-null float64
Iprec 52100 non-null float64
dtypes: float64(11), int64(6), object(1)
memory usage: 7.2+ MB
None
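The info() output above shows that the PM columns hold far fewer non-null values than the 52584 rows (PM_US Post has 50387, PM_Dongsi only 25052). A minimal sketch of how the share of missing values per column could be summarized; the `sample` frame and its values are made up for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for the Beijing table; the values are made up.
sample = pd.DataFrame({
    "PM_Dongsi": [np.nan, np.nan, 12.0, 15.0],
    "PM_US Post": [np.nan, 129.0, 148.0, 159.0],
    "TEMP": [-11.0, -12.0, -11.0, -14.0],
})

# isna() gives a boolean frame; the column mean is the fraction of missing rows.
missing_share = sample.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same two calls on the real `df` would rank the columns by how incomplete they are.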
Cleaning the data
PeriodIndex: time periods
# Combine the separate year/month/day/hour columns into a pandas time type with PeriodIndex
periods = pd.PeriodIndex(year=df["year"],month=df["month"],day=df["day"],hour=df["hour"],freq="H")
df["datetime"] = periods
print(df.head(10))
No year month day hour season PM_Dongsi PM_Dongsihuan \
0 1 2010 1 1 0 4 NaN NaN
1 2 2010 1 1 1 4 NaN NaN
2 3 2010 1 1 2 4 NaN NaN
3 4 2010 1 1 3 4 NaN NaN
4 5 2010 1 1 4 4 NaN NaN
5 6 2010 1 1 5 4 NaN NaN
6 7 2010 1 1 6 4 NaN NaN
7 8 2010 1 1 7 4 NaN NaN
8 9 2010 1 1 8 4 NaN NaN
9 10 2010 1 1 9 4 NaN NaN
PM_Nongzhanguan PM_US Post DEWP HUMI PRES TEMP cbwd Iws \
0 NaN NaN -21.0 43.0 1021.0 -11.0 NW 1.79
1 NaN NaN -21.0 47.0 1020.0 -12.0 NW 4.92
2 NaN NaN -21.0 43.0 1019.0 -11.0 NW 6.71
3 NaN NaN -21.0 55.0 1019.0 -14.0 NW 9.84
4 NaN NaN -20.0 51.0 1018.0 -12.0 NW 12.97
5 NaN NaN -19.0 47.0 1017.0 -10.0 NW 16.10
6 NaN NaN -19.0 44.0 1017.0 -9.0 NW 19.23
7 NaN NaN -19.0 44.0 1017.0 -9.0 NW 21.02
8 NaN NaN -19.0 44.0 1017.0 -9.0 NW 24.15
9 NaN NaN -20.0 37.0 1017.0 -8.0 NW 27.28
precipitation Iprec datetime
0 0.0 0.0 2010-01-01 00:00
1 0.0 0.0 2010-01-01 01:00
2 0.0 0.0 2010-01-01 02:00
3 0.0 0.0 2010-01-01 03:00
4 0.0 0.0 2010-01-01 04:00
5 0.0 0.0 2010-01-01 05:00
6 0.0 0.0 2010-01-01 06:00
7 0.0 0.0 2010-01-01 07:00
8 0.0 0.0 2010-01-01 08:00
9 0.0 0.0 2010-01-01 09:00
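Recent pandas releases deprecate passing the year/month/day/hour fields directly to the PeriodIndex constructor, so the call above may emit a warning depending on the installed version. One alternative, sketched here and not the method used in this post, is to assemble timestamps with pd.to_datetime, which accepts a frame whose columns are named year, month, day and hour. Note that this yields timestamps rather than periods. The `parts` frame is a hypothetical three-row stand-in for the dataset:

```python
import pandas as pd

# Hypothetical three-row frame with the same field names as the dataset.
parts = pd.DataFrame({
    "year": [2010, 2010, 2010],
    "month": [1, 1, 1],
    "day": [1, 1, 1],
    "hour": [0, 1, 2],
})

# pd.to_datetime assembles one timestamp per row from the named columns.
stamps = pd.to_datetime(parts[["year", "month", "day", "hour"]])
print(stamps)
```

The resulting Series could be assigned to a "datetime" column and used as the index in the same way as the periods.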
# Use the datetime column as the index
df.set_index("datetime",inplace=True)
# Handle missing data by dropping the missing values
data = df["PM_US Post"].dropna()
# Plot
_x = data.index
_y = data.values
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
plt.xticks(range(0,len(_x),20),list(_x)[::20])
plt.show()
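With roughly 50,000 non-null hourly values, a tick every 20 points produces about 2,500 x-axis labels, far more than fit on the figure. A small sketch of how a tick step could be derived from a target label count; `tick_step` and `max_ticks` are hypothetical names introduced here, not part of matplotlib:

```python
import math

def tick_step(n_points, max_ticks=30):
    """Return a step so that range(0, n_points, step) has at most max_ticks entries."""
    return max(1, math.ceil(n_points / max_ticks))

n = 50387  # non-null PM_US Post count reported by df.info()
step = tick_step(n)
positions = range(0, n, step)
print(step, len(positions))
```

The step could then replace the hard-coded 20 in both arguments of plt.xticks.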
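Optimization
# Downsample to one mean value per 7 days
# numeric_only=True skips the text column cbwd, which newer pandas refuses to average
df = df.resample("7D").mean(numeric_only=True)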
data = df["PM_US Post"].dropna()
# Plot
_x = data.index
_x = [i.strftime("%Y%m%d") for i in _x]
_y = data.values
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y)
plt.xticks(range(0,len(_x),10),list(_x)[::10],rotation=45)
plt.show()
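To see what the downsampling step does in isolation, here is a minimal sketch on a synthetic hourly series. The series and its values are made up; it uses a timestamp index rather than the period index from this post:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering 28 days (28 * 24 = 672 points).
idx = pd.date_range("2010-01-01", periods=28 * 24, freq="h")
hourly = pd.Series(np.arange(len(idx), dtype=float), index=idx)

# Downsample: one mean value per 7-day bin.
weekly = hourly.resample("7D").mean()
print(len(hourly), len(weekly))
```

Each output row replaces 168 hourly points with their mean, which is why the weekly curve is much smoother and needs far fewer tick labels.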
# Compare with the Chinese monitoring station at Dongsi
data_china = df["PM_Dongsi"]
# Plot
_x = data.index
_x = [i.strftime("%Y%m%d") for i in _x]
_x_china = [i.strftime("%Y%m%d") for i in data_china.index]
_y = data.values
_y_china = data_china.values
plt.figure(figsize=(20,8),dpi=80)
plt.plot(range(len(_x)),_y,label="US_POST")
plt.plot(range(len(_x_china)),_y_china,label="CN_POST")
plt.xticks(range(0,len(_x),10),list(_x)[::10],rotation=45)
plt.legend(loc="best")
plt.show()
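The comparison above plots each series against its own positional range, so the two lines refer to the same weeks only if both series have the same length. Since `data` was passed through dropna() and `data_china` was not, that is not guaranteed. A sketch of one way to keep them aligned, assuming it is acceptable to select both columns from the same frame; the `weekly` frame and its values are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly values; NaN marks weeks where a station has no data.
idx = pd.date_range("2013-01-01", periods=4, freq="7D")
weekly = pd.DataFrame({
    "PM_US Post": [110.0, np.nan, 95.0, 80.0],
    "PM_Dongsi": [np.nan, 120.0, 90.0, 85.0],
}, index=idx)

# Select both columns together so row i of each refers to the same week.
pair = weekly[["PM_US Post", "PM_Dongsi"]]
positions = range(len(pair))
labels = [ts.strftime("%Y%m%d") for ts in pair.index]

# Weeks where both stations report, useful for a like-for-like comparison.
both = pair.dropna()
print(labels)
print(both)
```

Passing `positions` with each column's values to plt.plot keeps the x-axis shared; matplotlib leaves a gap in the line wherever a value is NaN.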