幹貨：12個案例教你用Python玩轉資料可視化（建議收藏）

導讀：相比于科學，資料分析更像是一門藝術。建立樣式優美的資料可視化是這個藝術中不可缺少的部分。然而，某些人認為優美的，也會有人覺得難以接受。和藝術類似，随着資料分析的快速演變，人們的觀念和品味也一直在變化。但是總的來說沒有人是絕對正确和錯誤的。

作為一個資料藝術家以及有經驗的Python程式員，我們可以從matplotlib、Seaborn、Bokeh和ggplot這些庫裡面選擇一些來使用。

作者：伊凡·伊德裡斯（Ivan Idris）

如需轉載請聯系大資料（ID：hzdashuju）

01 圖形化安斯庫姆四重奏

安斯庫姆四重奏（Anscombe's Quartet）是一個經典案例，它可以說明為什麼可視化是很重要的。四重奏包含了四組統計特性一緻的資料。每個資料集有一些x值以及相對應的y值，我們将在一個IPython Notebook中列出這些名額。如果你繪制出這些資料集，你将發現這些圖表截然不同。

操作步驟

在本節你需要執行如下操作：

（1）由如下導入開始：

import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as plt 
import matplotlib as mpl 
from dautil import report 
from dautil import plotting 
import numpy as np 
from tabulate import tabulate

複制

（2）定義以下函數來計算某一資料集中x和y的均值和方差、相關系數，以及斜率和每個資料集的線性拟合的截距：

def aggregate():
    df = sns.load_dataset("anscombe")

    agg = df.groupby('dataset')\ 
        .agg([np.mean, np.var])\ 
        .transpose()
    groups = df.groupby('dataset')

    corr = [g.corr()['x'][1] for _, g in groups]
    builder = report.DFBuilder(agg.columns)
    builder.row(corr)

    fits = [np.polyfit(g['x'], g['y'], 1) for _, g in groups] 
    builder.row([f[0] for f in fits]) 
    builder.row([f[1] for f in fits]) 
    bottom = builder.build(['corr', 'slope', 'intercept'])

    return df, pd.concat((agg, bottom))

複制

（3）下面這個函數傳回一個字元串，這個字元串有一部分是Markdown，有一部分是重組的文字，有一部分是HTML，這主要是因為原生的Markdown不支援圖表：

def generate(table):
    writer = report.RSTWriter()
    writer.h1('Anscombe Statistics')
    writer.add(tabulate(table, tablefmt='html', floatfmt='.3f'))
    return writer.rst

複制

（4）繪制資料并相應地與Seaborn的lmplot()函數線性拟合：

def plot(df):
    sns.set(style="ticks")
    g = sns.lmplot(x="x", y="y", col="dataset",
        hue="dataset", data=df,
        col_wrap=2, ci=None, palette="muted", size=4,
        scatter_kws={"s": 50, "alpha": 1})

    plotting.embellish(g.fig.axes)

複制

（5）展示一個統計資料的表格如下：

df, table = aggregate()
from IPython.display import display_markdown
display_markdown(generate(table), raw=True)

複制

下表中顯示每個資料集的幾乎相同的統計資料（我修改了IPython配置檔案裡的 custom.css，是以下表是有顔色的）：

（6）以下幾行代碼繪制了資料集：

%matplotlib inline
plot(df)

複制

請參見以下截圖了解最終結果：

02 選擇Seaborn的調色闆

Seaborn的調色闆和matplotlib的顔色表類似。色彩可以幫助你發現資料中的模式，也是重要的可視化組成部分。Seaborn有很豐富的調色闆，在這個示例中會将其可視化。

操作步驟

（1）導入部分如下：

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from dautil import plotting

複制

（2）使用以下函數幫助繪制調色闆：

def plot_palette(ax, plotter, pal, i, label, ncol=1):
    n = len(pal)
    x = np.linspace(0.0, 1.0, n)
    y = np.arange(n) + i*n
    ax.scatter(x, y, c=x,
                cmap=mpl.colors.ListedColormap(list(pal)), 
                s=200)
    plotter.plot(x,y,label=label)
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(loc='best', ncol=ncol, fontsize=18)

複制

（3）分類調色闆（categorical palette）對于分類資料很有用，例如性别、血型等。以下函數可以繪制一些Seaborn的分類調色闆：

def plot_categorical_palettes(ax):
    palettes = ['deep', 'muted', 'pastel', 'bright', 'dark','colorblind']
    plotter = plotting.CyclePlotter(ax)
    ax.set_title('Categorical Palettes')

    for i, p in enumerate(palettes):
        pal = sns.color_palette(p)
        plot_palette(ax, plotter, pal, i, p, 4)

複制

（4）圓形色彩系統（circular color system）通常用HLS（色度亮度飽和度，Hue Lightness Saturation）來取代RGB（紅綠藍Red Gree Blue）顔色空間。如果你有很多分類這将會很有用。以下函數可以使用HLS系統繪制調色闆。

def plot_circular_palettes(ax):
    ax.set_title('Circular Palettes')
    plotter = plotting.CyclePlotter(ax)

    pal = sns.color_palette("hls", 6)
    plot_palette(ax, plotter, pal, 0, 'hls')

    sns.hls_palette(6, l=.3, s=.8)
    plot_palette(ax, plotter, pal, 1, 'hls l=.3 s=.8')

    pal = sns.color_palette("husl", 6)
    plot_palette(ax, plotter, pal, 2, 'husl')

    sns.husl_palette(6, l=.3, s=.8)
    plot_palette(ax, plotter, pal, 3, 'husl l=.3 s=.8')

複制

（5）Seaborn也有基于線上的ColorBrewer工具的調色闆（http://colorbrewer2.org/）。用以下函數繪制出來：

def plot_brewer_palettes(ax):
    ax.set_title('Brewer Palettes')
    plotter = plotting.CyclePlotter(ax)

    pal = sns.color_palette("Paired")
    plot_palette(ax, plotter, pal, 0, 'Paired')

    pal = sns.color_palette("Set2", 6)
    plot_palette(ax, plotter, pal, 1, 'Set2')

複制

（6）連續調色闆（sequential palettes）對于資料範圍很廣的資料來說很有用，比如說有數量級差異的資料。用以下函數繪制出來：

def plot_sequential_palettes(ax):
    ax.set_title('Sequential Palettes')
    plotter = plotting.CyclePlotter(ax)

    pal = sns.color_palette("Blues")
    plot_palette(ax, plotter, pal, 0, 'Blues')

    pal = sns.color_palette("BuGn_r")
    plot_palette(ax, plotter, pal, 1, 'BuGn_r')

    pal = sns.color_palette("GnBu_d")
    plot_palette(ax, plotter, pal, 2, 'GnBu_d')

    pal = sns.color_palette("cubehelix", 6)
    plot_palette(ax, plotter, pal, 3, 'cubehelix')

複制

（7）以下幾行代碼調用了我們之前定義的函數：

%matplotlib inline

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
plot_categorical_palettes(axes[0][0])
plot_circular_palettes(axes[0][1])
plot_brewer_palettes(axes[1][0])
plot_sequential_palettes(axes[1][1])
plotting.hide_axes(axes)
plt.tight_layout()

複制

請參見以下截圖了解最終結果：

03 選擇matplotlib的顔色表

matplotlib的顔色表最近受到了很多批評，因為它們可能會誤導使用者，但是在我看來大多數的顔色表還是不錯的。預設的顔色表在matplotlib 2.0中有一些改進，可以在這裡檢視：

http://matplotlib.org/style_changes.html

當然，有些matplotlib的顔色表不支援一些不錯的參數，比如說jet。在藝術中，就像資料分析中一樣，幾乎沒有什麼東西是絕對正确的，是以這裡就交給讀者去判斷。

實際上，我覺得考慮如何解決印刷出版物以及各種各樣的色盲問題是很重要的。在這個示例中我将用色條來可視化相對安全的顔色表。這裡使用到的是matplotlib衆多顔色表中的很小一部分。

操作步驟

（1）導入部分如下：

import matplotlib.pyplot as plt
import matplotlib as mpl
from dautil import plotting

複制

（2）通過以下代碼畫出資料集：

fig, axes = plt.subplots(4, 4)
cmaps = ['autumn', 'spring', 'summer', 'winter',
         'Reds', 'Blues', 'Greens', 'Purples',
         'Oranges', 'pink', 'Greys', 'gray',
         'binary', 'bone', 'hot', 'cool']

for ax, cm in zip(axes.ravel(), cmaps):
    cmap = plt.cm.get_cmap(cm)
    cb = mpl.colorbar.ColorbarBase(ax, cmap=cmap,
                                    orientation='horizontal')
    cb.set_label(cm)
    ax.xaxis.set_ticklabels([])

plt.tight_layout()
plt.show()

複制

請參見以下截圖了解最終結果：

04 與IPython Notebook部件互動

簡單來說，這些部件可以讓你像在HTML表單裡一樣選擇一些值，這包括滑塊、下拉框、選擇框等。正如你會讀到的，這些部件非常友善将我們在第1章中提及的天氣資料可視化。

操作步驟

（1）導入部分如下：

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.html.widgets import interact
from dautil import data
from dautil import ts

複制

（2）加載資料同時請求内聯圖：

%matplotlib inline
df = data.Weather.load()

複制

（3）定義以下函數，這個函數會顯示氣泡圖：

def plot_data(x='TEMP', y='RAIN', z='WIND_SPEED', f='A', size=10,cmap='Blues'):
    dfx = df[x].resample(f)
    dfy = df[y].resample(f).mean()
    dfz = df[z].resample(f).mean()

    bubbles = (dfz - dfz.min())/(dfz.max() - dfz.min())
    years = dfz.index.year
    sc = plt.scatter(dfx, dfy, s= size * bubbles + 9, c = years, 
                    cmap=cmap, label=data.Weather.get_header(z), 
                    alpha=0.5)
    plt.colorbar(sc, label='Year')

    freqs = {'A': 'Annual', 'M': 'Monthly', 'D': 'Daily'}
    plt.title(freqs[f] + ' Averages')
    plt.xlabel(data.Weather.get_header(x))
    plt.ylabel(data.Weather.get_header(y))
    plt.legend(loc='best')

複制

（4）通過以下代碼調用我們剛剛定義的函數：

vars = df.columns.tolist()
freqs = ('A', 'M', 'D')
cmaps = [cmap for cmap in plt.cm.datad if not cmap.endswith("_r")]
cmaps.sort()
interact(plot_data, x=vars, y=vars, z=vars, f=freqs,size=(100,700), cmap=cmaps)

複制

（5）本示例需要上手操作一下來了解它的工作原理，下面是一個樣例氣泡圖：

（6）定義另一個函數（和第（2）步中的程式同名，注釋掉前一個），這個函數裡我們将資料按照日或月進行分組：

def plot_data(x='TEMP', y='RAIN', z='WIND_SPEED', 
              groupby='ts.groupby_yday', 
              size=10, cmap='Blues'):
    if groupby == 'ts.groupby_yday':
        groupby = ts.groupby_yday
    elif groupby == 'ts.groupby_month':
        groupby = ts.groupby_month
    else:
        raise AssertionError('Unknown groupby ' + groupby)
    dfx = groupby(df[x]).mean()
    dfy = groupby(df[y]).mean()
    dfz = groupby(df[z]).mean()
    bubbles = (dfz - dfz.min())/(dfz.max() - dfz.min())
    colors = dfx.index.values
    sc = plt.scatter(dfx, dfy, s= size * bubbles + 9, 
                     c = colors,cmap=cmap, 
                     label=data.Weather.get_header(z), alpha=0.5)
    plt.colorbar(sc, label='Day of Year')
    by_dict = {ts.groupby_yday: 'Day of Year', ts.groupby_month: 'Month'}
    plt.title('Grouped by ' + by_dict[groupby])
    plt.xlabel(data.Weather.get_header(x))
    plt.ylabel(data.Weather.get_header(y))
    plt.legend(loc='best')

複制

（7）用這段代碼調用上述函數：

groupbys = ('ts.groupby_yday', 'ts.groupby_month')
interact(plot_data, x=vars, y=vars, z=vars, groupby=groupbys,
size=(100,700), cmap=cmaps)

複制

請參見以下截圖了解最終結果：

我對這個圖的第一印象是溫度和風速似乎是正相關的。

05 檢視散點圖矩陣

如果你的資料集中變量不是很多，那麼檢視你資料所有的散點圖是個不錯的主意。通過調用Seaborn或者pandas的一個函數就可以做到。這些函數會展示一個矩陣的核密度估計圖或對角線上的直方圖。

操作步驟

（1）導入部分如下：

import pandas as pd
from dautil import data
from dautil import ts
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl

複制

（2）以下幾行代碼加載天氣資料：

df = data.Weather.load()
df = ts.groupby_yday(df).mean()
df.columns = [data.Weather.get_header(c) for c in df.columns]

複制

（3）用Seaborn的pairplot()函數繪制圖形，這個函數預設繪制對角線上的直方圖：

%matplotlib inline

# Seaborn plotting, issues due to NaNs
sns.pairplot(df.fillna(0))

複制

結果如下所示：

（4）通過pandas的scatter_matrix()函數生成一個類似的圖形，并請求對角線上的核密度估計圖：

sns.set({'figure.figsize': '16, 12'})
mpl.rcParams['axes.linewidth'] = 9
mpl.rcParams['lines.linewidth'] = 2
plots = pd.scatter_matrix(df, marker='o', diagonal='kde')
plt.show()

複制

請參見以下截圖了解最終結果：

06 通過mpld3使用d3.js進行可視化

d3.js是在2011年推出的一個JavaScript資料可視化庫，我們可以在IPython Notebook裡面使用這個庫。我們将在一個普通matplotlib圖上添加一個懸浮工具提示。這裡我們會使用mpld3包作為使用d3.js的橋梁。這個示例不需要任何JavaScript程式設計。

1. 準備工作

通過以下指令安裝mpld3 0.2：

$ [sudo] pip install mpld3

複制

2. 操作步驟

（1）由導入開始，并啟用mpld3：

%matplotlib inline
import matplotlib.pyplot as plt
import mpld3
mpld3.enable_notebook()
from mpld3 import plugins
import seaborn as sns
from dautil import data
from dautil import ts

複制

（2）加載天氣資料并按照下面的方法将其繪制出來：

df = data.Weather.load()
df = df[['TEMP', 'WIND_SPEED']]
df = ts.groupby_yday(df).mean()

fig, ax = plt.subplots()
ax.set_title('Averages Grouped by Day of Year')
points = ax.scatter(df['TEMP'], df['WIND_SPEED'],
                    s=30, alpha=0.3)
ax.set_xlabel(data.Weather.get_header('TEMP'))
ax.set_ylabel(data.Weather.get_header('WIND_SPEED'))
labels = ["Day of year {0}".format(i) for i in range(366)]
tooltip = plugins.PointLabelTooltip(points, labels)

plugins.connect(fig, tooltip)

複制

高亮顯示的那一行是工具欄。在下面的截圖中，我們可以看到“Day of year 31”文本來自這個工具欄：

如你所見，在這個圖形的底部，還有可以平移和縮放圖形的裝置。

07 建立熱圖

熱圖使用一組顔色在矩陣中可視化資料。最初，熱圖用于表示金融資産（如股票）的價格。Bokeh是一個Python包，可以在IPython Notebook中顯示熱圖，或者生成一個獨立的HTML檔案。

1. 準備工作

Anaconda自帶了Bokeh 0.9.1。Bokeh的安裝說明在：

http://bokeh.pydata.org/en/latest/docs/installation.html

2. 操作步驟

（1）導入部分如下：

from collections import OrderedDict
from dautil import data
from dautil import ts
from dautil import plotting
import numpy as np
import bokeh.plotting as bkh_plt
from bokeh.models import HoverTool

複制

（2）下面的函數加載了溫度資料并按照年和月進行分組：

def load():
    df = data.Weather.load()['TEMP']
    return ts.groupby_year_month(df)

複制

（3）定義一個将資料重排成特殊的Bokeh結構的函數：

def create_source():
    colors = plotting.sample_hex_cmap()
    month = []
    year = []
    color = []
    avg = []
    for year_month, group in load():
        month.append(ts.short_month(year_month[1]))
        year.append(str(year_month[0]))
        monthly_avg = np.nanmean(group.values)
        avg.append(monthly_avg)
        color.append(colors[min(int(abs(monthly_avg)) - 2, 8)])
    source = bkh_plt.ColumnDataSource(data=dict(month=month, year=year, color=color, avg=avg))
    return year, source

複制

（4）定義一個傳回橫軸标簽的函數：

def all_years():
    years = set(year)
    start_year = min(years)
    end_year = max(years)
    return [str(y) for y in range(int(start_year), int(end_year),5)]

複制

（5）定義一個繪制包含了懸浮工具欄的熱圖的函數：

def plot(year, source):
    fig = bkh_plt.figure(title="De Bilt, NL Temperature (1901 -2014)",
                                x_range=all_years(),
                                y_range=list(reversed(ts.short_months())),
                                toolbar_location="left",
                                tools="resize,hover,save,
                                pan,box_zoom,wheel_zoom")
    fig.rect("year", "month", 1, 1, source=source,
                color="color", line_color=None)

    fig.xaxis.major_label_orientation = np.pi/3

    hover = fig.select(dict(type=HoverTool))
    hover.tooltips = OrderedDict([
        ('date', '@month @year'),
        ('avg', '@avg'),
    ])

    bkh_plt.output_notebook()
    bkh_plt.show(fig)

複制

（6）調用上述定義的函數：

year, source = create_source()
plot(year, source)

複制

請參見以下截圖了解最終結果：

08 把箱線圖、核密度圖和小提琴圖組合

小提琴圖（Violin Plot）是一種組合盒圖和核密度圖或直方圖的圖形類型。Seaborn和matplotlib都能提供小提琴圖。在這個示例中我們将使用Seaborn來繪制天氣資料的Z分數（标準分數），分數的标準化并不是必需的，但是如果沒有它的話小提琴圖會很發散。

操作步驟

（1）導入部分如下：

import seaborn as sns
from dautil import data
import matplotlib.pyplot as plt

複制

（2）加載天氣資料并計算标準分數：

df = data.Weather.load()
zscores = (df - df.mean())/df.std()

複制

（3）繪制标準分數的小提琴圖：

%matplotlib inline
plt.figure()
plt.title('Weather Violin Plot')
sns.violinplot(zscores.resample('M').mean())
plt.ylabel('Z-scores')

複制

第一個小提琴圖如下所示：

（4）繪制雨天和旱天相對風速的小提琴圖：

plt.figure()
plt.title('Rainy Weather vs Wind Speed')
categorical = df
categorical['RAIN'] = categorical['RAIN'] > 0
ax = sns.violinplot(x="RAIN", y="WIND_SPEED",data=categorical)

複制

第二個小提琴圖如下所示：

09 使用蜂巢圖可視化網絡圖

蜂巢圖（Hive Plot）是用于繪制網絡圖的可視化技術。在蜂巢圖中我們将邊緣繪制為曲線。我們根據屬性對節點進行分組，并在徑向軸上顯示它們。

有些庫在蜂窩圖方面很專業。同時我們将使用API來劃分Facebook使用者的圖形。

https://snap.stanford.edu/data/egonets-Facebook.html

這個資料屬于斯坦福網絡分析項目（Stanford Network Analysis Project，SNAP），它也提供了Python API，但是目前SNAP API還不支援Python 3。

1. 準備工作

Anaconda自帶了NetworkX 1.9.1，它安裝說明可見：

https://networkx.github.io/documentation/latest/install.html

同時我們還需要community包，安裝位址：

https://bitbucket.org/taynaud/python-louvain

在PyPi上有一個同名的包，但是它和我們需要安裝的沒有任何關系。安裝hiveplot包，這個包托管在：

https://github.com/ericmjl/hiveplot

$ [sudo] pip install hiveplot

複制

本示例中使用的hiveplot版本是0.1.7.4。

2. 操作步驟

（1）導入部分如下所示：

import networkx as nx
import community
import matplotlib.pyplot as plt
from hiveplot import HivePlot
from collections import defaultdict
from dautil import plotting
from dautil import dataython

複制

（2）載入資料，建立一個NetworkX的Graph對象：

fb_file = data.SPANFB().load()
G = nx.read_edgelist(fb_file,create_using = nx.Graph(),nodetype = int)
print(nx.info(G))

複制

（3）分割圖形對象并按照如下的方法建立一個nodes字典：

parts = community.best_partition(G)
nodes = defaultdict(list)

for n, d in parts.items():
    nodes[d].append(n)

複制

（4）這個圖形會非常大，是以我們将會建立三個邊緣分組：

edges = defaultdict(list)

for u, v in nx.edges(G, nodes[0]):
    edges[0].append((u, v, 0))

for u, v in nx.edges(G, nodes[1]):
    edges[1].append((u, v, 1))

for u, v in nx.edges(G, nodes[2]):
    edges[2].append((u, v, 2))

複制

（5）繪制這個圖形大約需要6分鐘：

%matplotlib inline
cmap = plotting.sample_hex_cmap(name='hot', ncolors=len(nodes.keys()))
h = HivePlot(nodes, edges, cmap, cmap)
h.draw()
plt.title('Facebook Network Hive Plot')

複制

等待一段時間，我們可以看到如下的圖形：

10 顯示地圖

無論是處理全球資料還是本地資料，使用地圖都是一個适合的可視化方式。我們需要用坐标來将資料定位到地圖上，通常我們使用的就是這個點的經度和緯度。有很多現有的檔案格式可以存儲地理位置資料。

在這個示例中我們将會使用到特别的shapefile格式以及更常見的制表符分隔值（Tab Separated Values，TSV）格式。shapefile格式是由Esri公司建立的，并包含了三個必需的檔案，它們的擴充名分别是.shp、.shx、.dbf。

.dbf檔案包含了shapefile中每一個地理位置的額外資訊的資料庫。我們将使用的shapefile包含了國家邊界、人口以及國内生産總值（Gross Domestic Product，GDP）的資料。我們可以使用cartopy庫下載下傳shapefile。

TSV檔案包含了超過4000個城市的按時間序列的人口資料，可以在這裡獲得：

https://nordpil.com/resources/world-database-of-large-cities/

1. 準備工作

首先我們需要從源檔案安裝Proj.4，或者你也可以使用二進制版本安裝：

https://github.com/OSGeo/proj.4/wiki

Proj.4的安裝說明在：

https://github.com/OSGeo/proj.4

然後我們可以通過pip安裝cartopy，本示例中使用到的是cartopy-0.13.0。或者你也可以通過下面的指令進行安裝：

$ conda install -c scitools cartopy

複制

2. 操作步驟

（1）導入部分如下所示：

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
       import cartopy.io.shapereader as shpreader
       import matplotlib as mpl
       import pandas as pd
       from dautil import options
       from dautil import data

複制

（2）我們會使用顔色來做國家人口以及人口衆多的城市的可視化。引入如下資料：

countries = shpreader.natural_earth(resolution='110m',
                                    category='cultural',
                                    name='admin_0_countries')
cities = pd.read_csv(data.Nordpil().load_urban_tsv(),sep='\t', encoding='ISO-8859-1')
mill_cities = cities[cities['pop2005'] > 1000]

複制

（3）使用以下代碼畫出地圖，以及相應的顔色條，并将人口衆多的城市标記在地圖上：

%matplotlib inline
plt.figure(figsize=(16, 12))
gs = mpl.gridspec.GridSpec(2, 1,
                          height_ratios=[20, 1])
ax = plt.subplot(gs[0], projection=ccrs.PlateCarree())

norm = mpl.colors.Normalize(vmin=0, vmax=2 * 10 ** 9)
cmap = plt.cm.Blues
ax.set_title('Population Estimates by Country')

for country in shpreader.Reader(countries).records():
   ax.add_geometries(country.geometry, ccrs.PlateCarree(),
                     facecolor=cmap(
                         norm(country.attributes['pop_est'])))

plt.plot(mill_cities['Longitude'],
         mill_cities['Latitude'], 'r.',
         label='Populous city',
         transform=ccrs.PlateCarree())

options.set_mpl_options()
plt.legend(loc='lower left')

cax = plt.subplot(gs[1])
cb = mpl.colorbar.ColorbarBase(cax,
                               cmap=cmap,
                               norm=norm,
                               orientation='horizontal')

cb.set_label('Population Estimate')
plt.tight_layout()

複制

11 使用類ggplot2圖

ggplot2是在R語言使用者群中很流行的資料可視化庫。ggplot2的主要思想是在資料可視化的産出中包含多個圖層。就像一個畫家，我們從一個空的畫布開始，緊接着一步步地添加圖層。

通常我們使用rpy2來讓Python接入R語言代碼。然而，如果我們隻是想使用ggplot2的話，用pyggplot庫會顯得更加友善。在這個示例中将實作三個國家的人口增長的可視化，使用的資料來自pandas上檢索到的世界銀行的資料。這些資料中包含各種名額和相關中繼資料。在這裡可以下載下傳到關于這些名額的描述：

http://api.worldbank.org/v2/en/topic/19?downloadformat=excel

我們可以認為世界銀行的資料集是靜态的。然而，類似的資料集經常發生變化，足以占用分析師所有的時間。更換名額的名字明顯會影響代碼，是以我決定通過joblib庫來緩存資料。但是這個方法美中不足的是不能pickle所有的Python對象。

1. 準備工作

首先你需要有安裝了ggplot2的R語言環境。如果你不是特别想使用ggplot2，或許你可以跳過這個示例。

R語言的首頁是：

http://www.r-project.org/

ggplot2的文檔：

http://docs.ggplot2.org/current/index.html

你可以通過pip安裝pyggplot，我使用的是pyggplot-23。安裝joblib，請浏覽：

https://pythonhosted.org/joblib/installing.html

我的Anaconda中有joblib 0.8.4。

2. 操作步驟

（1）導入部分如下：

import pyggplot
from dautil import data

複制

（2）通過以下代碼加載資料：

dawb = data.Worldbank()
pop_grow = dawb.get_name('pop_grow')
df = dawb.download(indicator=pop_grow, start=1984, end=2014)
df = dawb.rename_columns(df, use_longnames=True)

複制

（3）下面用我們建立的pandas對象DataFrame初始化pyggplot：

p = pyggplot.Plot(df)

複制

（4）添加條形圖：

p.add_bar('country', dawb.get_longname(pop_grow), color='year')

複制

（5）翻轉圖表，使條形圖指向右邊并渲染

p.coord_flip()
p.render_notebook()

複制

請參見以下截圖了解最終結果：

12 使用影響圖高亮資料

類似于氣泡圖，影響圖（influence plot）會考慮到單個資料點拟合、影響和杠杆之後的殘差。殘差的大小繪制在垂直軸上，并且可以辨別資料點是異常值。為了更好地了解影響圖，可以看下面的這些方程。

根據statsmodels文檔，殘差按标準偏差式（2.1）進行縮放，在式（2.2）中，n是觀測點的數量，p是回歸量。式（2.3）我們習慣稱之為帽子矩陣（hat-matrix）。帽子矩陣的對角元素給出稱為杠杆（leverage）的特殊度量，杠杆作為水準軸的量，可以辨別出影響圖的潛在影響。

在影響圖中，影響會決定繪圖點的大小。影響大的點往往具有高殘差和杠杆。statsmodels可以使用Cook距離（Cook's distance）（見式（2.4））或者DFFITS（見式（2.5））來衡量影響值。

操作步驟

（1）導入部分如下：

import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
from dautil import data

複制

（2）擷取可用的國家的編碼：

dawb = data.Worldbank()
countries = dawb.get_countries()[['name', 'iso2c']]

複制

（3）從世界銀行加載資料：

population = dawb.download(indicator=[dawb.get_name('pop_grow'),
                                      dawb.get_name('gdp_pcap'),
                                      dawb.get_name('primary_education')],
                           country=countries['iso2c'],
                           start=2014,
                           end=2014)
population = dawb.rename_columns(population)

複制

（4）定義一個普通最小二乘模型如下：

population_model = ols("pop_grow ~ gdp_pcap + primary_education",
                       data=population).fit()

複制

（5）使用Cook距離描繪這個模型的影響圖：

%matplotlib inline
fig, ax = plt.subplots(figsize=(19.2, 14.4))
fig = sm.graphics.influence_plot(population_model, ax=ax, criterion="cooks")
plt.grid()

複制

請參見以下截圖了解最終結果：

關于作者：Ivan Idris，曾是Java和資料庫應用開發者，後專注于Python和資料分析領域，緻力于編寫幹淨、可測試的代碼。他還是《Python Machine Learning By Example》《NumPy Cookbook》等書的作者，在工程實踐和書籍撰寫方面都非常有經驗。

本文摘編自《Python資料分析實戰》，經出版方授權釋出。

延伸閱讀《Python資料分析實戰》

點選上圖了解及購買

轉載請聯系微信：DoctorData

推薦語：面向實際問題的Python資料分析實踐指南，通過豐富的執行個體、大量的代碼片段和圖例，可以幫助你快速掌握用Python進行資料分析的各種技術。