Python Stock Analysis Series: Combining the Data, p.7

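Welcome to part 7 of the Python for Finance tutorial series. In the previous tutorial, we grabbed Yahoo Finance data for the entire S&P 500 list of companies. In this tutorial, we are going to combine that data into a single DataFrame.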

Code up to this point:

import bs4 as bs
import datetime as dt
import os
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


get_data_from_yahoo()      

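Although we now have all the data, we will probably want to evaluate it together. To do that, we are going to join all of the stock datasets into one. Each stock file currently has these columns: Open, High, Low, Close, Volume, and Adj Close. At least to start, we are mostly interested in just the adjusted close.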

def compile_data():
    with open("sp500tickers.pickle","rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()      

First, we pull in the ticker list we made earlier and begin with an empty DataFrame called main_df. Now we are ready to read in each stock's DataFrame:

    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)
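
To try these two lines without any files on disk, here is a minimal sketch that reads a small in-memory CSV. The rows are made up and only mimic the column layout described above. The one-step variant using index_col and parse_dates is an optional alternative, not something the tutorial itself uses.

```python
import io

import pandas as pd

# Made-up rows standing in for one file in stock_dfs/ (illustrative values only).
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2010-01-04,10.0,10.5,9.8,10.2,1000,10.1
2010-01-05,10.2,10.8,10.1,10.6,1200,10.5
"""

# Same two steps as in the tutorial: read the CSV, then move Date into the index.
df = pd.read_csv(io.StringIO(csv_text))
df.set_index('Date', inplace=True)

# Optional one-step variant: index_col picks the index column and
# parse_dates converts it to real datetimes instead of strings.
df_dates = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

print(df.head())
print(type(df_dates.index).__name__)
```

With the two-step version the index holds date strings; that is enough for joining files that all use the same date format.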

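You do not need to use Python's enumerate here; I am only using it so we can see how far along we are in reading all the data. You could simply iterate over the tickers. From this point, we *could* generate extra columns with interesting data, such as: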

        df['{}_HL_pct_diff'.format(ticker)] = (df['High'] - df['Low']) / df['Low']
        df['{}_daily_pct_chng'.format(ticker)] = (df['Close'] - df['Open']) / df['Open']
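
To see what these two columns would contain, here is a minimal sketch on made-up prices. The ticker XYZ and every number are hypothetical.

```python
import pandas as pd

ticker = 'XYZ'  # hypothetical ticker, for illustration only

# Two made-up trading days.
df = pd.DataFrame(
    {
        'Open': [10.0, 20.0],
        'High': [11.0, 22.0],
        'Low': [10.0, 20.0],
        'Close': [10.5, 19.0],
    },
    index=pd.Index(['2010-01-04', '2010-01-05'], name='Date'),
)

# Intraday range relative to the low, and the open-to-close change relative to the open.
df['{}_HL_pct_diff'.format(ticker)] = (df['High'] - df['Low']) / df['Low']
df['{}_daily_pct_chng'.format(ticker)] = (df['Close'] - df['Open']) / df['Open']

print(df[['XYZ_HL_pct_diff', 'XYZ_daily_pct_chng']])
```

Both columns are fractions, not percentages; multiply by 100 if you prefer percent values.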

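For now, though, we will not bother with these. Just know that this may be a path worth pursuing later. Instead, we are really only interested in the Adj Close column: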

        df.rename(columns={'Adj Close': ticker}, inplace=True)
        # Pass axis by keyword; recent pandas versions reject it as a positional argument.
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)
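
The rename-then-drop step can be checked on a small made-up frame. In this sketch the ticker XYZ and all values are hypothetical, and axis is passed by keyword because newer pandas releases no longer accept it positionally in drop.

```python
import pandas as pd

ticker = 'XYZ'  # hypothetical ticker, for illustration only

df = pd.DataFrame(
    {
        'Open': [10.0, 10.2],
        'High': [10.5, 10.8],
        'Low': [9.8, 10.1],
        'Close': [10.2, 10.6],
        'Volume': [1000, 1200],
        'Adj Close': [10.1, 10.5],
    },
    index=pd.Index(['2010-01-04', '2010-01-05'], name='Date'),
)

# Rename the adjusted close to the ticker, then drop everything else.
df.rename(columns={'Adj Close': ticker}, inplace=True)
df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)

print(list(df.columns))  # ['XYZ']
```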

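Now we have just that column (or perhaps extras like the ones above, but remember that in this example we are not computing HL_pct_diff or daily_pct_chng). Notice that we have renamed the Adj Close column to whatever the ticker name is. Let's begin building the shared DataFrame: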

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

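If there is nothing in main_df yet, we start with the current df; otherwise, we use Pandas' join. The outer join keeps every date that appears in either frame and fills the gaps with NaN.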
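
The effect of how='outer' is easiest to see when two tickers cover different dates. This is a minimal sketch with two made-up single-column frames; the names AAA and BBB and all values are hypothetical.

```python
import pandas as pd

# Two hypothetical tickers whose date ranges only partly overlap.
aaa = pd.DataFrame(
    {'AAA': [1.0, 2.0, 3.0]},
    index=pd.Index(['2010-01-04', '2010-01-05', '2010-01-06'], name='Date'),
)
bbb = pd.DataFrame(
    {'BBB': [20.0, 30.0, 40.0]},
    index=pd.Index(['2010-01-05', '2010-01-06', '2010-01-07'], name='Date'),
)

main_df = pd.DataFrame()
for df in (aaa, bbb):
    if main_df.empty:
        # First frame: nothing to join against yet.
        main_df = df
    else:
        # Outer join keeps the union of dates; missing values become NaN.
        main_df = main_df.join(df, how='outer')

print(main_df)
```

The result has four rows: AAA is NaN on 2010-01-07 and BBB is NaN on 2010-01-04. With how='inner', only the two shared dates would remain.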

Still inside this for loop, we add two more lines:

        if count % 10 == 0:
            print(count)

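This prints the count of the current ticker only if it is evenly divisible by 10. The expression count % 10 gives the remainder when count is divided by 10, so count % 10 == 0 is True only when that remainder is 0, that is, when count is evenly divisible by 10.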
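
Here is a small standalone sketch of the same progress check, using a made-up list of 25 placeholder ticker names in place of the real list.

```python
# Hypothetical ticker names T0 ... T24, used only to drive the loop.
tickers = ['T{}'.format(i) for i in range(25)]

printed = []
for count, ticker in enumerate(tickers):
    # enumerate yields (0, 'T0'), (1, 'T1'), ...; report every 10th item.
    if count % 10 == 0:
        printed.append(count)
        print(count, ticker)

print(printed)  # [0, 10, 20]
```

Because enumerate starts at 0, the very first ticker is always reported.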

When we are all done with the for loop:

    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')

The function up to this point, followed by the call to it:

def compile_data():
    with open("sp500tickers.pickle","rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count,ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close':ticker}, inplace=True)
        df.drop(['Open','High','Low','Close','Volume'], axis=1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


compile_data()      
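
As an aside, the same kind of table can also be built in one call with pd.concat instead of joining inside the loop. This is a swapped-in alternative, not what the tutorial does, and the tickers and values below are made up. It is a sketch assuming each ticker's adjusted close has already been loaded as a Series indexed by date.

```python
import pandas as pd

# Hypothetical per-ticker adjusted-close series keyed by ticker name.
closes = {
    'AAA': pd.Series([1.0, 2.0], index=['2010-01-04', '2010-01-05']),
    'BBB': pd.Series([20.0, 30.0], index=['2010-01-05', '2010-01-06']),
}

# axis=1 places each series in its own column (named after its dict key);
# join='outer' keeps the union of dates, just like join(..., how='outer').
combined = pd.concat(closes, axis=1, join='outer')
combined.index.name = 'Date'
combined = combined.sort_index()

print(combined)
```

Collecting the pieces first and concatenating once avoids rebuilding main_df on every iteration, which may matter with around 500 tickers.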

The full code so far:

import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close': ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


compile_data()      

In the next tutorial, we will try to see whether we can quickly find any relationships in the data.