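2 NumPy
- NumPy: an efficient numerical computing tool
- Advantages of NumPy
- ndarray attributes
- Basic operations
- ndarray.method()
- numpy.function_name()
- ndarray operations
- Logical operations
- Statistical operations
- Operations between arrays
- Merging, splitting, I/O, and data handling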
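2.1 Advantages of NumPy
Learning objectives
- Objectives
- Understand NumPy's speed advantage
- Know how NumPy lays out an array in memory
- Know about NumPy's vectorized (parallel-style) computation
- Applications
- Foundation library for machine learning and deep learning frameworks
- Contents
- 2.1.1 Introduction to NumPy
- 2.1.2 Introduction to ndarray
- 2.1.3 Performance comparison: ndarray vs. the native Python list
- 2.1.4 Advantages of ndarray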
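2.1.1 Introduction to NumPy: an open-source numerical computing library
- num - numerical
- py - python
- ndarray
- n - any number of
- d - dimension
- array - array
NumPy (Numerical Python) is an open-source scientific computing library for fast processing of arrays of any dimension.
NumPy supports the common array and matrix operations. For the same numerical task, NumPy code is much more concise than plain Python.
NumPy handles multidimensional arrays through the ndarray object, a fast and flexible container for large data sets.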
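2.1.2 Introduction to ndarray
NumPy provides an n-dimensional array type, ndarray, which describes a collection of "items" of the same type.
Storing data in an ndarray:
import numpy as np
# Create an ndarray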
score = np.array([[85, 69, 83, 76, 93],
[76, 84, 61, 69, 81],
[85, 68, 74, 69, 60],
[92, 98, 68, 100, 64],
[60, 67, 73, 92, 82],
[72, 61, 72, 80, 79],
[88, 91, 62, 95, 80],
[89, 71, 63, 94, 66]])
score
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
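2.1.3 Performance comparison: ndarray vs. the native Python list
import random, time
import numpy as np
# Build a list of 100 million random floats (this needs several GB of memory)
a = []
for i in range(100000000):
    a.append(random.random())
# Time the built-in sum() on the list
t1 = time.time()
sum1 = sum(a)
t2 = time.time()
# Convert to an ndarray, then time np.sum() on it
b = np.array(a)
t4 = time.time()
sum3 = np.sum(b)
t5 = time.time()
print('list sum time: {}, ndarray sum time: {}'.format((t2 - t1), (t5 - t4)))
list sum time: 0.6531758308410645, ndarray sum time: 0.17858529090881348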
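2.1.4 Advantages of ndarray
- Storage layout
- ndarray - all elements have the same type and are stored in one contiguous block of memory, so they can be addressed directly; less general-purpose
- list - elements may have different types; the list stores references to objects scattered across memory; more general-purpose
- Parallel computation
- ndarray supports vectorized operations: a single call processes the whole array in compiled code
- Underlying language
- NumPy's core is written in C, and many of its internal loops release the GIL, so array operations are not held back by the Python interpreter and run far faster than the equivalent pure-Python code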
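The storage difference described above can be inspected directly. The following is a minimal sketch, assuming a small illustrative int32 array (the values are not from the original text); it shows that the elements sit in one contiguous block with a fixed size per element.

```python
import numpy as np

# A small 2-D array with an explicit element type (values are illustrative).
arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int32)

print(arr.flags['C_CONTIGUOUS'])  # True: one contiguous, row-major block
print(arr.itemsize)               # 4 bytes per int32 element
print(arr.strides)                # (12, 4): bytes to step to the next row / next column
print(arr.nbytes)                 # 24 = 6 elements * 4 bytes

# A list, by contrast, holds references to separate Python objects of any type.
mixed = [1, "two", 3.0]
print([type(x).__name__ for x in mixed])  # ['int', 'str', 'float']
```

Because every element has the same size, NumPy can locate any element with simple arithmetic on the strides instead of following a reference per element.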
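2.2 Getting to know the N-dimensional array: ndarray attributes
Learning objectives
- Objectives
- Explain an array's attributes, shape, and type
- Applications
- Contents
- 2.2.1 ndarray attributes
- 2.2.2 ndarray shape
- 2.2.3 ndarray types
- 2.2.4 Summary
2.2.1 ndarray attributes
- shape
- ndim
- size
- dtype (element type)
- itemsize: the size of a single element, in bytes
Default data types when creating an ndarray:
- Integers: int32 in the environment used here (the default integer type is platform-dependent; it is int64 on most 64-bit Linux and macOS installations)
- Floating point: float64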
score
score.shape
score.ndim
score.size
score.dtype
score.itemsize
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
(8, 5)
2
40
dtype('int32')
4
2.2.2 ndarray shape
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3, 4])
c = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
a
a.shape
array([[1, 2, 3],
[4, 5, 6]])
(2, 3)
b
b.shape
array([1, 2, 3, 4])
(4,)
c
c.shape
array([[[1, 2, 3],
[4, 5, 6]],
[[1, 2, 3],
[4, 5, 6]]])
(2, 2, 3)
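2.2.3 ndarray types
Type | Type code | Description |
---|---|---|
int8, uint8 | i1, u1 | Signed and unsigned 8-bit (1-byte) integers |
int16, uint16 | i2, u2 | Signed and unsigned 16-bit (2-byte) integers |
int32, uint32 | i4, u4 | Signed and unsigned 32-bit (4-byte) integers |
int64, uint64 | i8, u8 | Signed and unsigned 64-bit (8-byte) integers |
float16 | f2 | Half-precision floating point |
float32 | f4 or f | Standard single-precision floating point; compatible with C float |
float64 | f8 or d | Standard double-precision floating point; compatible with C double and the Python float object |
float128 | f16 or g | Extended-precision floating point (availability is platform-dependent) |
complex64, complex128, complex256 | c8, c16, c32 | Complex numbers represented by two 32-bit, 64-bit, or 128-bit floats, respectively |
bool | ? | Boolean type storing True and False |
object | O | Python object type |
string_ (bytes_ in NumPy 2.0 and later) | S | Fixed-length byte string (1 byte per character). For example, use S10 to create a string of length 10 |
unicode_ (str_ in NumPy 2.0 and later) | U | Fixed-length Unicode string (4 bytes per character in NumPy). Defined the same way as byte strings (e.g. U10) |
Specifying the type when creating an array
a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float64) # dtype="float64" also works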
a
a.dtype
array([[1., 2., 3.],
[4., 5., 6.]])
dtype('float64')
arr = np.array(['python', 'tensorflow', 'scikit-learn', 'numpy'], dtype=np.bytes_) # dtype="S" also works; the older alias np.string_ was removed in NumPy 2.0
arr
arr.dtype
array([b'python', b'tensorflow', b'scikit-learn', b'numpy'], dtype='|S12')
dtype('S12')
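2.3 Basic operations
- ndarray.method()
- np.function_name()
2.3.1 Ways to generate arrays
- Generating arrays of 0s and 1s
- Generating from an existing array
- Generating arrays over a fixed range
- Generating random numbers
1. Generating arrays of 0s and 1s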
- empty()
- empty(shape[, dtype, order])
- empty_like(a[, dtype, order, subok])
- eye(N[, M, k, dtype, order])
- identity(n[, dtype])
- ones(shape[, dtype, order])
- ones_like(a[, dtype, order, subok])
- zeros()
- zeros(shape[, dtype, order])
- zeros_like(a[, dtype, order, subok])
- full()
- full(shape, fill_value[, dtype, order])
- full_like(a, fill_value[, dtype, order, subok])
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]], dtype=float32)
array([[1, 1, 1],
[1, 1, 1]], dtype=int64)
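The two outputs above look like the results of np.zeros and np.ones, but the calls themselves are not shown. The sketch below assumes shapes and dtypes that match those outputs, and adds a few of the other listed constructors for comparison; the fill value 7 is purely illustrative.

```python
import numpy as np

zeros = np.zeros(shape=(3, 4), dtype="float32")   # 3x4 array of 0.0
ones = np.ones(shape=(2, 3), dtype=np.int64)      # 2x3 array of 1
eye = np.eye(3)                                   # 3x3 identity matrix
full = np.full((2, 2), 7)                         # 2x2 array filled with 7
zeros_like = np.zeros_like(ones)                  # same shape and dtype as `ones`

print(zeros.shape, zeros.dtype)
print(ones.shape, ones.dtype)
print(zeros_like.shape, zeros_like.dtype)
```

np.empty() is used the same way, but it leaves the memory uninitialized, so its contents should be overwritten before being read.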
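2. Generating from an existing array
- np.array(): copies the data (deep copy)
- np.copy(): copies the data (deep copy)
- np.asarray(): does not copy when the input is already an ndarray; the result refers to the same data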
score
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
# np.array()
data1 = np.array(score)
data1
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
# np.asarray()
data2 = np.asarray(score)
data2
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
# np.copy()
data3 = np.copy(score)
data3
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
score[3, 1] = 1000
score
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 1000, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
data1
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
data2
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 1000, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
data3
array([[ 85, 69, 83, 76, 93],
[ 76, 84, 61, 69, 81],
[ 85, 68, 74, 69, 60],
[ 92, 98, 68, 100, 64],
[ 60, 67, 73, 92, 82],
[ 72, 61, 72, 80, 79],
[ 88, 91, 62, 95, 80],
[ 89, 71, 63, 94, 66]])
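The behaviour seen above (only data2 follows the change to score) can be checked without printing whole arrays. This is a small sketch with an illustrative 2x2 array, assuming the input is already an ndarray and no dtype conversion is requested.

```python
import numpy as np

src = np.array([[1, 2], [3, 4]])

by_array = np.array(src)      # copies the data
by_asarray = np.asarray(src)  # no copy: returns the same ndarray
by_copy = np.copy(src)        # copies the data

src[0, 0] = 99                # modify the source after the three calls

print(by_array[0, 0], by_asarray[0, 0], by_copy[0, 0])  # 1 99 1
print(by_asarray is src)                                # True
print(np.shares_memory(src, by_array))                  # False
print(np.shares_memory(src, by_copy))                   # False
```

np.shares_memory() is a convenient way to tell whether two arrays are backed by the same buffer.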
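3. Generating arrays over a fixed range
- np.linspace(0, 10, 100): generates 100 evenly spaced numbers from 0 to 10
- Unlike range(), the interval is closed: the endpoint is included by default
- np.arange()
- range(1, 100, 5): an iterable from 1 up to (but not including) 100 in steps of 5; np.arange() is used in the same way
- range() and np.arange() produce a half-open interval: closed on the left, open on the right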
array([ 0. , 0.52631579, 1.05263158, 1.57894737, 2.10526316,
2.63157895, 3.15789474, 3.68421053, 4.21052632, 4.73684211,
5.26315789, 5.78947368, 6.31578947, 6.84210526, 7.36842105,
7.89473684, 8.42105263, 8.94736842, 9.47368421, 10. ])
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
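The calls that produced the two outputs above are not shown. A plausible reconstruction, assuming 20 points for linspace (the first output has 20 values ending at 10) and a stop of 10 for arange, is sketched below together with the endpoint behaviour described in the list.

```python
import numpy as np

closed = np.linspace(0, 10, 20)                     # 20 points, both 0 and 10 included
half_open = np.linspace(0, 10, 5, endpoint=False)   # endpoint excluded on request
ints = np.arange(0, 10)                             # 0..9, stop value excluded
stepped = np.arange(1, 100, 5)                      # same values as range(1, 100, 5)

print(closed[0], closed[-1], len(closed))  # 0.0 10.0 20
print(half_open)                           # [0. 2. 4. 6. 8.]
print(ints)                                # [0 1 2 3 4 5 6 7 8 9]
print(stepped[-1], len(stepped))           # 96 20
```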
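4. Generating random arrays
- The np.random module
- Uniform distribution
- np.random.rand(n): returns n uniformly distributed numbers in [0.0, 1.0)
- np.random.uniform(low=0.0, high=1.0, size=None)
- Notes:
- The interval is half-open: [low, high)
- size: the output shape, an int or a tuple. For example, size=(m, n, k) draws m * n * k samples (m, n, and k are the lengths of the dimensions); by default a single value is returned
- Return value: an ndarray whose shape matches size (a single float when size is None)
- Normal distribution (N(μ, σ))
Uniform distribution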
array([0.63344521, 0.44044366, 0.51506874])
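The sampling functions listed above can be tried on a small scale first. This is a minimal sketch; the seed, shapes, and sample counts are arbitrary choices for illustration, and the loc and scale values simply mirror the normal-distribution example used later in this section.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so that reruns give the same numbers

r = np.random.rand(3)                                   # 3 values in [0.0, 1.0)
u = np.random.uniform(low=-1, high=1, size=(2, 3))      # shape follows `size`
n = np.random.normal(loc=1.75, scale=0.1, size=100000)  # mean 1.75, std 0.1

print(r.shape, u.shape, n.shape)
print(bool(((u >= -1) & (u < 1)).all()))  # every sample lies in [low, high)
print(abs(n.mean() - 1.75) < 0.01)        # the sample mean is close to loc
```

With 100000 samples the sample mean and standard deviation land very close to loc and scale, which is what the histograms in this section show graphically.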
data1 = np.random.uniform(low=-1, high=1, size=(100000))
data1
array([ 0.52165485, -0.37857184, 0.85434807, ..., 0.68128425,
0.41514954, 0.97107572])
# Plot a histogram to verify
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8), dpi=80)
plt.hist(data1, 1000)
plt.show()
Normal distribution
data2 = np.random.normal(loc=1.75, scale=0.1, size=1000000)
data2
array([1.84143464, 1.73375347, 1.73293659, ..., 1.74933248, 1.80482693,
1.78291665])
# Plot a histogram to verify
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 8), dpi=80)
plt.hist(data2, 1000)
plt.show()
Slicing, indexing, and reshaping
Example: randomly generate the daily price changes of 8 stocks over 2 weeks of trading days
import numpy as np
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
stock_change
array([[ 0.19074222, 0.60043829, 0.80867868, 1.64073086, -0.97300847,
-0.20084744, 0.60241837, 0.44136119, 0.58028258, -0.46935001],
[-0.68734427, 0.85738838, 1.91338844, 0.58939441, 0.20555763,
-1.46895101, -0.00352442, -1.86573645, 0.94978016, -0.07536797],
[ 0.55409794, -0.76569532, -1.07678287, 0.91303802, 0.45830133,
0.41399899, 0.07469296, -0.47342359, 1.35352344, 0.37089442],
[-1.39658106, -0.4144919 , 0.72383645, 0.45637567, -0.65019515,
1.19320966, 1.24901 , -0.15086696, 0.68574793, -0.27589652],
[-0.10789621, -0.60397001, -1.26983449, 0.22412235, 0.29800482,
-1.56288488, 0.73505373, 0.88072784, -0.93668026, -0.24488789],
[-0.83122852, 0.88981107, -0.09342388, 1.45157522, -0.61855113,
-0.24583226, 1.43576482, -1.23514744, 0.48018713, -1.61807954],
[ 0.10005172, -1.27765932, -0.29108339, -0.40146452, -0.9513938 ,
-0.47696161, -0.46654499, 0.2585099 , 1.04241142, -0.75316624],
[ 0.33955043, -0.07898703, -1.32527034, 1.81189898, 1.05193552,
-0.94289232, 0.11584785, -0.58944079, 0.05561722, 0.45423719]])
2.3.2 Array indexing and slicing (ndarray indices start at 0)
# Get the price changes of the first stock over the first three trading days
stock_change[0, :3]
array([0.19074222, 0.60043829, 0.80867868])
How is a three-dimensional array indexed?
a1 = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])
a1
a1.shape
array([[[ 1, 2, 3],
[ 4, 5, 6]],
[[12, 3, 34],
[ 5, 6, 7]]])
(2, 2, 3)
34
a1[1, 0, 2] = 111
a1[1, 0, 2]
111
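Indexing a three-dimensional array takes one index (or slice) per dimension, in the order given by shape. The sketch below reuses the a1 array from above and adds a few slices that are not in the original text.

```python
import numpy as np

a1 = np.array([[[1, 2, 3], [4, 5, 6]], [[12, 3, 34], [5, 6, 7]]])  # shape (2, 2, 3)

single = a1[1, 0, 2]      # block 1, row 0, column 2
block = a1[0]             # the whole first 2x3 block
second_rows = a1[:, 1, :] # row 1 of every block
first_cols = a1[..., 0]   # column 0 of every row in every block

print(single)       # 34
print(block)        # [[1 2 3] [4 5 6]]
print(second_rows)  # [[4 5 6] [5 6 7]]
print(first_cols)   # [[ 1  4] [12  5]]
```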
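2.3.3 Changing shape
- ndarray.reshape(shape): returns a new ndarray object (a view of the same data where possible); the original array's shape is unchanged
- ndarray.resize(): returns nothing; modifies the original array in place
- ndarray.T: transpose; returns a new ndarray object (a view of the same data); the original array's shape is unchanged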
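reshape and T can produce the same shape but not the same arrangement of values: reshape keeps the elements in their original order and only regroups them, while T actually swaps rows and columns. A minimal sketch with an illustrative 2x3 array:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # [[0 1 2] [3 4 5]]

r = a.reshape(3, 2)              # regroups in order: [[0 1] [2 3] [4 5]]
t = a.T                          # swaps axes:        [[0 3] [1 4] [2 5]]

print(r.shape == t.shape)        # True: both are (3, 2)
print(np.array_equal(r, t))      # False: the contents differ
print(a.shape)                   # (2, 3): the original is unchanged
print(np.shares_memory(a, r))    # True: reshape returned a view here

b = np.arange(6)
result = b.resize((2, 3))        # in place; there is no return value
print(result, b.shape)           # None (2, 3)
```

So to turn "stocks as rows, dates as columns" into "dates as rows, stocks as columns", T is the operation that preserves which value belongs to which stock and date.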
# Goal: swap the stock rows and date columns so that dates become rows and stocks become columns
stock_change
array([[ 0.4757998 , 0.98262317, 0.0903228 , -2.18277494, 0.03458714,
0.27945935, -0.2996386 , 0.55731825, 2.55512752, 0.22270481],
[-0.62812039, -1.85224004, -0.03066103, 1.01028685, 0.07356365,
-0.59625804, 1.78864657, -1.02592844, -0.83059086, -0.51519111],
[ 1.59009598, -0.3561372 , -0.13415047, -0.87131372, 2.63536456,
-0.07324175, 0.11148286, -0.78896717, 0.36006041, -0.32652921],
[-3.06937049, -0.28945156, -1.31411983, 0.27797394, 0.02249254,
0.68111066, 0.19071901, 0.41827306, -0.23168617, 0.02644996],
[-0.57244906, -0.14880773, 0.05463552, 0.05554172, 0.49012011,
-0.97979408, -0.43437754, -1.16343025, -0.22472479, 2.58279491],
[ 0.25552108, 0.28857986, 0.60867223, -0.35784135, -1.32022102,
0.56756162, -1.60034249, 0.81897864, 0.90891023, 1.0347659 ],
[ 1.54397322, 2.43073741, -0.03775417, 1.01840352, -0.20048821,
-0.26870417, -0.02624917, 2.31371289, -0.03578409, -1.6612304 ],
[-1.98659864, -0.55809336, 1.69451479, 0.27399223, -0.15878832,
-0.27930802, -1.5313515 , 1.57077599, -0.3062486 , 0.33964472]])
stock_change.shape
(8, 10)
array([[ 0.4757998 , 0.98262317, 0.0903228 , -2.18277494, 0.03458714,
0.27945935, -0.2996386 , 0.55731825],
[ 2.55512752, 0.22270481, -0.62812039, -1.85224004, -0.03066103,
1.01028685, 0.07356365, -0.59625804],
[ 1.78864657, -1.02592844, -0.83059086, -0.51519111, 1.59009598,
-0.3561372 , -0.13415047, -0.87131372],
[ 2.63536456, -0.07324175, 0.11148286, -0.78896717, 0.36006041,
-0.32652921, -3.06937049, -0.28945156],
[-1.31411983, 0.27797394, 0.02249254, 0.68111066, 0.19071901,
0.41827306, -0.23168617, 0.02644996],
[-0.57244906, -0.14880773, 0.05463552, 0.05554172, 0.49012011,
-0.97979408, -0.43437754, -1.16343025],
[-0.22472479, 2.58279491, 0.25552108, 0.28857986, 0.60867223,
-0.35784135, -1.32022102, 0.56756162],
[-1.60034249, 0.81897864, 0.90891023, 1.0347659 , 1.54397322,
2.43073741, -0.03775417, 1.01840352],
[-0.20048821, -0.26870417, -0.02624917, 2.31371289, -0.03578409,
-1.6612304 , -1.98659864, -0.55809336],
[ 1.69451479, 0.27399223, -0.15878832, -0.27930802, -1.5313515 ,
1.57077599, -0.3062486 , 0.33964472]])
stock_change.shape
(8, 10)
stock_change.resize((10, 8))
stock_change.shape
(10, 8)
stock_change
array([[ 0.4757998 , 0.98262317, 0.0903228 , -2.18277494, 0.03458714,
0.27945935, -0.2996386 , 0.55731825],
[ 2.55512752, 0.22270481, -0.62812039, -1.85224004, -0.03066103,
1.01028685, 0.07356365, -0.59625804],
[ 1.78864657, -1.02592844, -0.83059086, -0.51519111, 1.59009598,
-0.3561372 , -0.13415047, -0.87131372],
[ 2.63536456, -0.07324175, 0.11148286, -0.78896717, 0.36006041,
-0.32652921, -3.06937049, -0.28945156],
[-1.31411983, 0.27797394, 0.02249254, 0.68111066, 0.19071901,
0.41827306, -0.23168617, 0.02644996],
[-0.57244906, -0.14880773, 0.05463552, 0.05554172, 0.49012011,
-0.97979408, -0.43437754, -1.16343025],
[-0.22472479, 2.58279491, 0.25552108, 0.28857986, 0.60867223,
-0.35784135, -1.32022102, 0.56756162],
[-1.60034249, 0.81897864, 0.90891023, 1.0347659 , 1.54397322,
2.43073741, -0.03775417, 1.01840352],
[-0.20048821, -0.26870417, -0.02624917, 2.31371289, -0.03578409,
-1.6612304 , -1.98659864, -0.55809336],
[ 1.69451479, 0.27399223, -0.15878832, -0.27930802, -1.5313515 ,
1.57077599, -0.3062486 , 0.33964472]])
stock_change.resize((8, 10))
stock_change.shape
stock_change
(8, 10)
array([[ 0.4757998 , 0.98262317, 0.0903228 , -2.18277494, 0.03458714,
0.27945935, -0.2996386 , 0.55731825, 2.55512752, 0.22270481],
[-0.62812039, -1.85224004, -0.03066103, 1.01028685, 0.07356365,
-0.59625804, 1.78864657, -1.02592844, -0.83059086, -0.51519111],
[ 1.59009598, -0.3561372 , -0.13415047, -0.87131372, 2.63536456,
-0.07324175, 0.11148286, -0.78896717, 0.36006041, -0.32652921],
[-3.06937049, -0.28945156, -1.31411983, 0.27797394, 0.02249254,
0.68111066, 0.19071901, 0.41827306, -0.23168617, 0.02644996],
[-0.57244906, -0.14880773, 0.05463552, 0.05554172, 0.49012011,
-0.97979408, -0.43437754, -1.16343025, -0.22472479, 2.58279491],
[ 0.25552108, 0.28857986, 0.60867223, -0.35784135, -1.32022102,
0.56756162, -1.60034249, 0.81897864, 0.90891023, 1.0347659 ],
[ 1.54397322, 2.43073741, -0.03775417, 1.01840352, -0.20048821,
-0.26870417, -0.02624917, 2.31371289, -0.03578409, -1.6612304 ],
[-1.98659864, -0.55809336, 1.69451479, 0.27399223, -0.15878832,
-0.27930802, -1.5313515 , 1.57077599, -0.3062486 , 0.33964472]])
stock_change.T
array([[ 0.4757998 , -0.62812039, 1.59009598, -3.06937049, -0.57244906,
0.25552108, 1.54397322, -1.98659864],
[ 0.98262317, -1.85224004, -0.3561372 , -0.28945156, -0.14880773,
0.28857986, 2.43073741, -0.55809336],
[ 0.0903228 , -0.03066103, -0.13415047, -1.31411983, 0.05463552,
0.60867223, -0.03775417, 1.69451479],
[-2.18277494, 1.01028685, -0.87131372, 0.27797394, 0.05554172,
-0.35784135, 1.01840352, 0.27399223],
[ 0.03458714, 0.07356365, 2.63536456, 0.02249254, 0.49012011,
-1.32022102, -0.20048821, -0.15878832],
[ 0.27945935, -0.59625804, -0.07324175, 0.68111066, -0.97979408,
0.56756162, -0.26870417, -0.27930802],
[-0.2996386 , 1.78864657, 0.11148286, 0.19071901, -0.43437754,
-1.60034249, -0.02624917, -1.5313515 ],
[ 0.55731825, -1.02592844, -0.78896717, 0.41827306, -1.16343025,
0.81897864, 2.31371289, 1.57077599],
[ 2.55512752, -0.83059086, 0.36006041, -0.23168617, -0.22472479,
0.90891023, -0.03578409, -0.3062486 ],
[ 0.22270481, -0.51519111, -0.32652921, 0.02644996, 2.58279491,
1.0347659 , -1.6612304 , 0.33964472]])
stock_change.T.shape
(10, 8)
2.3.4 Changing type
- ndarray.astype("type")
- Serializing an ndarray (converting it to bytes)
- ndarray.tostring() (deprecated; ndarray.tobytes() returns the same bytes)
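As a small illustration of both operations, the sketch below converts an illustrative float array to integers and round-trips it through bytes. It uses tobytes(), and np.frombuffer() for the reverse direction; the latter is an addition here, not something the original text covers.

```python
import numpy as np

x = np.array([[0.4, -1.7], [2.9, -0.2]])   # float64, illustrative values

as_int = x.astype("int32")   # float -> int truncates toward zero
print(as_int)                # [[ 0 -1] [ 2  0]]
print(x.dtype)               # float64: astype returned a new array

raw = x.tobytes()            # raw bytes of the buffer; shape and dtype are not stored
print(type(raw), len(raw))   # <class 'bytes'> 32  (4 elements * 8 bytes)

# To restore the array, the dtype and shape must be supplied again.
restored = np.frombuffer(raw, dtype=x.dtype).reshape(x.shape)
print(np.array_equal(restored, x))  # True
```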
array([[ 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[ 0, 0, 1, 0, 0, -1, 0, -1, 0, 0],
[ 0, 0, -1, 0, 0, 0, 0, 0, 1, 0],
[-1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[ 0, 0, -1, 0, 0, -1, 0, 0, 0, 0],
[ 0, 0, 0, 1, 0, 0, 1, -1, 0, -1],
[ 0, -1, 0, 0, 0, 0, 0, 0, 1, 0],
[ 0, 0, -1, 1, 1, 0, 0, 0, 0, 0]], dtype=int64)
numpy.ndarray
# Serialize to bytes
stock_change.tobytes()  # tostring() is the older, deprecated name for this call
b'm\xe2\xc3\xb6=j\xc8?\xce\xf3\xf4Z\xca6\xe3?...' (output truncated)
2.3.5 Removing duplicates
- set(): a set contains no duplicate items; set() can only handle one-dimensional data
- np.unique()
temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
temp
array([[1, 2, 3, 4],
[3, 4, 5, 6]])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-15-b75e6adf7355> in <module>
----> 1 set(temp) # the rows of a 2-D array are ndarrays, which are not hashable!
TypeError: unhashable type: 'numpy.ndarray'
array([1, 2, 3, 4, 5, 6])
# Flatten the 2-D array temp to one dimension
temp.flatten()
array([1, 2, 3, 4, 3, 4, 5, 6])
# Deduplicate temp with set(), which produces a set
set(temp.flatten())
{1, 2, 3, 4, 5, 6}
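np.unique() works on arrays of any dimension and returns a sorted one-dimensional ndarray. The sketch below reuses temp from above; the return_counts option is an extra shown for illustration.

```python
import numpy as np

temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])

values = np.unique(temp)   # flattens, sorts, and removes duplicates
print(values)              # [1 2 3 4 5 6]

# Optionally also report how often each value occurs.
vals, counts = np.unique(temp, return_counts=True)
print(counts)              # [1 1 2 2 1 1]
```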
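2.3.6 Summary
- Creating arrays
- Uniform
- Random (normal distribution)
- Array indexing
- Changing array shape
- reshape()
- resize()
- Array types
- Array conversion
- T (transpose)
- tostring (serialize an array)
- unique (deduplicate)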
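2.4 ndarray operations
- Logical operations
- Boolean indexing
- General-purpose test functions
- np.all(): tests whether all elements satisfy the condition in the parentheses
- Returns False as soon as one element is False; returns True only when all are True
- np.any(): tests whether any element satisfies the condition in the parentheses
- Returns True as soon as one element is True; returns False only when all are False
- np.where(): a ternary operator, used to act on the elements that satisfy a condition
- np.where(condition, value where True, value where False)
- For compound conditions, use np.logical_and() and np.logical_or() (or the element-wise operators & and |)
- Statistical operations
- Operations between arrays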
2.4.1 Logical operations
- Operating on the data that satisfies a condition
import numpy as np
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
stock_change
array([[-3.21925655e-01, 2.00278648e+00, 5.71029655e-02,
1.34207945e+00, 4.84536098e-01, -1.43965967e+00,
4.95406564e-02, -9.71429614e-02, -7.59968374e-01,
9.05514273e-02],
[-1.26553830e+00, 2.83480830e-01, -1.27096652e+00,
-7.78617184e-02, 2.13893026e-02, 6.99181366e-01,
1.28778436e+00, -1.21318904e+00, 1.34335913e-01,
1.16881429e+00],
[ 2.75733840e+00, -5.19397391e-01, 5.17573162e-01,
-1.03617610e+00, -9.24933387e-01, -1.25727419e+00,
-1.86879247e+00, 1.25274113e+00, -6.42865073e-01,
-1.18522669e+00],
[ 6.45555564e-01, 4.66116901e-02, 9.14255949e-01,
-1.19547375e+00, -3.57279692e-01, -5.93218153e-01,
1.13542745e-01, -8.24030471e-01, 2.24255839e-01,
-1.51164946e+00],
[ 1.62126604e-01, -1.89741507e+00, 9.96692135e-01,
-6.67635633e-01, -5.55159320e-05, 1.19218271e+00,
-1.03884252e+00, 1.25879191e+00, 7.90378065e-01,
1.02574465e+00],
[-8.79940430e-01, -8.81164598e-01, -1.32232404e+00,
1.00266248e+00, -1.31175999e-01, -9.48896084e-03,
3.66261660e-01, -7.94653378e-01, 4.10770668e-01,
1.92790586e-01],
[-1.39018777e+00, 6.05357379e-01, -6.29581411e-01,
2.13056580e+00, -8.09247972e-01, -1.45850019e+00,
8.34844616e-01, 8.84528946e-01, -1.32502380e+00,
1.13360265e+00],
[ 1.38156266e-01, 4.82065621e-02, 1.34596475e+00,
3.35030264e-01, -9.91285791e-01, 7.76555121e-01,
-3.59506728e-01, 5.55275392e-01, -8.74342910e-01,
7.75673585e-02]])
# Logical test: True where the price change is greater than 0.5, otherwise False
stock_change > 0.5
array([[ True, False, False, False, False, False, False, True, False,
False],
[False, False, False, False, True, False, True, False, False,
False],
[ True, False, False, True, False, False, False, True, False,
True],
[False, False, False, False, True, True, False, True, False,
False],
[False, True, True, False, False, False, False, False, False,
True],
[False, False, True, False, False, False, True, False, False,
True],
[False, False, False, False, True, False, False, True, False,
False],
[False, False, False, False, False, False, True, True, False,
True]])
array([1.00960969, 1.57389795, 1.48681292, 0.60698373, 1.32107651,
0.61409015, 0.55504015, 0.75181457, 1.19916094, 1.60054449,
2.15595143, 0.81875222, 1.14852624, 3.00824751, 1.41247279,
0.50834144, 0.54149304, 1.24179072, 0.84150558, 1.3810949 ,
0.63093414, 1.45875392])
stock_change[stock_change > 0.5] = 1.1
stock_change
array([[ 1.1 , -0.84427887, 0.02493903, -1.25251943, -0.13526056,
-1.53708678, -1.75029022, 1.1 , -0.11252094, -1.10968466],
[-1.77263505, -1.10273485, -0.06427946, -0.47530352, 1.1 ,
0.39431243, 1.1 , -0.19107432, -0.30473289, -0.36641659],
[ 1.1 , -0.74977583, -1.86773114, 1.1 , -0.82844641,
-0.13917232, 0.39855819, 1.1 , 0.23328748, 1.1 ],
[ 0.13270216, 0.08033047, -0.09144296, -1.1299997 , 1.1 ,
1.1 , -1.14426555, 1.1 , -0.70135722, -1.56731264],
[-1.48992779, 1.1 , 1.1 , 0.27689299, 0.30363445,
-0.01249626, 0.37981243, -0.3862383 , -0.19437319, 1.1 ],
[-0.17573471, -0.69522921, 1.1 , 0.14881664, 0.24209382,
0.43842094, 1.1 , 0.19444138, -0.85873745, 1.1 ],
[ 0.11826444, -0.79209097, -0.22540633, -0.03265994, 1.1 ,
-0.53830251, -0.21617814, 1.1 , -0.18148514, -0.35653799],
[-0.15744953, -0.07925474, -1.31580327, -0.53460345, -0.76964669,
-1.49762656, 1.1 , 1.1 , 0.41973408, 1.1 ]])
2.4.2 General-purpose test functions
- np.all()
- np.any()
# Test whether every value in stock_change[0:2, 0:5] is a gain
stock_change[0:2, 0:5]
stock_change[0:2, 0:5] > 0
array([[ 1.1 , -0.84427887, 0.02493903, -1.25251943, -0.13526056],
[-1.77263505, -1.10273485, -0.06427946, -0.47530352, 1.1 ]])
array([[ True, False, True, False, False],
[False, False, False, False, True]])
False
# Test whether any of the first 5 stocks has a gain
stock_change[:5, :] > 0
np.any(stock_change[:5, :] > 0)
array([[ True, False, True, False, False, False, False, True, False,
False],
[False, False, False, False, True, True, True, False, False,
False],
[ True, False, False, True, False, False, True, True, True,
True],
[ True, True, False, False, True, True, False, True, False,
False],
[False, True, True, True, True, False, True, False, False,
True]])
True
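The same tests can be tried on a small array where the result is easy to verify by eye. This is a sketch with illustrative values; the axis argument is an addition that reports the result per row or per column instead of for the whole array.

```python
import numpy as np

change = np.array([[0.8, -0.2, 0.1],
                   [-1.3, -0.4, 0.6]])

all_up = np.all(change > 0)            # are all values gains?
any_up = np.any(change > 0)            # is at least one value a gain?
rows_all = np.all(change > 0, axis=1)  # one answer per row
cols_any = np.any(change > 0, axis=0)  # one answer per column
big = change[change > 0.5]             # boolean indexing: select the matching values

print(all_up, any_up)  # False True
print(rows_all)        # [False False]
print(cols_any)        # [ True False  True]
print(big)             # [0.8 0.6]
```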
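2.4.3 np.where(): the ternary operator
- np.where(condition, value where True, value where False)
# For the first four stocks over the first four days: set changes greater than 0 to 1, otherwise 0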
temp = stock_change[:4, :4]
temp
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172],
[ 2.7573384 , -0.51939739, 0.51757316, -1.0361761 ],
[ 0.64555556, 0.04661169, 0.91425595, -1.19547375]])
array([[1, 0, 1, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[1, 1, 0, 0]])
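Compound conditions
- Use np.logical_and() and np.logical_or()
# For the first four stocks over the first four days: set changes greater than 0.5 and less than 1 to 1, otherwise 0
# For the first four stocks over the first four days: set changes greater than 0.5 or less than -0.5 to 1, otherwise 0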
np.logical_and(temp > 0.5, temp < 1)
np.logical_or(temp > 0.5, temp < -0.5)
array([[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False]])
array([[ True, True, False, True],
[ True, True, False, False],
[ True, True, True, True],
[False, False, False, True]])
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0]])
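The np.where() calls for this subsection are not shown next to their outputs, so here is a minimal sketch of the three cases described in the comments above, using a small illustrative array instead of the random stock data.

```python
import numpy as np

temp = np.array([[-0.3, 2.0, 0.06],
                 [0.7, -0.6, 0.9]])

positive = np.where(temp > 0, 1, 0)
between = np.where(np.logical_and(temp > 0.5, temp < 1), 1, 0)
outside = np.where(np.logical_or(temp > 0.5, temp < -0.5), 1, 0)

# The element-wise operators & and | give the same result; the parentheses are required.
between_op = np.where((temp > 0.5) & (temp < 1), 1, 0)

print(positive)  # [[0 1 1] [1 0 1]]
print(between)   # [[0 0 0] [1 0 1]]
print(outside)   # [[0 1 0] [1 1 1]]
```

Python's own `and` / `or` cannot be used here because they expect a single truth value, not an array of them.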
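2.4.4 Statistical operations
- Statistical functions
- min, max, mean, median, var, std
- np.function_name(temp[, axis=])
- ndarray.method_name([axis=]) (median is available only as np.median())
- Locating the maximum and minimum
- np.argmax(temp, axis=)
- np.argmin(temp, axis=)
Example: statistics on stock price changes
Statistics for the first four stocks over the first four days
# Largest gain among the first four stocks over the first four days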
temp
temp.shape
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172],
[ 2.7573384 , -0.51939739, 0.51757316, -1.0361761 ],
[ 0.64555556, 0.04661169, 0.91425595, -1.19547375]])
(4, 4)
2.75733839736208
2.75733839736208
# Maximum of each row
temp.max(axis=1) # reduce along the second dimension (axis=1)
# axis=-1 (the last dimension) is equivalent
temp.max(axis=-1)
array([2.00278648, 0.28348083, 2.7573384 , 0.91425595])
array([2.00278648, 0.28348083, 2.7573384 , 0.91425595])
# Maximum of each column
np.max(temp, axis=0) # reduce along the first dimension (axis=0)
# axis=-2 (the second-to-last dimension) is equivalent
np.max(temp, axis=-2)
array([2.7573384 , 2.00278648, 0.91425595, 1.34207945])
array([2.7573384 , 2.00278648, 0.91425595, 1.34207945])
Locating the maximum and minimum
array([1, 1, 0, 2], dtype=int64)
array([1, 2, 1, 3], dtype=int64)
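The effect of axis is easiest to see on a small array. The following sketch uses illustrative values (not the stock data) and includes the argmax/argmin calls, whose input lines are missing above.

```python
import numpy as np

temp = np.array([[1.0, 5.0, 3.0],
                 [4.0, 2.0, 6.0]])

overall_max = temp.max()              # over all elements
row_max = temp.max(axis=1)            # one value per row
col_max = np.max(temp, axis=0)        # one value per column
col_median = np.median(temp, axis=0)  # median exists only as a function

row_argmax = np.argmax(temp, axis=1)  # column index of each row's maximum
col_argmin = np.argmin(temp, axis=0)  # row index of each column's minimum

print(overall_max)  # 6.0
print(row_max)      # [5. 6.]
print(col_max)      # [4. 5. 6.]
print(col_median)   # [2.5 3.5 4.5]
print(row_argmax)   # [1 2]
print(col_argmin)   # [0 1 0]
```

The axis named in the call is the one that disappears from the result: axis=1 collapses the columns and leaves one value per row.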
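2.5 Operations between arrays
- 2.5.1 Scenario
- 2.5.2 Array and scalar operations
- 2.5.3 Array and array operations
- 2.5.4 Broadcasting
- 2.5.5 Matrix operations
- What is a matrix
- Matrix multiplication
- Where matrices are used
2.5.1 Scenario
Data: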
[[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]]
2.5.2 Array and scalar operations
arr = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])
arr
array([[1, 2, 3, 2, 1, 4],
[5, 6, 1, 2, 3, 1]])
array([[11, 12, 13, 12, 11, 14],
[15, 16, 11, 12, 13, 11]])
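The second output above is consistent with arr + 10, although the call itself is not shown. A short sketch of scalar arithmetic on the same array, with a plain list added for contrast:

```python
import numpy as np

arr = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])

plus_ten = arr + 10   # the scalar is applied to every element
doubled = arr * 2     # element-wise multiplication
halved = arr / 2      # true division always yields floats

print(plus_ten[0])    # [11 12 13 12 11 14]
print(doubled[1])     # [10 12  2  4  6  2]
print(halved.dtype)   # float64

# For a Python list, * means repetition, not arithmetic.
repeated = [1, 2, 3] * 2
print(repeated)       # [1, 2, 3, 1, 2, 3]
```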
2.5.3 Array and array operations
arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])
arr2 = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
arr1 # (2, 6)
arr2 # (2, 4)
array([[1, 2, 3, 2, 1, 4],
[5, 6, 1, 2, 3, 1]])
array([[1, 2, 3, 4],
[3, 4, 5, 6]])
# Arrays whose shapes cannot be broadcast together cannot be combined
arr1 + arr2
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-39b93f7d8e72> in <module>
1 # Arrays whose shapes cannot be broadcast together cannot be combined
----> 2 arr1 + arr2
ValueError: operands could not be broadcast together with shapes (2,6) (2,4)
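2.5.4 Broadcasting
Broadcasting applies when two ndarrays take part in an element-wise operation; its purpose is to make arithmetic between ndarrays of different shapes convenient.
When operating on two arrays, NumPy compares their shapes (tuples) dimension by dimension, starting from the trailing dimension. The two arrays can be combined only if, for every pair of dimensions compared, one of the following holds:
- the two dimension lengths are equal
- one of the two dimension lengths is 1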
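The rule above can be written out as a few lines of code. The helper `can_broadcast` below is a hypothetical function written for this illustration, not part of NumPy; it compares shapes from the trailing dimension backward, and dimensions missing from the shorter shape are treated as compatible.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Hypothetical helper: apply the two broadcasting conditions pairwise."""
    for x, y in zip(reversed(shape_a), reversed(shape_b)):
        if x != y and x != 1 and y != 1:
            return False
    return True

print(can_broadcast((2, 6), (2, 1)))     # True:  6 vs 1, then 2 vs 2
print(can_broadcast((2, 6), (2, 4)))     # False: 6 vs 4 fails
print(can_broadcast((8, 1, 6), (7, 1)))  # True:  6 vs 1, then 1 vs 7

# NumPy agrees: the result takes the larger length along each dimension.
result = np.ones((8, 1, 6)) + np.ones((7, 1))
print(result.shape)                      # (8, 7, 6)
```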
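# Compared from the last dimension backward, these shapes satisfy the broadcasting rule; in the result, each dimension takes the larger of the two arrays' lengths along that dimension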
arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]])
arr2 = np.array([[1], [3]])
arr1 # (2, 6)
arr2 # (2, 1)
arr1 + arr2 # (2, 6)
arr1 / arr2 # (2, 6)
array([[1, 2, 3, 2, 1, 4],
[5, 6, 1, 2, 3, 1]])
array([[1],
[3]])
array([[2, 3, 4, 3, 2, 5],
[8, 9, 4, 5, 6, 4]])
array([[1. , 2. , 3. , 2. , 1. ,
4. ],
[1.66666667, 2. , 0.33333333, 0.66666667, 1. ,
0.33333333]])
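2.5.5 Matrix operations
- Two ways to store a matrix:
- a two-dimensional ndarray
- the matrix data structure
1. What is a matrix
- np.mat() (an alias of np.asmatrix(); the alias was removed in NumPy 2.0)
- Converts an array to a matrix
# Store the matrix in an ndarray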
data = np.array([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
data
# Store the matrix in a matrix object
data_mat = np.asmatrix(data)  # np.mat(data) in NumPy versions before 2.0
data_mat
array([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
matrix([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
numpy.matrix
2. Matrix multiplication
- Shapes
- (m, n) * (n, l) = (m, l)
- Operation rules
- Scalar multiplication
- Dot product
- Cross product
- API
- Matrices stored as ndarray:
- np.matmul()
- np.dot()
- Matrices stored as mat:
- mat1 * mat2
# View the score matrix
data_mat
data_mat.shape
matrix([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
(8, 2)
# Create the score weights
weights = np.array([[0.3], [0.7]])
weights_mat = np.asmatrix(weights)  # np.mat(weights) in NumPy versions before 2.0
weights_mat
weights_mat.shape
matrix([[0.3],
[0.7]])
(2, 1)
data
weights
array([[80, 86],
[82, 80],
[85, 78],
[90, 90],
[86, 82],
[82, 90],
[78, 80],
[92, 94]])
array([[0.3],
[0.7]])
# Matrix multiplication with arrays: np.matmul()
np.matmul(data, weights)
np.matmul(data, weights).shape
array([[84.2],
[80.6],
[80.1],
[90. ],
[83.2],
[87.6],
[79.4],
[93.4]])
(8, 1)
# Matrix multiplication with arrays: np.dot() (dot product)
np.dot(data, weights)
np.dot(data, weights).shape
array([[84.2],
[80.6],
[80.1],
[90. ],
[83.2],
[87.6],
[79.4],
[93.4]])
(8, 1)
# Matrix multiplication with arrays: the @ operator
data @ weights
(data @ weights).shape
array([[84.2],
[80.6],
[80.1],
[90. ],
[83.2],
[87.6],
[79.4],
[93.4]])
(8, 1)
# Matrix multiplication with matrix objects
data_mat * weights_mat
(data_mat * weights_mat).shape
matrix([[84.2],
[80.6],
[80.1],
[90. ],
[83.2],
[87.6],
[79.4],
[93.4]])
(8, 1)
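For two-dimensional ndarrays, np.matmul(), np.dot(), and @ give the same product, while * on ndarrays is element-wise. The sketch below uses the first two rows of the score data; the scalar case at the end is an extra illustration of one place where np.dot() and np.matmul() differ.

```python
import numpy as np

data = np.array([[80, 86], [82, 80]])   # first two rows of the score data
weights = np.array([[0.3], [0.7]])

product = np.matmul(data, weights)      # (2, 2) x (2, 1) -> (2, 1)
same_dot = np.dot(data, weights)
same_at = data @ weights
elementwise = data * weights.T          # broadcasting, not a matrix product

print(product)      # approximately [[84.2] [80.6]]
print(elementwise)  # approximately [[24.  60.2] [24.6 56. ]]

# np.dot accepts a scalar operand; np.matmul rejects it.
scaled = np.dot(data, 2)
try:
    np.matmul(data, 2)
    matmul_accepts_scalar = True
except ValueError:
    matmul_accepts_scalar = False
print(matmul_accepts_scalar)  # False
```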
2.6 Merging and splitting
2.6.1 Merging
- numpy.hstack(tup)----Stack arrays in sequence horizontally (column wise).
- numpy.vstack(tup)----Stack arrays in sequence vertically (row wise).
- numpy.concatenate((a1, a2, …), axis=0)
a = stock_change[:2, :4]
b = stock_change[4:6, :4]
a
b
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172]])
array([[ 0.1621266 , -1.89741507, 0.99669214, -0.66763563],
[-0.87994043, -0.8811646 , -1.32232404, 1.00266248]])
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945, 0.1621266 ,
-1.89741507, 0.99669214, -0.66763563],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172, -0.87994043,
-0.8811646 , -1.32232404, 1.00266248]])
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945, 0.1621266 ,
-1.89741507, 0.99669214, -0.66763563],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172, -0.87994043,
-0.8811646 , -1.32232404, 1.00266248]])
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172],
[ 0.1621266 , -1.89741507, 0.99669214, -0.66763563],
[-0.87994043, -0.8811646 , -1.32232404, 1.00266248]])
array([[-0.32192566, 2.00278648, 0.05710297, 1.34207945],
[-1.2655383 , 0.28348083, -1.27096652, -0.07786172],
[ 0.1621266 , -1.89741507, 0.99669214, -0.66763563],
[-0.87994043, -0.8811646 , -1.32232404, 1.00266248]])
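The four merged outputs above appear to correspond to hstack, concatenate with axis=1, vstack, and concatenate with axis=0, but the calls are not shown. A minimal sketch of that correspondence, using small illustrative arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

h = np.hstack((a, b))                  # side by side: shape (2, 4)
h_cat = np.concatenate((a, b), axis=1) # same result
v = np.vstack((a, b))                  # stacked on top: shape (4, 2)
v_cat = np.concatenate((a, b), axis=0) # same result

print(h)  # [[1 2 5 6] [3 4 7 8]]
print(v)  # [[1 2] [3 4] [5 6] [7 8]]
```

hstack requires the same number of rows in each input, and vstack the same number of columns.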
2.6.2 Splitting
- numpy.split(array, indices_or_sections, axis=0)----Split an array into multiple sub-arrays
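The second argument of numpy.split() has two meanings, depending on its type. A short sketch with an illustrative one-dimensional array:

```python
import numpy as np

x = np.arange(10)

# An integer: split into that many equal parts.
halves = np.split(x, 2)
print([part.tolist() for part in halves])  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# A list of indices: split before each listed index.
pieces = np.split(x, [3, 7])
print([part.tolist() for part in pieces])  # [[0, 1, 2], [3, 4, 5, 6], [7, 8, 9]]
```

With an integer argument the array length must be evenly divisible, otherwise a ValueError is raised.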
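2.7 I/O and data handling
- 2.7.1 Reading data with NumPy
- 2.7.2 Handling missing values
- What is a missing value
- How to handle missing values
- Two approaches:
- Delete the samples that contain missing values
- Replace or impute
2.7.1 Reading data with NumPy
# Read the data
data = np.genfromtxt('test.csv', delimiter=',') # delimiter specifies the field separator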
array([[ nan, nan, nan, nan],
[ 1. , 123. , 1.4, 23. ],
[ 2. , 110. , nan, 18. ],
[ 3. , nan, 2.1, 19. ]])
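The contents of test.csv are not shown. The sketch below assumes a file with a text header row and two empty fields, which would produce an array like the one above; the column names are made up for illustration, and an in-memory io.StringIO stands in for the file.

```python
import io
import numpy as np

# Assumed file contents: a header row plus three data rows, two fields left empty.
csv_text = "id,value_a,value_b,value_c\n1,123,1.4,23\n2,110,,18\n3,,2.1,19\n"

data = np.genfromtxt(io.StringIO(csv_text), delimiter=',')
print(data.shape)   # (4, 4)
print(data[0])      # [nan nan nan nan]: header text cannot be parsed as float
print(data[2, 2])   # nan: the empty field

# skip_header=1 drops the header row instead of turning it into nan.
body = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)
print(body.shape)   # (3, 4)
```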
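2.7.2 Handling missing values
- What is a missing value
- When a local file is read as float, any value that is missing (or is None) shows up as nan
- Handling missing values
- Simply replacing nan with 0: if the mean of the data was greater than 0 before the replacement, the mean becomes smaller afterwards
- More generally: replace the missing data with the mean (or median), or delete the rows that contain missing values (data cleaning)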
nan
numpy.float64
# Logic for filling nan with the column mean
def fill_nan_by_column_mean(t):
    for i in range(t.shape[1]):  # iterate over the second dimension, column by column
        now_col = t[:, i]  # a view of the current column
        # Count the nan entries (nan is not equal to itself)
        nan_num = np.count_nonzero(now_col != now_col)
        if nan_num > 0:  # the column contains nan
            # Sum of the non-nan entries
            now_col_not_nan = now_col[~np.isnan(now_col)].sum()
            # Mean of the non-nan entries
            now_col_mean = now_col_not_nan / (t.shape[0] - nan_num)
            # Assign the mean to the nan positions
            now_col[np.isnan(now_col)] = now_col_mean
            # Write the column back into t
            t[:, i] = now_col
    return t
data
array([[ nan, nan, nan, nan],
[ 1. , 123. , 1.4, 23. ],
[ 2. , 110. , nan, 18. ],
[ 3. , nan, 2.1, 19. ]])
array([[ 2. , 116.5 , 1.75, 20. ],
[ 1. , 123. , 1.4 , 23. ],
[ 2. , 110. , 1.75, 18. ],
[ 3. , 116.5 , 2.1 , 19. ]])
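The same fill can also be expressed without an explicit loop. The sketch below is an alternative to the function above, not the original author's method: it assumes np.nanmean() for the per-column mean and np.where() for the replacement, and it returns a new array instead of modifying the input.

```python
import numpy as np

data = np.array([[np.nan, np.nan, np.nan, np.nan],
                 [1.0, 123.0, 1.4, 23.0],
                 [2.0, 110.0, np.nan, 18.0],
                 [3.0, np.nan, 2.1, 19.0]])

# Mean of each column, ignoring nan entries.
col_mean = np.nanmean(data, axis=0)

# Where an entry is nan take the column mean (broadcast across rows), else keep the entry.
filled = np.where(np.isnan(data), col_mean, data)

print(col_mean)   # approximately [2., 116.5, 1.75, 20.]
print(filled[0])  # the all-nan header row becomes the column means
```

A column consisting only of nan would have no defined mean, so it would stay nan in this version as well.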