Pandas å ¥é¨
两个æ°æ®ç±»åï¼seriesåDataFrame
Series
1. å建Series对象
import pandas as pd
ser1 = pd.Series(data=[], index=[])
2. Series对象çå±æ§
- ç´¢å¼ï¼ser1.index() â> ç´¢å¼çå¼ï¼ ser1.index.values
- å¼ï¼ser1.values
- å¤ææ¯å¦åå¨ç©ºå¼ï¼ser1.hasnans â>è¿åå¸å°å¼
- å¤æç³»åä¸æ°æ®æ¯å¦ç¬ä¸æ äºï¼ ser1.is_unique â>è¿åå¸å°å¼
-
å¤ææ°æ®æ¯å¦åè°éå¢/éåï¼
ser1.is_monotonic_increasing / ser1.monotonic_decreasing
3. Series对象çæ¹æ³
-
è·åæè¿°æ§ç»è®¡ä¿¡æ¯ â>éä¸è¶å¿
æ±åï¼ser.sum()
å¹³åå¼ï¼ser.mean()
ä¸ä½æ°ï¼ser.median()
ä¼æ°ï¼ser.mode()
- è·åæè¿°æ§ç»è®¡ä¿¡æ¯ â> 离æ£è¶å¿
- æ大å¼/æå°å¼ï¼ ser.max() / ser.min()
- æ¹å·®ï¼ ser.var()
- æ åå·®ï¼ser.std()
- ååä½æ°ï¼ser.quantileï¼0.25/0.5/0.75ï¼
-
è·åå¤ä¸ªæè¿°æ§ä¿¡æ¯ï¼
ser.describeï¼ï¼
- å»é
- â>è¿åæ°ç»å¯¹è±¡ï¼ ser.unique()
- â>è¿åSeries对象ï¼ser.drop_duplicates( keep=firstï¼ inplace=False)
-
keepåæ°ï¼first --> ä¿ç第ä¸æ¬¡åºç°çå¼ï¼
â last -->ä¿çæåä¸æ¬¡åºç°çå¼ï¼
â False --> å é¤æéå¤åºç°çå¼ï¼ä¸ä¸ªé½ä¸ç
-
inplaceåæ°ï¼Trueï¼å°±å°å é¤ï¼ååºåæ¹å
â Falseï¼è¿åä¸ä¸ªæ°çåºåï¼ååºåä¸å
-
-
å»éåå ç´ ç个æ°ï¼
ser.nunique() â >è¿åint
-
æ¯ä¸ªå ç´ åºç°çé¢æ¬¡ï¼æé¢æ¬¡éåºæåï¼
ser.values_counts()
4. 空å¼çå¤ç
- å¤æ空å¼
- å¤æ空å¼ï¼ ser.isnull() â> è¿åå¸å°å¼ï¼æ¯ä¸ªå¼é½å¤æï¼è¿åå¸å°åºåï¼
- å¤æé空å¼ï¼ ser.notnull() â> è¿åå¸å°å¼ï¼æ¯ä¸ªå¼é½å¤æï¼è¿åå¸å°åºåï¼
- åºç¨ï¼éè¿å¸å°ç´¢å¼çéé空å¼ï¼
- serï¼ser.notnull( ) )
- å é¤ç©ºå¼
- ser.dropna(inplace = True )
- å é¤æå®æ°å¼ï¼ ser.dropï¼index=[0, 1, 2]ï¼
- å¡«å
空å¼
- å¡«å æå®æ°å¼ï¼ser.fillnaï¼50ï¼
- å°±è¿å¡«å
ï¼ser.fillnaï¼method=ffill / bfillï¼
- methodåæ°ï¼å®ä¹å¡«å
æ¹å¼
- ffill â> ç¨ç©ºå¼åé¢çæ°å¼å¡«å
- bfill â> ç¨ç©ºå¼åé¢çæ°å¼å¡«å
- ser.fillnaï¼ffillï¼.fillnaï¼bfillï¼ â> å ç¨åé¢çå¼å¡«å ï¼åç¨åé¢çå¼å¡«å
- methodåæ°ï¼å®ä¹å¡«å
æ¹å¼
5. æåº
- æç´¢å¼æåºï¼ser.sort_index(ascending = True)
- æå¼æåºï¼ser.sort_values(ascending = Trueï¼
-
ascendingåæ°ï¼True â> ååºæåï¼é»è®¤ï¼
â False â> éåºæå
-
- Top - N
- åNå¤§ï¼ ser.nlargestï¼3ï¼ â> å3大
- åNå°ï¼ ser.nsmallestï¼3ï¼â> å3å°
6. æ å° â> map()
â 类似äºpythonä¸mapé«é¶å½æ°çç¨æ³
â 示ä¾ï¼
ser = pd.Series([40,74,64,54,32])
newser = ser.map(lambda x: x ** 0.5 * 10)
â å®é 使ç¨ä¸ä¸ºé¿å æäºå¼å¸¸å¼å¯¹è¯ä¼°ç»æçå½±åï¼å¸¸éç¨ çº¿æ§å½ä¸å å é¶åå¼å½ä¸å对æ°æ®è¿è¡å¤çï¼
-
线æ§å½ä¸åï¼
X â² = X i â X m i n X m a x â X m i n X' = \frac{X_i - X_{min}} {X_{max} - X_{min}} Xâ²=XmaxââXminâXiââXminââ
x_i为åºåä¸çå¼ï¼X_max为åºåä¸æ大å¼ï¼X_min为åºåä¸æå°å¼
ç¨ä»£ç å®ç°ï¼
x_max, x_min = ser1.max(), ser1.min() ser1.map(lambda x: (x - x_min) / (x_max - x_min))
-
é¶åå¼å½ä¸åï¼
X â² = X i â μ Ï X' = \frac{X_i - \mu} {\sigma} Xâ²=ÏXiââμâ
x_i为åºåä¸çå¼ï¼Î¼ä¸ºåºåå¹³åå¼ï¼Ï为åºåä¸çæ åå·®
ç¨ä»£ç å®ç°ï¼
μ, Ï = ser1.mean(), ser1.std() ser1.map(lambda x: (x - μ) / Ï)
7. ç»å¾ â> polt()
-
è¡¥å ï¼å¯¹æ°æ®è¿è¡æ±æ»
ser.groupbyï¼level= 0/1/2ï¼.sum( )
- levelåæ°ï¼æå 级ææè¿è¡æ±æ»
- æ±ç¶å¾
ser1.plot(kind='bar', color=['red','green','blue']) plt.xticks(rotation=0) # 设置xè½´æ è¯åæ°´å¹³æ¹å plt.grid(True, alpha=0.5, axis='y', linestyle=':') # 设置ç½æ ¼çº¿ plt.show() # åæ°ï¼ kind --> å¾è¡¨çç±»åï¼ color --> å¾å½¢çé¢è²
- 饼ç¶å¾
temp.plot(kind='pie', autopct='%.2f%%') plt.ylabel('') plt.show() # åæ°ï¼autopct --> æ°æ®æ 注
DataFrame
1. å建DataFrame
- 读åcsvæ件å建DataFrame对象
df = pd.read_csv( r'D:\QianFeng.edu\Part5_BI\Day4\files\2018å¹´å京积åè½æ·æ°æ®.csv', usecols=['name', 'birthday', 'company', 'score'], sep=',', encoding='utf-8', skiprows = range(1, 11), nrows=10 )
- åæ°ï¼
- usecols â> è¦è¯»åçåå
- sep â> æå®æ件çé¢åé符ï¼é»è®¤ä¸ºéå·
- enconding â> æ件ç¼ç æ¹å¼ï¼é»è®¤ä¸ºutf-8
- skiprows â> è·³è¿åªäºè¡ä¸åï¼ä¸ºè¡çå ·ä½è¡å·ï¼å¯ä»¥æ¯listærange
- nrows â> è¦åå¤å¥½è¡æ°æ®
- åæ°ï¼
- 读åExcelæ件å建DataFrame对象
df = pd.read_excel( r'D:\QianFeng.edu\Part5_BI\Day4\files\2020å¹´éå®æ°æ®.xlsx', sheet_name='Sheet1', header=0, usecols=['éå®æ¥æ','éå®åºå','éå®æ¸ é','éå®æ°é'] )
- åæ°ï¼
- sheet_name:è¦æå¼çsheetçåå
- header:æå®è¡¨å¤´æå¨çè¡ï¼é»è®¤ä¸º0è¡
- å ¶ä»çä¸è¯»åCSVæ件ç¸å
- åæ°ï¼
- éè¿SQL读åæ°æ®åºæ°æ®å建DataFrameæ件
import pymysql conn = pymysql.connect(host='', port=3306, user='root', password='123456', database='hrs', charset='utf8mb4') dept_df = pd.read_sql( 'select dno as no, dname as name, dloc as location from tb_dept', conn, index_col='no')
- åæ°ï¼indexââcolï¼æå®ç´¢å¼åï¼ä¸æå®åé»è®¤æ·»å 0,1,2â¦ä¸ºç´¢å¼
- DataFrameçç¸å
³æä½
-
æ¥çDataFrameçä¿¡æ¯
df.info( )
-
æ¥çDataFrameå / åå è¡
df.head( ï¼ / df.tail(ï¼
- åDataFrameä¸çå
- df.[âenameâ] â> df.[åå]
- ä¹å¯éè¿è±å¼ç´¢å¼çæ¹æ³åå¤åï¼df.loc[ [ç´¢å¼1]ï¼[ç´¢å¼1] ] â> emp_df[ [âenameâ, âsalâ] ]
- åDataFrameä¸çè¡
- df.loc[1359] â> df.loc[ç´¢å¼]
- ä¹å¯éè¿è±å¼ç´¢å¼çæ¹æ³åå¤è¡ï¼df.loc[ [ç´¢å¼1]ï¼[ç´¢å¼1] ] â> emp_df.loc[[1359,5566]]
- åçåè¡ï¼df.loc[1359:3211]
-
ä¿®æ¹åå°çå¼
df.loc[1359, âenameâ] = âè¡ä¸éâ â> df.loc[ç´¢å¼ï¼ åå] = å¼
-
ç»DataFrameæ·»å å
df [åå] = å¼ â> è¥å¼ä¸ºå®å¼ï¼åä¼å¹¿æç»ææè¡ï¼è¥ä¸ºä¸ä¸ªåºåï¼åä¸æ¯ä¸è¡ä¸ä¸å¯¹åº
- ç»DataFrameæ·»å è¡
-
æ·»å å ¨é¨æ°æ®
df.loc[8899] = [âé·æ´â,âå·¥ç¨å¸â,7800,20000,10000,20,âæªå©â] â >df .loc[ç´¢å¼] = å å«ææå¼çåºå
-
æ·»å é¨åæ°æ®ï¼æªç¥çæ°æ®ä¸ºé»è®¤ç¨NaN
df. loc[7799] = {âenameâ:âç大é¤â, âmgrâ: 3088, âsalâ: 5000} â > df.loc[ç´¢å¼] = å å«é¨åå¼çåå ¸
-
-
å é¤å
df.dropï¼columns = ååï¼ â> å é¤å¤åï¼åå为ä¸ä¸ªåºå
-
å é¤è¡
df.dropï¼index = ç´¢å¼ï¼ â> å é¤å¤è¡ï¼ç´¢å¼ä¸ºä¸ä¸ªåºå
-
ä¿®æ¹è¡¨å¤´
df.renameï¼columns =å å«ä¿®æ¹ååååçåå ¸ï¼ --> df.rename(columns={âenameâ:âå§åâ})
-
ä¿®æ¹DataFrameçè¡ / å 顺åº
df. reindexï¼index / columns = å å« è¡ / å æ°é¡ºåºçåºåï¼ â> df.reindex(columns=[âenameâ,âjobâ,âsalâ,âcommâ,âdnoâ,âmgrâ,âå©å§»ç¶åµâ])
-
é置索å¼
df. reset_indexï¼ï¼
-
设置索å¼
df. set_index( é设置æç´¢å¼çåå )
-