Python Pandas Study Notes

2020-06-02

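I have not used Python for a while. When I do, it is mostly for writing scripts or for data ETL. Every time I do ETL work I remember how convenient Stata was: muscle memory alone could call up all kinds of data-manipulation commands. Since leaving academia, though, I no longer use Stata, and Python's Pandas has taken its place. The commands of the two differ greatly, but Python has more and handier libraries, its syntax is better suited to scripting, and it is open source. So there is no point in being nostalgic; time to accept it and learn Pandas.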

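Quick Test Data

import numpy as np
import pandas as pd

# 4x4 frame holding 0..15, with row labels a-d and column labels c1-c4
df = pd.DataFrame(
  np.arange(16).reshape(4, 4),
  index = ['a', 'b', 'c', 'd'],
  columns = ['c1', 'c2', 'c3', 'c4'])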

Securities and Futures Commission Open Data

import pandas as pd
df = pd.read_csv('http://www.twse.com.tw/exchangeReport/STOCK_DAY_ALL?response=open_data')

# https://data.gov.tw/dataset/11549

Python Metasyntactic Variables

spam = ham = eggs = 42

Wiki
catb

IPython Magic

%cd
%run
%timeit
%matplotlib inline

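DataFrame Filter (Filtering Data)

Pandas filters data with a mask. A mask is simply a Pandas Series holding True or False for each row; indexing the DataFrame with the mask keeps only the rows where the mask is True.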

import pandas as pd
df = pd.read_excel('data.xlsx')

mask = df['score'] >= 50
print(df[mask])
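
As a minimal sketch of the mask idea, the example below uses a small in-memory table in place of data.xlsx; the names and scores are made up for illustration. It also shows that masks can be combined with & and | and inverted with ~.

```python
import pandas as pd

# Hypothetical data standing in for data.xlsx
df = pd.DataFrame({
    'name': ['amy', 'bob', 'cat', 'dan'],
    'score': [35, 50, 72, 90],
})

# A mask is a boolean Series aligned with the rows of df
mask = df['score'] >= 50

# Combine masks with & (and) or | (or); invert with ~.
# Each comparison needs its own parentheses because & binds tighter than >=.
passed_not_top = df[(df['score'] >= 50) & ~(df['score'] >= 90)]

print(mask.tolist())
print(passed_not_top)
```

Python's `and` / `or` keywords do not work element-wise on a Series, which is why the bitwise operators are used here.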

String Comparison

mask = df['category'].str.strip() == 'drinks'

To match a column against several strings at once in Pandas, join the strings with | so that they form a single regex pattern.

categories = ['drinks', 'foods', 'electronics']
mask = df['category'].str.contains('|'.join(categories))
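
Because str.contains treats the pattern as a regular expression by default, keywords containing regex metacharacters can misbehave. The sketch below, using made-up category values, escapes each keyword with re.escape before joining; na=False is an added choice so that missing values count as non-matches.

```python
import re
import pandas as pd

# Hypothetical category values for illustration
df = pd.DataFrame({'category': ['drinks', 'soft drinks', 'foods', 'c++ books']})

# 'c++' is not a valid regex on its own, so escape every keyword first
categories = ['drinks', 'c++']
pattern = '|'.join(re.escape(c) for c in categories)

# str.contains does a substring search, so 'soft drinks' matches too
mask = df['category'].str.contains(pattern, na=False)
print(df[mask])
```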

Series str

Check Whether a Column Matches List Elements

df['column'].isin(['keyword1', 'keyword2', 'keyword3'])
df['證券名稱'].isin(['兆豐金', '中鋼'])
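
For comparison with substring matching, here is a small sketch with made-up names: isin only matches whole values, and the resulting mask can be inverted with ~ to exclude them instead.

```python
import pandas as pd

# Hypothetical names for illustration
df = pd.DataFrame({'name': ['tea', 'green tea', 'rice']})

# isin compares whole values, so 'green tea' does not match 'tea'
exact = df['name'].isin(['tea', 'rice'])

# Invert the mask to keep everything that is NOT in the list
excluded = df[~exact]
print(exact.tolist())
print(excluded)
```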

Convert String Data to Numbers (astype)

df['StringColumn'].str.replace(',', '', regex=False).astype('int')
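
A point worth illustrating: Series.replace swaps whole values, while Series.str.replace works on substrings, so removing thousands separators needs the .str version. The sketch below uses made-up values; pd.to_numeric with errors='coerce' is shown as one possible way to tolerate entries that are not numbers.

```python
import pandas as pd

# Hypothetical strings with thousands separators
s = pd.Series(['1,234', '56', '7,890,123'])

# Strip the commas inside each string, then convert
as_int = s.str.replace(',', '', regex=False).astype('int64')

# When some entries are not numeric, coerce them to NaN instead of raising
dirty = pd.Series(['1,234', 'n/a', '--'])
as_float = pd.to_numeric(dirty.str.replace(',', '', regex=False), errors='coerce')

print(as_int.tolist())
print(as_float.tolist())
```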

Create a DataFrame

With namedtuple you can build a DataFrame from objects: each instance becomes a row, and the field names become the column names.

import pandas as pd
from collections import namedtuple
EntryClass = namedtuple('EntryClass', ['col1', 'col2', 'col3'])
pd.DataFrame([EntryClass(...), EntryClass(...), EntryClass(...)])
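
A runnable version of the same idea might look like the sketch below. The Trade type and its values are hypothetical, chosen only to show that the field names end up as column names.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical record type; the field names become the DataFrame columns
Trade = namedtuple('Trade', ['symbol', 'price', 'volume'])

rows = [
    Trade('AAA', 10.5, 100),
    Trade('BBB', 20.0, 250),
]

df = pd.DataFrame(rows)
print(df)
```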

DataFrame: Get a Cell Value (at)

df.at['rowName', 'colName']

DataFrame: Select a Row Range and a Column Range Together (loc)

# df.loc[rowRange, colRange]
df.loc[:, 'c2':'c4']
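
One detail that is easy to miss: label slices in loc include both endpoints, whereas positional slices in iloc exclude the end. The sketch below rebuilds a 4x4 test frame (row labels a-d, columns c1-c4) to show the two forms selecting the same block.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=['a', 'b', 'c', 'd'],
    columns=['c1', 'c2', 'c3', 'c4'],
)

# Label-based: both 'c' and 'c3' are included
by_label = df.loc['b':'c', 'c2':'c3']

# Position-based: the end positions (3) are excluded
by_position = df.iloc[1:3, 1:3]

print(by_label)
```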

DataFrame: Add a Row (loc)

df.loc[newIndex] = [col1, col2, ...]

DataFrame: Modify a Row (loc, iloc)

df.loc[IndexName] = [col1, col2, ...]
df.iloc[rowPosition] = [col1, col2, ...]
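
Adding and modifying rows share the same syntax; what differs is whether the label already exists. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'c1': [1, 2], 'c2': [3, 4]}, index=['a', 'b'])

# The label 'c' does not exist yet, so loc appends a new row
df.loc['c'] = [5, 6]

# The label 'a' exists, so loc overwrites that row
df.loc['a'] = [10, 30]

# iloc addresses rows by integer position (here the second row, label 'b')
df.iloc[1] = [20, 40]

print(df)
```

Unlike loc, iloc cannot add a row: assigning to a position past the end raises an IndexError.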

DataFrame: Write to a Column Where a Condition Holds (loc)

df.loc[mask, 'ColumnName'] = 1
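
Put together with a mask, a conditional write might look like this sketch; the score column and the passed flag are hypothetical names.

```python
import pandas as pd

df = pd.DataFrame({'score': [35, 50, 72]}, index=['a', 'b', 'c'])

# Start with a default, then overwrite only the rows selected by the mask
df['passed'] = 0
mask = df['score'] >= 50
df.loc[mask, 'passed'] = 1

print(df)
```

Doing the row selection and the column selection in a single loc call is what makes the assignment land on df itself rather than on a temporary copy.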

Map Value from DataFrame, Series to Series (apply)

df['Column'].apply(lambda x : 'T' if x else 'F')
# Return Series According to lambda

df.apply(lambda row : row['Column'], axis = 1)
# Return Series According to lambda
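
To make the two forms concrete, here is a small sketch with made-up price and qty columns. It also includes the vectorized equivalent of the row-wise version, which is usually faster when one exists.

```python
import pandas as pd

df = pd.DataFrame({'price': [10, 20, 30], 'qty': [1, 0, 3]})

# Series.apply: the lambda receives one cell value at a time
in_stock = df['qty'].apply(lambda q: 'T' if q else 'F')

# DataFrame.apply with axis=1: the lambda receives one row at a time
total = df.apply(lambda row: row['price'] * row['qty'], axis=1)

# The same result using plain column arithmetic
total_vectorized = df['price'] * df['qty']

print(in_stock.tolist())
print(total.tolist())
```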

DataFrame Rename Column (rename)

df.rename(columns = {'c2': 'c2!'})

DataFrame: Basic Descriptive Statistics (describe)

df.describe()

DataFrame: Top n Rows by Value (nlargest, nsmallest)

df.nlargest(5, 'c2')
# Smallest values
# df.nsmallest(5, 'c2')

DataFrame Column to NumPy Array (values)

df['c1'].values

Inspect the Value Distribution of a Column (value_counts)

df['columnName'].value_counts()

Inspect the Distribution by Group (groupby, size)

df.groupby('columnName').size()
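
value_counts and groupby(...).size() produce the same counts but order them differently. A sketch with made-up categories:

```python
import pandas as pd

df = pd.DataFrame({'category': ['drinks', 'foods', 'drinks', 'toys', 'drinks']})

# value_counts orders the result by count, largest first
counts = df['category'].value_counts()

# groupby(...).size() orders the result by group key
sizes = df.groupby('category').size()

print(counts)
print(sizes)
```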

groupby

Conform to a New Index (reindex)

df.reindex([...])

Fill Data When Reindexing (ffill, bfill)

s = pd.Series(['b', 'r', 'g'], index = [1, 3, 5])
s.reindex(range(8), method='ffill')
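
The two fill methods differ in which neighbour they copy from. The sketch below runs both on the same three-value Series so the difference at the edges is visible.

```python
import pandas as pd

s = pd.Series(['b', 'r', 'g'], index=[1, 3, 5])

# ffill copies the last earlier value forward; index 0 has none, so it stays NaN
forward = s.reindex(range(8), method='ffill')

# bfill copies the next later value backward; indexes 6 and 7 have none
backward = s.reindex(range(8), method='bfill')

print(forward)
print(backward)
```

The method argument requires the original index to be sorted (monotonic), which holds for 1, 3, 5.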

Sort Data: Order DataFrame by a Specific Column (sort_values)

df.sort_values('col2')

Filter Column Name By Regex (filter)

df.filter(regex = 'c[23]')
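
Two regex details matter here: inside a character class such as [23], a | would be a literal character rather than "or", and filter searches anywhere in the column name unless the pattern is anchored. The sketch below uses a made-up column xc2 to show the second point.

```python
import pandas as pd

# Hypothetical columns; 'xc2' only matches when the pattern is unanchored
df = pd.DataFrame([[1, 2, 3, 4]], columns=['c1', 'c2', 'c3', 'xc2'])

# Unanchored: matches any name containing 'c2' or 'c3'
loose = df.filter(regex='c[23]')

# Anchored with ^ and $: matches only names that are exactly 'c2' or 'c3'
strict = df.filter(regex='^c[23]$')

print(list(loose.columns))
print(list(strict.columns))
```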

Delete Rows with Duplicated Column Values (drop_duplicates)

df.drop_duplicates('c3')
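
By default drop_duplicates keeps the first row of each duplicate group; the keep parameter changes that. A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'c3': ['x', 'y', 'x'], 'v': [1, 2, 3]})

# Default keep='first': the earliest row of each duplicated c3 value survives
first = df.drop_duplicates('c3')

# keep='last': the latest row survives instead
last = df.drop_duplicates('c3', keep='last')

print(first)
print(last)
```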

Delete Rows by Index Value (drop)

df.drop(['100201', '100205'])

Delete Columns by Column Name (drop)

df.drop(['col1', 'col2'], axis = 1)

Sum of Each Column (sum)

df.sum()

Sum of Each Row (sum)

df.sum(axis = 1)
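
The axis argument is easy to get backwards, so here is a sketch on a 4x4 test frame holding 0 to 15: the default (axis=0) collapses the rows and gives one total per column, while axis=1 collapses the columns and gives one total per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=['a', 'b', 'c', 'd'],
    columns=['c1', 'c2', 'c3', 'c4'],
)

# One total per column, indexed by column name
column_sums = df.sum()

# One total per row, indexed by row label
row_sums = df.sum(axis=1)

print(column_sums)
print(row_sums)
```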

Run an ipynb File from the Terminal

jupyter nbconvert --to python nb.ipynb
python nb.py

StackOverFlow

Save as a CSV File Without the Index Column

df.to_csv('filename.csv', index = False)

csv with quoting (to_csv, quoting, QUOTE_NONNUMERIC)

import csv
df.to_csv('filename.csv', quoting=csv.QUOTE_NONNUMERIC)
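
To see what QUOTE_NONNUMERIC does without touching the disk, the sketch below writes made-up data to an in-memory buffer: string fields, including the header, get quoted, while numbers are left bare.

```python
import csv
import io

import pandas as pd

# Hypothetical data; one value contains a comma
df = pd.DataFrame({'name': ['tea', 'rice, white'], 'price': [30, 45]})

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer, index=False, quoting=csv.QUOTE_NONNUMERIC)

lines = buffer.getvalue().splitlines()
for line in lines:
    print(line)
```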

Create a New Column

This may trigger:

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.

method 1

df['newCol'] = ('String' + df.col1.str.slice(3))

method 2

df.loc[:, 'newCol'] = ('String' + df.col1.str.slice(3))
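
The warning typically shows up when the frame being modified was itself obtained by filtering or slicing another DataFrame. One common remedy, sketched here with made-up data, is to take an explicit copy before adding the column, so it is clear that the original frame is not meant to change.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['abc123', 'xyz456', 'abc789']})

# .copy() makes the subset an independent DataFrame
subset = df[df['col1'].str.startswith('abc')].copy()

# Adding a column to the copy leaves df untouched
subset['newCol'] = 'String' + subset['col1'].str.slice(3)

print(subset)
```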

References

Pandas Cheatsheet