Python Regex

2023-12-04

Notes on handling regular expressions with Python's re module.

re.match

re.match compares from the beginning of text, so it suits checking whether a text conforms to a pattern rather than searching inside the text.

import re

text = "So, you remember that time when someone told you that the color of your skin is all about something called 'pigment' and you were like, wait, what?"

re.match('So,', text)
# <re.Match object; span=(0, 3), match='So,'>
re.match('o,', text)
# None

re.fullmatch

re.fullmatch is stricter still: the entire text, from start to end, must match the pattern for the match to succeed.
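A small sketch contrasting re.match and re.fullmatch. The date pattern and the sample strings are made up for illustration and are not part of the notes above:

```python
import re

# Hypothetical pattern and inputs, chosen only for illustration.
date_pattern = r'\d{4}-\d{2}-\d{2}'

# re.match only anchors at the start, so trailing text is tolerated.
prefix_ok = re.match(date_pattern, '2023-12-04 extra')
# re.fullmatch requires the whole string to fit the pattern.
full_fail = re.fullmatch(date_pattern, '2023-12-04 extra')
full_ok = re.fullmatch(date_pattern, '2023-12-04')

print(prefix_ok is not None)  # True
print(full_fail)              # None
print(full_ok.group(0))       # 2023-12-04
```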

re.findall

re.findall, by contrast, searches anywhere within text, which makes it suitable for keyword lookups.

re.findall('[Yy]ou', text)
# ['you', 'you', 'you', 'you']
# The third 'you' is actually the first three letters of 'your'
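If the hit inside your is unwanted, one option is to wrap the pattern in \b word boundaries. A minimal sketch, using a shortened copy of the sentence (the variable name sample is an assumption made for this example):

```python
import re

# Shortened, hypothetical variant of the sample sentence.
sample = "So, you remember when someone told you that the color of your skin is pigment and you were like, wait, what?"

# Without boundaries, the 'you' inside 'your' is also found.
loose = re.findall(r'[Yy]ou', sample)
# With \b on both sides, only the standalone word is found.
strict = re.findall(r'\b[Yy]ou\b', sample)

print(len(loose))  # 4
print(strict)      # ['you', 'you', 'you']
```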

re.DOTALL

By default, . stands for any character other than a newline. Note the exception: newlines are not matched 😮

To make . match newlines as well, pass the re.DOTALL flag.

text = """So, you remember that time when someone told you that the color of your skin is all about something called 'pigment' and you were like, wait, what?

These renegade painters blend traditional techniques with cutting-edge technology, hacking into bank accounts to fund their vibrant pigment supplies."""

re.match('.*', text).group(0)
# Only the first paragraph matches; matching stops at the newline
"So, you remember that time when someone told you that the color of your skin is all about something called 'pigment' and you were like, wait, what?"

re.match('.*', text, re.DOTALL).group(0)
"So, you remember that time when someone told you that the color of your skin is all about something called 'pigment' and you were like, wait, what?\n\nThese renegade painters blend traditional techniques with cutting-edge technology, hacking into bank accounts to fund their vibrant pigment supplies."

Tips

String searching has a few subtleties.

For example, searching for 4 characters at a time splits remember into reme and mber. It does not produce overlapping slices such as reme, emem, memb, embe, mber.

re.findall(r'\w{4}', text)
# ['reme', 'mber', 'that', 'time', 'when', 'some' ...]

To get overlapping slices, wrap the pattern in a lookahead, (?=(pattern)). A lookahead does not "consume" characters while searching 😎:

re.findall(r'(?=(\w{4}))', text)
# ['reme', 'emem', 'memb', 'embe', 'mber' ...]
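The difference is easier to see on a single word. The helper below, overlapping_chunks, is a hypothetical name introduced for this sketch; it simply builds the lookahead pattern for a given chunk size:

```python
import re

def overlapping_chunks(word, size):
    """Return every run of `size` word characters, including overlapping ones."""
    # The lookahead (?=...) matches at a position without consuming characters,
    # so the next search attempt starts just one character later.
    pattern = r'(?=(\w{%d}))' % size
    return re.findall(pattern, word)

plain = re.findall(r'\w{4}', 'remember')
overlapped = overlapping_chunks('remember', 4)

print(plain)       # ['reme', 'mber']
print(overlapped)  # ['reme', 'emem', 'memb', 'embe', 'mber']
```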

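To find words that are exactly 4 characters long, and exclude 4-character runs inside longer words, combine the pattern with \b.

\b matches not only next to whitespace but next to any non-word character, that is, anything other than a letter, digit, or underscore (for example (abcd) and abcd!), as well as at the start or end of the string.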

re.findall(r'\b\w{4}\b', text)
# ['that', 'time', 'when', 'told', 'that', 'your' ...]

re.findall(r'\b\w{4}\b', '(abcd) efgh. ijkl- mnopq')
# ['abcd', 'efgh', 'ijkl']
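Two edge cases are worth checking by hand: the ends of the string count as boundaries, while an underscore does not, because \w treats it as a word character. A short sketch with made-up inputs:

```python
import re

pattern = r'\b\w{4}\b'

# Start and end of the string act as word boundaries.
at_edges = re.findall(pattern, 'abcd')
# '_' is a word character, so there is no boundary between 'abcd' and 'efgh'.
with_underscore = re.findall(pattern, 'abcd_efgh')
# '-' is a non-word character, so both halves are found.
with_hyphen = re.findall(pattern, 'abcd-efgh')

print(at_edges)         # ['abcd']
print(with_underscore)  # []
print(with_hyphen)      # ['abcd', 'efgh']
```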

Chinese Chars

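The pattern below finds Chinese characters, both Traditional and Simplified. To tell Traditional from Simplified, you additionally need to convert each character to its Unicode code point and look it up in a table.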

re.findall(r'[\u4E00-\u9FFF]+', '這是中文與 english 混合的一個 string 短文，堃，这是简体字')
['這是中文與', '混合的一個', '短文', '堃', '这是简体字']

Traditional and Simplified character mapping table
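The lookup idea can be sketched as follows. The SIMPLIFIED_TO_TRADITIONAL table here is a hand-written, hypothetical stand-in for a real mapping table, and simplified_chars is a name invented for this example. A real table would need thousands of entries, and characters shared by both scripts cannot be classified this way.

```python
import re

# Hypothetical, deliberately tiny lookup table (simplified -> traditional).
SIMPLIFIED_TO_TRADITIONAL = {'这': '這', '简': '簡', '体': '體'}

def simplified_chars(text):
    """Return the CJK characters in `text` that appear as keys in the lookup table."""
    cjk = re.findall(r'[\u4E00-\u9FFF]', text)
    return [ch for ch in cjk if ch in SIMPLIFIED_TO_TRADITIONAL]

found = simplified_chars('这是简体字')
for ch in found:
    # ord() exposes the Unicode code point of each character.
    print(ch, f'U+{ord(ch):04X}', '->', SIMPLIFIED_TO_TRADITIONAL[ch])
```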

Grouping

Parentheses () turn parts of a pattern into groups. With groups, re.findall returns the matches as a list of tuples, which makes it the king of tools for text data processing.

re.findall(r'(\w*):(\w*)', 'xyz:100 py:200 sql : \n300')
[('xyz', '100'), ('py', '200'), ('', '')]

# \s already includes \n, so \s* is enough to allow spaces and newlines around the colon
re.findall(r'(\w*)\s*:\s*(\w*)', 'xyz:100 py:200 sql : \n300')
[('xyz', '100'), ('py', '200'), ('sql', '300')]
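Because each tuple is a (key, value) pair, the result converts naturally into a dict. A minimal sketch, with two assumptions of mine: it swaps \w* for \w+ so that empty keys or values are never captured, and it treats the values as integers. The names raw, pairs, and config are illustrative.

```python
import re

raw = 'xyz:100 py:200 sql : \n300'

# \w+ (instead of \w*) requires at least one character on each side of the colon.
pairs = re.findall(r'(\w+)\s*:\s*(\w+)', raw)

# Each tuple unpacks into key and value; values are assumed to be integers here.
config = {key: int(value) for key, value in pairs}

print(pairs)   # [('xyz', '100'), ('py', '200'), ('sql', '300')]
print(config)  # {'xyz': 100, 'py': 200, 'sql': 300}
```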