Python Selenium (Scraping Tools Series)
2024-01-09
Notes on the Selenium library from the automation and scraping tools series: how to use automation and scraping techniques to boost productivity.
Overview
Mismatches between ChromeDriver and the installed Chrome version have long been a pain point. As a prerequisite, confirm that Chrome is installed and install the following library globally:
pip install webdriver-manager
Encapsulate the Selenium initialization in a single module so other scripts can reuse it.
SeleniumDriver.py
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# pip install webdriver-manager
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1080")
chrome_options.add_argument("--log-level=3")
custom_user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.1234.56 Safari/537.36"
chrome_options.add_argument(f"user-agent={custom_user_agent}")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
The --headless flag runs Selenium in the background, which is very convenient for automation. You can also remove it to watch the session on screen, and hand the job back to the background once everything checks out.
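As a sketch of that workflow, the flag list can be assembled from an environment variable so the same script runs supervised or headless. The `HEADLESS` variable and the helper name here are assumptions for illustration, not part of the original setup:

```python
import os

def chrome_flags(headless=None):
    """Build the Chrome argument list; headless defaults to the HEADLESS env var."""
    if headless is None:
        headless = os.environ.get("HEADLESS", "1") == "1"
    flags = ["--start-maximized", "--window-size=1920,1080", "--log-level=3"]
    if headless:
        flags.append("--headless")  # drop this flag to watch the browser
    return flags
```

Run with `HEADLESS=0` to supervise the session on screen, then switch back to the default once the script is verified.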
app.py
import requests
import os
import threading
from SeleniumDriver import driver, custom_user_agent, By

driver.get('https://imgs.sdwh.dev')
elements = driver.find_elements(By.CLASS_NAME, 'className')

def download_image(png_url):
    response = requests.get(png_url, headers={'User-Agent': custom_user_agent})
    if response.status_code == 200:
        file_name = os.path.basename(png_url)
        with open(file_name, 'wb') as file:
            file.write(response.content)

for element in elements:
    element.click()

png_elements = driver.find_elements(By.TAG_NAME, 'img')
threads = []
for png_element in png_elements:
    png_url = png_element.get_attribute('src')
    if png_url and png_url.endswith('.png'):
        thread = threading.Thread(target=download_image, args=(png_url,))
        threads.append(thread)
        thread.start()

for thread in threads:
    thread.join()
The script above clicks each element with class className in turn; each click reveals a set of images. driver.find_elements then locates the img tags, and the images are downloaded in batches with multiple threads.
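Spawning one thread per image works, but an unbounded thread count can overwhelm the server. The same fan-out/join pattern can be capped with concurrent.futures; this is a minimal sketch, with a stub fetch function standing in for the real requests.get call so it is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stub for requests.get(url).content; returns fake bytes for the sketch
    return b"png-bytes-for-" + url.encode()

urls = [f"https://imgs.sdwh.dev/{i}.png" for i in range(8)]

# At most 4 downloads run concurrently; map preserves input order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fetch, urls))
```

The `with` block joins all workers on exit, replacing the manual append/start/join bookkeeping.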
Selenium supports the following locator strategies:
- By.ID
- By.NAME
- By.CLASS_NAME
- By.TAG_NAME
- By.LINK_TEXT
- By.PARTIAL_LINK_TEXT
- By.XPATH
- By.CSS_SELECTOR
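Several of these strategies can target the same node. As an illustration (the helper below is hypothetical, not a Selenium API), here are equivalent locator strings for an element with a given class under three strategies:

```python
def class_locators(cls):
    """Equivalent locator strings for an element with class `cls`."""
    return {
        "CLASS_NAME": cls,          # By.CLASS_NAME matches a single class token
        "CSS_SELECTOR": f".{cls}",  # By.CSS_SELECTOR
        # By.XPATH: match cls as a whole token inside the class attribute
        "XPATH": f"//*[contains(concat(' ', normalize-space(@class), ' '), ' {cls} ')]",
    }
```

In practice By.CSS_SELECTOR is usually the most flexible choice for class- and attribute-based lookups.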
Fetching dynamically rendered HTML from a docsify-style site and saving it:
import time
from SeleniumDriver import driver, By

driver.get('https://docsify.js.org/#/configuration')
time.sleep(5)  # wait for docsify to render the page

content_element = driver.find_element(By.CLASS_NAME, "markdown-section")
content = content_element.get_attribute('outerHTML')
with open("configuration.html", "w", encoding="utf-8") as file:
    file.write(content)
driver.quit()
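A fixed time.sleep(5) wastes time when the page renders faster and fails when it renders slower; Selenium's WebDriverWait polls a condition instead. The same idea as a plain-Python sketch, where the condition function is a stand-in for a find_element call:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within timeout")
```

In real Selenium code, WebDriverWait combined with expected_conditions plays this role and returns the located element as soon as it appears.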