1. Selenium simulated login
Table of Contents
- Driver Installation
- Baidu example
- find node
- action chain
- run js
- get node
- Switch Frame
- delay waiting
- Tab
- Anti-blocking
- headless mode
Previously, when crawling with requests and similar libraries, the URLs were ready-made and could be requested directly. Once a site adds encrypted parameters such as token or sign, the alternative to reverse engineering its JavaScript is to drive a real browser and simulate the login, so that whatever you can see in the browser, you can crawl.
Official website API
https://www.selenium.dev/documentation/
Driver installation
Installation address
Make sure the driver version matches your browser version, otherwise the two may be incompatible. The browser version is shown in the browser's settings.
- Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
- Chrome: http://chromedriver.storage.googleapis.com/index.html
- Firefox: https://github.com/mozilla/geckodriver/releases
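The version check described above can be sketched in a few lines. The helper names `major` and `same_major_version` are hypothetical, and treating "same major version" as "compatible" is a simplifying assumption; each driver's release notes state the exact rule.

```python
def major(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def same_major_version(browser_version: str, driver_version: str) -> bool:
    """True when the browser and the driver share the same major version."""
    return major(browser_version) == major(driver_version)
```

With made-up sample values, `same_major_version("114.0.1823.43", "114.0.1823.37")` is true, while a browser at major version 114 and a driver at 113 would be flagged as a mismatch.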
How to install
The simplest approach is to drop the driver's exe file into Python's Scripts directory. That directory is normally already on the PATH environment variable once Python is installed, so no extra configuration is needed.
For Edge, note that you need to make a copy of msedgedriver.exe, name the copy MicrosoftWebDriver.exe, and place both files in that directory.
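To confirm that the driver really is reachable through PATH after copying it, a small check with the standard library may help. This is only a sketch: the executable names in `DRIVER_NAMES` are assumptions based on the drivers listed above, and `find_drivers` and `report` are hypothetical helper names.

```python
import shutil

# Executable names are assumptions; on Windows, shutil.which also tries the
# extensions listed in PATHEXT (such as .exe).
DRIVER_NAMES = ("msedgedriver", "chromedriver", "geckodriver")


def find_drivers(names=DRIVER_NAMES):
    """Map each driver name to its full path on PATH, or None if not found."""
    return {name: shutil.which(name) for name in names}


def report(found):
    """Build one human-readable line per driver."""
    return [f"{name}: {path or 'not found on PATH'}" for name, path in found.items()]
```

Printing the lines returned by `report(find_drivers())` shows which drivers Selenium would be able to start without an explicit path.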
Baidu example
This example uses Selenium to search for a keyword on Baidu, then prints the current URL, the cookies, and the page source.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait

    browser = webdriver.Firefox()
    try:
        browser.get("https://www.baidu.com")
        # Input box
        input_box = browser.find_element(By.ID, "kw")
        # Type the keyword
        input_box.send_keys("Python")
        # Press Enter
        input_box.send_keys(Keys.ENTER)
        wait = WebDriverWait(browser, 3)
        # Wait until the content_left node is present
        wait.until(EC.presence_of_element_located((By.ID, "content_left")))
        print(browser.current_url)
        print(browser.get_cookies())
        # Page source code
        print(browser.page_source)
    finally:
        browser.close()
Find Node
The By class lists the supported locator strategies:

    class By:
        """Set of supported locator strategies."""

        ID = "id"
        XPATH = "xpath"
        LINK_TEXT = "link text"
        PARTIAL_LINK_TEXT = "partial link text"
        NAME = "name"
        TAG_NAME = "tag name"
        CLASS_NAME = "class name"
        CSS_SELECTOR = "css selector"

Usage:

    from selenium.webdriver.common.by import By

    # single node
    input_box = browser.find_element(By.ID, "kw")
    # multiple nodes (returns a list)
    items = browser.find_elements(By.CSS_SELECTOR, "li")
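One practical difference between the two calls: find_element raises NoSuchElementException when nothing matches, while find_elements returns an empty list. The sketch below builds on that to return None instead of raising. The helper name `find_first` is hypothetical, and `browser` is assumed to be any object exposing Selenium's `find_elements(by, value)` method.

```python
def find_first(browser, by, value):
    """Return the first matching element, or None when nothing matches.

    Relies on find_elements returning an empty list for no matches,
    whereas find_element would raise NoSuchElementException.
    """
    matches = browser.find_elements(by, value)
    return matches[0] if matches else None
```

A call such as `find_first(browser, By.CSS_SELECTOR, "li")` can then be tested with a plain `if` rather than a try/except block.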
Action chain
1. .send_keys() # input text
2. .clear() # clear input
3. .click() # click
4. ActionChains(browser).drag_and_drop(source, target) # drag and drop (used in the Switch Frame section)
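The first three operations are often combined when filling in a form. A minimal sketch, assuming `input_box` and `button` are elements already located with find_element; the function name `submit_search` is hypothetical.

```python
def submit_search(input_box, button, keyword):
    """Replace the contents of input_box with keyword, then click button."""
    input_box.clear()             # send_keys appends, so empty the field first
    input_box.send_keys(keyword)  # type the new text
    button.click()                # submit
```

Calling clear() first matters because send_keys adds to whatever text the field already holds.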
Run js
Selenium has no API for some operations, for example scrolling down to trigger data that JavaScript loads on demand. In such cases we can run JavaScript directly through execute_script.
    from selenium import webdriver

    browser = webdriver.Firefox()
    browser.get("https://www.zhihu.com/explore")
    # Scroll to the bottom of the page
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    browser.execute_script('alert("to the bottom")')
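On pages that keep loading more data as you scroll, a single scroll is rarely enough. One possible approach, sketched here under the assumption that the page grows taller whenever new data arrives, is to scroll repeatedly until document.body.scrollHeight stops changing. The function name `scroll_until_stable` and its `pause` and `max_rounds` parameters are hypothetical.

```python
import time


def scroll_until_stable(browser, pause=0.5, max_rounds=20):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new appeared after this scroll
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap keeps the loop from running forever on pages that never stop growing.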
Get node
    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()
    browser.get("https://spa2.scrape.center/")
    logo = browser.find_element(By.CLASS_NAME, "logo-image")
    print(logo)
    # get an attribute
    print(logo.get_attribute("src"))
    # get text
    # logo.text
    # get id
    # logo.id
    # get location
    # logo.location
    # get width and height
    # logo.size
    time.sleep(1)
    browser.close()
Switch Frame
A web page may contain an iframe, that is, a sub-frame that acts as a page embedded in the page. Selenium operates on the parent page by default, so to work with nodes inside the sub-page we must first switch into it with switch_to.frame.
- error example
    from selenium import webdriver
    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()
    browser.get("https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable")
    source = browser.find_element(By.ID, "draggable")
    target = browser.find_element(By.ID, "droppable")
    actions = ActionChains(browser)
    actions.drag_and_drop(source, target)
    actions.perform()

Running this raises an error because the nodes cannot be located: they live inside the iframe, not in the parent page.
- switch frame
    from selenium import webdriver
    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()
    browser.get("https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable")
    # Switch to the specified frame
    browser.switch_to.frame("iframeResult")
    source = browser.find_element(By.ID, "draggable")
    target = browser.find_element(By.ID, "droppable")
    actions = ActionChains(browser)
    actions.drag_and_drop(source, target)
    # Execute the action
    actions.perform()
    # Switch back to the parent frame (parent_frame takes no arguments)
    browser.switch_to.parent_frame()
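Forgetting to switch back is an easy mistake, because every later lookup then silently runs inside the frame. One way to guard against it is a context manager, sketched below. The name `inside_frame` is hypothetical, and this version returns to the top-level page with switch_to.default_content() rather than parent_frame(), which is a deliberate simplification for pages with a single level of frames.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(browser, frame_reference):
    """Switch into a frame for the duration of a with-block.

    frame_reference may be anything switch_to.frame accepts
    (a name or id string, an index, or a frame element).
    """
    browser.switch_to.frame(frame_reference)
    try:
        yield browser
    finally:
        # Always return to the top-level document, even if the body raised.
        browser.switch_to.default_content()
```

The drag-and-drop lookups would then go inside `with inside_frame(browser, "iframeResult"):`, and the switch back happens automatically when the block ends.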
Delay waiting
- implicit wait
With an implicit wait, if a node is not found in the DOM right away, Selenium keeps retrying for up to the given time before raising an exception. The default time is 0, meaning no waiting.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Firefox()
    # Retry each lookup for up to 5 seconds
    browser.implicitly_wait(5)
    browser.get("https://spa2.scrape.center/")
    logo = browser.find_element(By.CLASS_NAME, "logo-image")
    print(logo)
- explicit wait
The drawback of an implicit wait is that one fixed timeout applies to every lookup, while the loading speed of a site depends on the network. An explicit wait instead sets a maximum time for a specific condition on a node: if the condition is met early, it returns immediately, and if the time runs out, it raises a timeout exception.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    browser = webdriver.Firefox()
    browser.get("https://www.taobao.com/")
    wait = WebDriverWait(browser, 5)
    # node exists
    input_box = wait.until(EC.presence_of_element_located((By.ID, "q")))
    # button is clickable
    button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn-search")))
    print(input_box, button)
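The "return early, or time out" behavior can be illustrated with a small polling loop. This is a simplified illustration of the idea, not Selenium's actual implementation: the function `wait_until` and its parameters are hypothetical, and the built-in TimeoutError stands in for Selenium's own TimeoutException.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition met early: return immediately
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)  # wait a little before checking again
```

The key design point is that the timeout is only an upper bound; a fast page costs no extra waiting.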
- waiting condition
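All built-in waiting conditions live in the expected_conditions module:

    from selenium.webdriver.support import expected_conditions as EC

    # EC.***, for example EC.presence_of_element_located or EC.element_to_be_clickable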
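Besides the built-in conditions, WebDriverWait.until accepts any callable that takes the driver and returns a truthy value on success. The sketch below builds such a condition; the name `at_least_n_elements` is hypothetical and not part of Selenium.

```python
def at_least_n_elements(by, value, n):
    """Build a wait condition that succeeds once n or more elements match."""

    def condition(driver):
        elements = driver.find_elements(by, value)
        # A falsy result tells the wait to keep polling.
        return elements if len(elements) >= n else False

    return condition
```

It would be used like the built-in ones, for instance `WebDriverWait(browser, 5).until(at_least_n_elements(By.CSS_SELECTOR, "li", 10))`, which returns the matched list once ten items have loaded.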
- forward, backward
    browser.forward()
    browser.back()
Tabs
    import time

    from selenium import webdriver

    browser = webdriver.Firefox()
    browser.get("https://www.baidu.com/")
    # open a new tab
    browser.execute_script("window.open()")
    print(browser.window_handles)
    # switch to the second tab
    browser.switch_to.window(browser.window_handles[1])
    browser.get("https://www.taobao.com/")
    time.sleep(1)
    # switch back to the first tab
    browser.switch_to.window(browser.window_handles[0])
    browser.get("https://zhihu.com")
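Indexing window_handles by position works for two tabs but becomes fragile once more are open. A sketch of an alternative, under the assumption that window.open() adds exactly one new handle: compare the handles before and after the call. The helper name `open_tab` is hypothetical.

```python
def open_tab(browser, url):
    """Open url in a new tab, switch to it, and return the new tab's handle."""
    before = set(browser.window_handles)
    browser.execute_script("window.open()")
    # The new handle is the one that was not present before the call.
    new_handles = [h for h in browser.window_handles if h not in before]
    if not new_handles:
        raise RuntimeError("no new tab was opened")
    browser.switch_to.window(new_handles[0])
    browser.get(url)
    return new_handles[0]
```

Keeping the returned handle makes it possible to switch back to that tab later with `browser.switch_to.window(handle)` regardless of how many tabs exist.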
Anti-blocking
Some websites detect Selenium. The check looks at the webdriver attribute of the navigator object in the current browser window: under Selenium it is set, whereas in a normally operated browser its value is undefined (newer browsers may report false instead).
A typical example is https://antispider1.scrape.center/
If we crawl it directly with Selenium, the page shows Webdriver Forbidden.
Since the problem is that the webdriver attribute is defined, a first idea is to set it to undefined through execute_script. Let's try that first.
browser.execute_script('Object.defineProperty(navigator, "webdriver", {get: () => undefined})')
This does not work, because execute_script runs only after the page has loaded, by which time the site's detection has already run.
The fix is to use the CDP (Chrome DevTools Protocol) command Page.addScriptToEvaluateOnNewDocument, which runs the JavaScript on every new document before the page's own scripts execute.
    import time

    from selenium import webdriver
    from selenium.webdriver import EdgeOptions

    option = EdgeOptions()
    option.add_experimental_option("excludeSwitches", ["enable-automation"])
    option.add_experimental_option("useAutomationExtension", False)
    browser = webdriver.Edge(options=option)
    browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
    })
    browser.get("https://antispider1.scrape.center/")
    time.sleep(5)
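To check whether the override took effect, the page's view of navigator.webdriver can be read back after loading. A minimal sketch, assuming a Chromium-based driver (Edge or Chrome) that provides execute_cdp_cmd; the names `HIDE_WEBDRIVER_JS`, `hide_webdriver_flag` and `webdriver_flag_visible` are hypothetical.

```python
HIDE_WEBDRIVER_JS = (
    'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
)


def hide_webdriver_flag(browser):
    """Register the override so it runs before each new page's own scripts."""
    browser.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": HIDE_WEBDRIVER_JS}
    )


def webdriver_flag_visible(browser):
    """Return True if the current page can still see navigator.webdriver."""
    # undefined comes back as None in Python, so bool() turns it into False.
    return bool(browser.execute_script("return navigator.webdriver"))
```

Calling `hide_webdriver_flag(browser)` before `browser.get(...)` and then `webdriver_flag_visible(browser)` afterwards gives a quick sanity check; passing it does not guarantee that a site has no other detection methods.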
Headless mode
Headless mode runs the browser without opening a window. It can also reduce the loading of some resources, saving loading time and bandwidth.
    from selenium import webdriver
    from selenium.webdriver import EdgeOptions

    option = EdgeOptions()
    # headless mode
    option.add_argument("--headless")
    browser = webdriver.Edge(options=option)
    # set the window size (width, height); set_window_rect(1366, 768) would set the position instead
    browser.set_window_size(1366, 768)
    browser.get("https://www.baidu.com/")
    # save a screenshot to preview the page (optional)
    browser.get_screenshot_as_file("preview.png")