One article to understand how to use Selenium and install its browser drivers

1. Selenium simulated login

Table of Contents

  • Driver installation
  • Baidu example
  • Find Node
  • Action chain
  • Run js
  • Get node
  • Switch Frame
  • Delay waiting
  • Tabs
  • Unblock
  • Headless mode

Previously, with requests and similar libraries, the URLs to crawl were ready-made and could be requested directly. When a site adds encrypted parameters such as token or sign, the alternative to reverse-engineering its JavaScript is to drive a real browser and simulate the login, so that whatever you can see on the page you can also crawl.

Official website API
https://www.selenium.dev/documentation/

Driver installation

Installation address

Be sure to check your browser's version number first (it is shown in the browser's settings), because a driver built for a different version may not be compatible.

  1. Edge
    https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

  2. Chrome
    http://chromedriver.storage.googleapis.com/index.html

  3. Firefox
    https://github.com/mozilla/geckodriver/releases
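
The version check mentioned above can be sketched in a few lines. The helper names `major_version` and `versions_match` are hypothetical, and comparing only the major number is an assumption that suits the Chrome and Edge drivers; geckodriver uses its own version scheme, so check its release notes for Firefox.

```python
def major_version(version: str) -> int:
    """Return the leading number of a dotted version string such as '114.0.5735.90'."""
    return int(version.strip().split(".")[0])


def versions_match(browser_version: str, driver_version: str) -> bool:
    """True when browser and driver share the same major version (assumed rule)."""
    return major_version(browser_version) == major_version(driver_version)


# Example values typed in by hand; read yours from the browser's settings page.
print(versions_match("114.0.5735.199", "114.0.5735.90"))
print(versions_match("115.0.5790.102", "114.0.5735.90"))
```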

How to install

The simplest approach is to drop the driver's exe file into Python's Scripts directory. That directory is normally added to the PATH environment variable when Python is installed, so no further configuration is needed.

Take care when installing the Edge driver: copy msedgedriver.exe, name the copy MicrosoftWebDriver.exe, and put both files into the directory.
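
To confirm that the drivers really are reachable after copying them, a small check with the standard library can help. This is only a sketch: `find_drivers` and `DRIVER_NAMES` are hypothetical names, and the executable names are assumed to be the ones the vendors publish.

```python
import shutil

# Executable names as published by each vendor; on Windows, shutil.which
# also tries the ".exe" extension automatically.
DRIVER_NAMES = ("geckodriver", "chromedriver", "msedgedriver")


def find_drivers(names=DRIVER_NAMES):
    """Map each driver name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


for name, path in find_drivers().items():
    print(f"{name}: {path or 'not found on PATH'}")
```

Recent Selenium releases (4.6 and later) also ship Selenium Manager, which can often locate or download a matching driver on its own, so a missing entry here is not necessarily fatal.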

Baidu example

This example uses Selenium to search for a keyword and then prints the current URL, the cookies, and the page source.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Firefox()
try:
    browser.get("https://www.baidu.com")
    # Locate the search input box
    input_box = browser.find_element(By.ID, "kw")
    # Type the keyword
    input_box.send_keys("Python")
    # Press Enter to submit the search
    input_box.send_keys(Keys.ENTER)
    wait = WebDriverWait(browser, 3)
    # Wait (up to 3 seconds) until the results container content_left is present
    wait.until(EC.presence_of_element_located((By.ID, "content_left")))
    print(browser.current_url)
    print(browser.get_cookies())
    # Page source code
    print(browser.page_source)
finally:
    # quit() ends the whole session; close() would only close the current window
    browser.quit()
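
The try/finally pattern in the example can be wrapped once and reused. The sketch below is one possible way to do it; `managed_browser` is a hypothetical helper, not part of Selenium, and it only assumes that the object returned by the factory has a `quit()` method.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with factory() and always quit it afterwards."""
    browser = factory()
    try:
        yield browser
    finally:
        # quit() ends the whole session, not just the current window.
        browser.quit()


# Usage with a real browser (requires selenium and a driver on PATH):
#
#     from selenium import webdriver
#     with managed_browser(webdriver.Firefox) as browser:
#         browser.get("https://www.baidu.com")
#         print(browser.title)
```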

Find Node

class By:
    """Set of supported locator strategies."""

    ID = "id"
    XPATH = "xpath"
    LINK_TEXT = "link text"
    PARTIAL_LINK_TEXT = "partial link text"
    NAME = "name"
    TAG_NAME = "tag name"
    CLASS_NAME = "class name"
    CSS_SELECTOR = "css selector"
    

from selenium.webdriver.common.by import By
# single node: returns the first match
input_box = browser.find_element(By.ID, "kw")
# multiple nodes: returns a list of all matches
items = browser.find_elements(By.CSS_SELECTOR, "li")
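
Because `find_elements` returns a list (empty when nothing matches, whereas `find_element` raises NoSuchElementException), it combines naturally with a list comprehension. A minimal sketch, assuming a hypothetical helper named `collect_texts` and any driver object that offers `find_elements`:

```python
def collect_texts(driver, css_selector):
    """Return the stripped, non-empty text of every node matching css_selector."""
    # "css selector" is the value of By.CSS_SELECTOR shown in the By class above.
    nodes = driver.find_elements("css selector", css_selector)
    return [node.text.strip() for node in nodes if node.text.strip()]


# Usage with a real browser:
#
#     print(collect_texts(browser, "li"))
```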

Action chain

1. .send_keys() # input text
2. .clear() # clear input
3. .click() # click
4. ActionChains(...).drag_and_drop() # drag and drop
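
The first three operations are usually combined when filling in a form. The sketch below shows one such combination; `fill_and_submit` is a hypothetical helper, and its arguments are assumed to be nodes already located with `find_element`.

```python
def fill_and_submit(input_box, button, text):
    """Clear an input node, type text into it, then click a button node."""
    input_box.clear()          # remove any existing content
    input_box.send_keys(text)  # type the new text
    button.click()             # submit


# Usage with a real browser (locators are examples only):
#
#     fill_and_submit(browser.find_element("id", "kw"),
#                     browser.find_element("id", "su"),
#                     "Python")
```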


Run js

Selenium has no dedicated API for some operations, such as scrolling down so that lazily loaded data is displayed. In those cases we can run JavaScript directly through execute_script.

from selenium import webdriver

browser = webdriver.Firefox()

browser.get("https://www.zhihu.com/explore")
browser.execute_script("window.scrollTo(0, document.body.scrollHeight)")
browser.execute_script('alert("to the bottom")')
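
On pages that keep loading data as you scroll, a single scrollTo is often not enough. The following sketch repeats the scroll until the page height stops growing. `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names, and the approach assumes the page grows `document.body.scrollHeight` when new data arrives.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=10):
    """Scroll down until the page height stops growing; return the final height."""
    height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == height:
            break  # nothing new was loaded, so we are at the bottom
        height = new_height
    return height
```

A value returned by the script (here the height) comes back from execute_script as a Python value, which is what makes the loop possible.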

Get node

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()

browser.get("https://spa2.scrape.center/")
logo = browser.find_element(By.CLASS_NAME, "logo-image")
print(logo)
# get an attribute
print(logo.get_attribute("src"))
# get text
print(logo.text)
# get id (Selenium's internal element id)
print(logo.id)
# get location
print(logo.location)
# get width and height
print(logo.size)
time.sleep(1)
browser.quit()
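
When several of these properties are needed at once, it can be convenient to gather them into a dictionary. This is only a sketch: `describe_element` is a hypothetical helper, and it relies solely on the node members used in the example above.

```python
def describe_element(element, attribute="src"):
    """Collect the commonly used properties of a node into a plain dict."""
    return {
        attribute: element.get_attribute(attribute),
        "text": element.text,
        "id": element.id,              # Selenium's internal reference, not the HTML id
        "location": element.location,  # position on the page: x and y
        "size": element.size,          # height and width
    }


# Usage with a real browser:
#
#     print(describe_element(logo))
```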

Switch Frame

A web page can contain an iframe, that is, a sub-frame that behaves like a page embedded in the page. Selenium operates on the parent (top-level) page by default, so to work with nodes inside the iframe we must first switch into it with switch_to.frame.

  1. error example
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()

browser.get("https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable")
source = browser.find_element(By.ID, "draggable")
target = browser.find_element(By.ID, "droppable")
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()

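Running this raises an error saying that the node cannot be located (NoSuchElementException), because both nodes live inside the iframe rather than in the parent page.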

  2. switch frame
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()

browser.get("https://www.runoob.com/try/try.php?filename=jqueryui-api-droppable")
# Switch to the specified frame
browser.switch_to.frame("iframeResult")
source = browser.find_element(By.ID, "draggable")
target = browser.find_element(By.ID, "droppable")
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
# Execute the action
actions.perform()

# Switch back to the parent frame (parent_frame() takes no arguments)
browser.switch_to.parent_frame()
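
Forgetting to switch back is an easy mistake, so the switch can be tied to a with-block. The sketch below is one way to do that; `inside_frame` is a hypothetical helper, and it returns to the top-level document with `switch_to.default_content()` rather than to the immediate parent.

```python
from contextlib import contextmanager


@contextmanager
def inside_frame(driver, frame_reference):
    """Switch into a frame for the duration of a with-block, then switch back."""
    driver.switch_to.frame(frame_reference)
    try:
        yield driver
    finally:
        # Return to the top-level document even if the block raised.
        driver.switch_to.default_content()


# Usage with a real browser:
#
#     with inside_frame(browser, "iframeResult"):
#         source = browser.find_element("id", "draggable")
```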

Delay waiting

  1. implicit wait

With an implicit wait, when a node is not found immediately, the driver keeps polling the DOM for up to the given time before raising an error. The default time is 0, which means no waiting.

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Firefox()
browser.implicitly_wait(5)

browser.get("https://spa2.scrape.center/")

logo = browser.find_element(By.CLASS_NAME, "logo-image")
print(logo)

  2. explicit wait

The drawback of an implicit wait is that it is a single fixed timeout chosen in advance, while the speed of a website varies with the network. An explicit wait instead sets a maximum time for a specific condition on a specific node: if the condition is met early, the wait returns immediately; if the time runs out, it raises a TimeoutException.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get("https://www.taobao.com/")
wait = WebDriverWait(browser, 5)

# node exists
input_box = wait.until(EC.presence_of_element_located((By.ID, "q")))
# button is clickable
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn-search")))
print(input_box, button)
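
Conceptually, an explicit wait is a polling loop with a deadline. The sketch below illustrates that idea in plain Python; it is not Selenium's implementation, and `wait_until` is a hypothetical name. The real WebDriverWait additionally passes the driver to the condition and, by default, ignores NoSuchElementException while polling.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() repeatedly; return its first truthy result or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # found early: return without waiting out the full timeout
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)
```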
  3. waiting condition
from selenium.webdriver.support import expected_conditions as EC
EC.***
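
Besides the ready-made conditions in expected_conditions, `WebDriverWait.until` accepts any callable that takes the driver and returns a truthy value once the condition holds. A minimal sketch of a custom condition follows; `element_count_at_least` is a hypothetical name, not part of Selenium.

```python
def element_count_at_least(locator, minimum):
    """Build a wait condition that passes once enough matching nodes exist."""
    def condition(driver):
        nodes = driver.find_elements(*locator)
        # Returning False tells the wait to keep polling.
        return nodes if len(nodes) >= minimum else False
    return condition


# Usage with a real browser:
#
#     items = WebDriverWait(browser, 5).until(
#         element_count_at_least(("css selector", "li"), 3))
```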
  4. forward, backward
browser.forward()
browser.back()

Tabs

import time
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://www.baidu.com/")

# open a new tab
browser.execute_script("window.open()")
print(browser.window_handles)

# switch to the second
browser.switch_to.window(browser.window_handles[1])
browser.get("https://www.taobao.com/")
time.sleep(1)

# switch back to the first
browser.switch_to.window(browser.window_handles[0])
browser.get("https://zhihu.com")
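
The open-and-switch steps can be bundled into one helper. In the sketch below, `open_in_new_tab` is a hypothetical name; it finds the new tab by comparing the handles before and after `window.open()`, so it does not depend on the order of `window_handles`.

```python
def open_in_new_tab(driver, url):
    """Open url in a new tab, leave the driver focused on it, and return its handle."""
    before = set(driver.window_handles)
    driver.execute_script("window.open()")
    # The new tab is the only handle that was not there before.
    new_handle = next(h for h in driver.window_handles if h not in before)
    driver.switch_to.window(new_handle)
    driver.get(url)
    return new_handle


# Usage with a real browser:
#
#     first = browser.current_window_handle
#     open_in_new_tab(browser, "https://www.taobao.com/")
#     browser.switch_to.window(first)  # go back to the original tab
```

Selenium 4 also offers `browser.switch_to.new_window("tab")`, which opens a tab and switches to it in one call.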

Unblock

Some websites detect Selenium. The detection works by checking whether the navigator object of the current browser window has the webdriver attribute set: in normal use the value of this attribute is undefined, whereas under Selenium it is set.

Typical example: https://antispider1.scrape.center/

If we crawl this site directly with Selenium, the page shows Webdriver Forbidden.

The problem is that the webdriver attribute is defined, so a first idea is to reset it to undefined through execute_script:

browser.execute_script('Object.defineProperty(navigator, "webdriver", {get: () => undefined})')

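This does not work: execute_script only runs after the page has loaded, by which time the site's check has already been performed.

The solution is the CDP (Chrome DevTools Protocol) command Page.addScriptToEvaluateOnNewDocument, which runs a JavaScript statement in every new document as soon as it is created, before the page's own scripts execute.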

import time
from selenium import webdriver
from selenium.webdriver import EdgeOptions

option = EdgeOptions()
option.add_experimental_option("excludeSwitches", ["enable-automation"])
option.add_experimental_option("useAutomationExtension", False)
browser = webdriver.Edge(options=option)
browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})
browser.get("https://antispider1.scrape.center/")
time.sleep(5)
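
To check whether the override took effect, the flag can be read back from the page. The sketch below wraps both steps; `hide_webdriver_flag`, `webdriver_flag`, and `HIDE_WEBDRIVER_JS` are hypothetical names. It assumes a Chromium-based driver (Edge or Chrome), since `execute_cdp_cmd` is not available on Firefox.

```python
HIDE_WEBDRIVER_JS = 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'


def hide_webdriver_flag(driver):
    """Register the override so it runs in every new document before page scripts."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": HIDE_WEBDRIVER_JS}
    )


def webdriver_flag(driver):
    """Read navigator.webdriver from the current page (None means undefined)."""
    return driver.execute_script("return navigator.webdriver")


# Usage with a real browser: register first, then load the page, then check.
#
#     hide_webdriver_flag(browser)
#     browser.get("https://antispider1.scrape.center/")
#     print(webdriver_flag(browser))
```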

Headless mode

Headless mode runs the browser without opening a window. It can also reduce the loading of some resources, which saves loading time and bandwidth.

from selenium import webdriver
from selenium.webdriver import EdgeOptions

option = EdgeOptions()
# headless mode
option.add_argument("--headless")
browser = webdriver.Edge(options=option)
# set_window_size takes (width, height); set_window_rect's first two arguments are x and y
browser.set_window_size(1366, 768)
browser.get("https://www.baidu.com/")
# optional: save a screenshot to preview what the headless browser rendered
browser.get_screenshot_as_file("preview.png")
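
The window size can also be fixed at start-up through a browser argument instead of a call after launch. A minimal sketch, assuming a Chromium-style options object such as EdgeOptions; `apply_headless` is a hypothetical helper name.

```python
def apply_headless(options, width=1366, height=768):
    """Add headless and window-size arguments to a Chromium-style options object."""
    options.add_argument("--headless")
    options.add_argument(f"--window-size={width},{height}")
    return options


# Usage with a real browser:
#
#     from selenium import webdriver
#     from selenium.webdriver import EdgeOptions
#     browser = webdriver.Edge(options=apply_headless(EdgeOptions()))
```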