Selenium hidden browser features

Selenium Features
- 1. CDP
- 2. stealth.min.js
- 3. undetected_chromedriver
- 4. Operate an open browser
- 4. Common ways to hide Selenium features
  - 4.1 Modify navigator.webdriver flag
  - 4.2 Change user-agent
  - 4.3 Exclude or turn off some Selenium-related switches
  - 4.4 Code display
  - 4.5 Summary

Selenium features

When we use Selenium to crawl web pages directly, without any extra processing, the browser exposes many features that identify it as automated.
Websites with anti-crawling measures detect these features in order to block malicious crawlers.

Source URL:
https://blog.csdn.net/m0_67695717/article/details/128866017
https://blog.csdn.net/m0_67695717/article/details/130687622
https://blog.csdn.net/houmenghu/article/details/120489611

1. CDP

CDP stands for Chrome DevTools Protocol:

https://chromedevtools.github.io/devtools-protocol/

A CDP command can run a piece of JavaScript before each page loads, which lets us change the browser's fingerprint features.

For example, window.navigator.webdriver returns true when Selenium opens a web page. In a browser opened manually, the value is false (undefined in older Chrome versions).

We can therefore use a CDP command to override this value and hide that fingerprint feature.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import time

chrome_options = Options()

s = Service(r"chromedriver.exe path")

driver = webdriver.Chrome(service=s, options=chrome_options)

# Execute the cdp command to modify the value of the (window.navigator.webdriver) object
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
            Object.defineProperty(navigator, 'webdriver', {
              get: () => undefined
            })
            """
})

driver.get(url='URL')

driver.save_screenshot('result.png')

# Save the page source
source = driver.page_source
with open('result.html', 'w', encoding='utf-8') as f:
    f.write(source)

time.sleep(200)

Note that browsers expose many fingerprint features, and this method patches only the one property it targets, so other signals can still reveal automation.
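One way to cover more than one property with the same CDP command is to generate the injected script from a small table of overrides. The sketch below only builds the JavaScript and the parameter dictionary. The helper names build_patch_script and build_cdp_params, and the choice of languages as a second property, are illustrative assumptions rather than part of Selenium.

```python
import json


def build_patch_script(overrides):
    """Return JavaScript that redefines navigator properties.

    overrides maps a navigator property name to a JSON-serializable value;
    None becomes the JavaScript value undefined.
    """
    statements = []
    for name, value in overrides.items():
        js_value = "undefined" if value is None else json.dumps(value)
        statements.append(
            "Object.defineProperty(navigator, %s, {get: () => %s});"
            % (json.dumps(name), js_value)
        )
    return "\n".join(statements)


def build_cdp_params(overrides):
    """Build the params dict for Page.addScriptToEvaluateOnNewDocument."""
    return {"source": build_patch_script(overrides)}


params = build_cdp_params({"webdriver": None, "languages": ["en-US", "en"]})
print(params["source"])
# With a live driver (not run here):
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", params)
```

Even with several properties patched this way, a site can still use signals that a script of this kind does not touch, so treat it as a partial measure.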

2. stealth.min.js

stealth.min.js bundles scripts that mask many common browser fingerprint features. We only need to read the file and pass its contents to the same CDP command.

Download link:

https://github.com/berstend/puppeteer-extra/tree/stealth-js

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

chrome_options = Options()

# Headless mode
# chrome_options.add_argument("--headless")

# Set the User-Agent
chrome_options.add_argument(
    'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36')

s = Service(r"chromedriver.exe path")

driver = webdriver.Chrome(service=s, options=chrome_options)

# Use stealth.min.js to hide browser fingerprint features
# stealth.min.js download address: https://github.com/berstend/puppeteer-extra/tree/stealth-js
with open('./stealth.min.js') as f:
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": f.read()
    })

driver.get(url='URL')
# driver.get(url='https://bot.sannysoft.com/')

# Save a screenshot
driver.save_screenshot('result.png')

time.sleep(200)
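
A slightly more defensive way to load the script is sketched below, assuming the file is UTF-8 text. load_stealth_params is a hypothetical helper name, not part of Selenium or puppeteer-extra. It fails with a clear message when the file is missing or empty instead of injecting nothing.

```python
from pathlib import Path


def load_stealth_params(path="./stealth.min.js"):
    """Read the stealth script and wrap it as CDP parameters."""
    script_path = Path(path)
    if not script_path.is_file():
        raise FileNotFoundError(
            "stealth script not found: %s (download it first)" % script_path
        )
    source = script_path.read_text(encoding="utf-8")
    if not source.strip():
        raise ValueError("stealth script is empty: %s" % script_path)
    return {"source": source}


# With a live driver (not run here):
# driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument",
#                        load_stealth_params())
```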

3. undetected_chromedriver

undetected_chromedriver is a library that makes the browser's automation fingerprint harder to recognize. It can download and configure a matching driver automatically and then run it.

Project address: https://github.com/ultrafunkamsterdam/undetected-chromedriver

First, install the library:

# Install dependencies
pip3 install undetected-chromedriver

Then the following few lines of code hide most of the browser's fingerprint features, although no tool can guarantee that automation goes undetected:

import time
import undetected_chromedriver as uc

# Use the library's own ChromeOptions class rather than Selenium's Options
chrome_options = uc.ChromeOptions()
# chrome_options.add_argument("--headless")

# undetected_chromedriver downloads and patches a matching chromedriver itself,
# so no Service object or driver path is passed here
driver = uc.Chrome(options=chrome_options)

driver.get(url='URL')
# driver.get(url='https://bot.sannysoft.com/')

driver.save_screenshot('result.png')
time.sleep(100)
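
The scripts so far keep the window open with time.sleep and never close the driver. A minimal sketch of a cleanup pattern follows, assuming only that the driver object has a quit() method. managed_driver is a hypothetical helper name, not part of Selenium or undetected_chromedriver.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always call quit() on exit."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the body raises an exception
        driver.quit()


# Usage with undetected_chromedriver (not run here):
# import undetected_chromedriver as uc
# with managed_driver(uc.Chrome) as driver:
#     driver.get("URL")
#     driver.save_screenshot("result.png")
```

Closing the driver this way avoids leaving stray Chrome and chromedriver processes behind when a page load fails.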

4. Operate an open browser

Selenium can also attach to a browser that is already open.

We just need to start the browser from the command line with remote debugging enabled:

import subprocess

# Use the current browser profile:
# "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
# Use a fresh profile (the folder is created only on the first launch):
# "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="path to any empty folder"

# Raw strings keep the backslashes literal, and passing a list avoids
# quoting problems with the space in "Program Files"
cmd = [
    r"C:\Program Files\Google\Chrome\Application\chrome.exe",
    "--remote-debugging-port=9222",
    r"--user-data-dir=C:\selenum\user_data",
]
# Popen returns immediately; subprocess.run would block until Chrome exits
subprocess.Popen(cmd)
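
Chrome needs a moment after launch before the debugging port accepts connections. The sketch below waits for the port using only the standard library. wait_for_port is a hypothetical helper name, and the default timeout and polling interval are arbitrary assumptions.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=9222, timeout=10.0, interval=0.2):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Usage (not run here): call this after launching Chrome and before
# attaching Selenium to 127.0.0.1:9222
# if not wait_for_port():
#     raise RuntimeError("Chrome debugging port 9222 did not open")
```

An open port only shows that some process is listening. It does not prove that the process is the Chrome instance started above.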

Then attach Selenium to that browser through its debugging address. Because the browser was started normally rather than by the driver, it behaves like one operated by hand.

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
# The chromedriver file is in the current folder here, so it can be called like this
# On Windows, use ./chromedriver.exe
driver = Chrome(options=chrome_options)

driver.get('http://exercise.kingname.info/exercise_login_success')
input('Enter any content to continue')
driver.get('https://www.kingname.info')
input('Enter any content to continue')
driver.get('http://exercise.kingname.info/exercise_login_success')

4. Common ways to hide Selenium features

Hiding Selenium features matters for both automated web testing and crawling. With the three methods below, the browser looks more like one driven by a normal user and is less likely to be detected and denied access. Section 4.4 combines them in a demo that collects reviews from Dianping through a proxy IP, as a real crawler typically would.

4.1 Modify navigator.webdriver flag

navigator.webdriver is a property provided by the browser to indicate whether it is controlled by WebDriver. By default, the value is true if the browser is driven by Selenium and false otherwise. Using execute_cdp_cmd, we can send a Chrome DevTools Protocol command that overrides this property so it returns false or undefined, which hides this Selenium feature.

4.2 Change user-agent

The user-agent is a string the browser sends to the website to identify its type and version. Some websites infer the user's device and operating system from it, and a user-agent outside the normal range raises the suspicion that the browser is driven by Selenium. We can call execute_cdp_cmd with the Network.setUserAgentOverride command to change the user-agent to any value we want.
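
As a minimal sketch, the parameters for Network.setUserAgentOverride can be built by a small function. userAgent, acceptLanguage and platform are fields defined by the protocol. The helper name build_user_agent_override and the sample values are assumptions made for illustration.

```python
def build_user_agent_override(user_agent, accept_language=None, platform=None):
    """Build params for the CDP command Network.setUserAgentOverride.

    userAgent is required by the protocol; acceptLanguage and platform
    are optional and are only included when a value is given.
    """
    if not user_agent or not user_agent.strip():
        raise ValueError("user_agent must be a non-empty string")
    params = {"userAgent": user_agent}
    if accept_language is not None:
        params["acceptLanguage"] = accept_language
    if platform is not None:
        params["platform"] = platform
    return params


ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36")
params = build_user_agent_override(ua, accept_language="en-US,en", platform="Win32")
print(params)
# With a live driver (not run here):
# driver.execute_cdp_cmd("Network.setUserAgentOverride", params)
```

Setting platform together with the user-agent keeps navigator.platform consistent with the operating system the string claims. A mismatch between the two is itself a detectable signal.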

4.3 Exclude or turn off some Selenium-related switches

enable-automation and useAutomationExtension are two common Selenium-related settings that affect the behavior and appearance of the browser, such as showing the "Chrome is being controlled by automated test software" prompt in the browser window. enable-automation is a command-line switch that can be excluded through excludeSwitches, while useAutomationExtension is an experimental option that is turned off by setting it to False. Adjusting both through Chrome options makes the browser look more like a normal one.

4.4 Code display

from selenium import webdriver

# Yiniu Cloud (16yun) enhanced crawler proxy: address, port, username and password
proxy_address = 'www.16yun.cn'
proxy_port = '3100'
proxy_username = '16YUN'
proxy_password = '16IP'

# Set Chrome options: hide Selenium features, set the proxy IP,
# and exclude or turn off Selenium-related switches
options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('--disable-extensions')
options.add_argument('--disable-gpu')
options.add_argument('--disable-infobars')
options.add_argument('--disable-notifications')
options.add_argument('--disable-popup-blocking')
options.add_argument('--disable-web-security')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--no-sandbox')
options.add_argument('--start-maximized')
options.add_argument('--user-data-dir=/dev/null')
options.add_argument('--proxy-server={}:{}'.format(proxy_address, proxy_port))
# Caution: '--proxy-auth' is not a documented Chrome switch and is most likely ignored,
# so the username and password may have to be supplied another way
options.add_argument('--proxy-auth={}:{}'.format(proxy_username, proxy_password))
# enable-automation is a switch; useAutomationExtension is an experimental option, not a switch
options.add_experimental_option('excludeSwitches', ['enable-automation'])
options.add_experimental_option('useAutomationExtension', False)

# Initialize the Chrome browser with the options above
driver = webdriver.Chrome(options=options)

# Hide the navigator.webdriver flag by making it return undefined
driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})

# Override the user-agent
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
driver.execute_cdp_cmd("Network.setUserAgentOverride", {"userAgent": user_agent})

# Visit the review page of a shop on Dianping
url = 'https://www.dianping.com/shop/1234567/review_all'
driver.get(url)
# Add additional code here to perform the tasks you want

4.5 Summary

The code above launches Chrome with options that hide Selenium features, route traffic through the proxy IP (with its username and password), and exclude or turn off Selenium-related switches. It then calls execute_cdp_cmd twice: once with Page.addScriptToEvaluateOnNewDocument so that navigator.webdriver returns undefined, and once with Network.setUserAgentOverride to replace the user-agent with the specified string. Finally, it visits the shop's review page on Dianping, where you can add code for your own tasks.