Hiding Selenium browser features
- Selenium Features
- 1. CDP
- 2. stealth.min.js
- 3. undetected_chromedriver
- 4. Operate an open browser
- 4. Common ways to hide Selenium features
- 4.1 Modify navigator.webdriver flag
- 4.2 Change user-agent
- 4.3 Exclude or turn off some Selenium-related switches
- 4.4 Code display
- 4.5 Summary
Selenium features
When Selenium is used to crawl web pages without any extra configuration, it exposes many identifying features.
Websites with anti-crawling protection detect these features in order to block malicious crawlers.
Source URLs:
https://blog.csdn.net/m0_67695717/article/details/128866017
https://blog.csdn.net/m0_67695717/article/details/130687622
https://blog.csdn.net/houmenghu/article/details/120489611
1. CDP
CDP stands for Chrome DevTools Protocol.
https://chromedevtools.github.io/devtools-protocol/
By executing a CDP command, you can run a piece of JavaScript before the page's own scripts load, thereby changing the browser's fingerprint characteristics.
For example, window.navigator.webdriver returns true when Selenium opens a web page directly; in a manually opened browser the value is undefined (or false in recent Chrome versions).
Therefore, we can use a CDP command to override this property and hide that fingerprint characteristic.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    import time

    chrome_options = Options()
    s = Service(r"chromedriver.exe path")
    driver = webdriver.Chrome(service=s, options=chrome_options)

    # Execute the CDP command to override the value of window.navigator.webdriver
    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })

    driver.get(url='URL')
    driver.save_screenshot('result.png')

    # Save the page source
    source = driver.page_source
    with open('result.html', 'w', encoding='utf-8') as f:
        f.write(source)

    time.sleep(200)
Note that browsers expose many fingerprint features, so patching a single property this way has its limits.
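To see which features remain exposed after the patch, the values can be read back from inside the page. The following is a minimal sketch, not a complete detector: the helper names `hide_webdriver_flag` and `probe_fingerprint` and the list of probed properties are assumptions made for illustration, and `driver` is expected to be a Selenium Chrome WebDriver.

```python
# Hypothetical helpers; `driver` is any Selenium Chrome WebDriver instance.
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

# A few properties that detection scripts often inspect (illustrative, incomplete).
PROBE_JS = """
return {
    webdriver: navigator.webdriver,
    languages: navigator.languages,
    pluginCount: navigator.plugins.length,
    userAgent: navigator.userAgent
};
"""


def hide_webdriver_flag(driver):
    """Register the patch so it is evaluated in every new document."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": HIDE_WEBDRIVER_JS}
    )


def probe_fingerprint(driver):
    """Return the probed values as seen by scripts running in the page."""
    return driver.execute_script(PROBE_JS)


# Typical use (requires a real browser session):
#   hide_webdriver_flag(driver)
#   driver.get("https://bot.sannysoft.com/")
#   print(probe_fingerprint(driver))
```

Calling `probe_fingerprint` before and after `hide_webdriver_flag` shows that only the `webdriver` entry changes; the other values stay as the automated browser reports them.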
2. stealth.min.js
This file contains patches for the common browser fingerprint features. We only need to read the file and pass its contents to the same CDP command.
Download link:
https://github.com/berstend/puppeteer-extra/tree/stealth-js
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    import time

    chrome_options = Options()
    # Headless mode
    # chrome_options.add_argument("--headless")
    # Set the user-agent request header
    chrome_options.add_argument(
        'user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36')

    s = Service(r"chromedriver.exe path")
    driver = webdriver.Chrome(service=s, options=chrome_options)

    # Use stealth.min.js to hide browser fingerprint features
    # stealth.min.js download address: https://github.com/berstend/puppeteer-extra/tree/stealth-js
    with open('./stealth.min.js') as f:
        driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
            "source": f.read()
        })

    driver.get(url='URL')
    # driver.get(url='https://bot.sannysoft.com/')

    # Save a screenshot
    driver.save_screenshot('result.png')
    time.sleep(200)
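If the script file is missing or empty, the snippet above either fails with a bare file error or silently injects nothing. A slightly more defensive loader might look like the sketch below; the function name `inject_script_file` is a hypothetical helper, and the default path simply mirrors the one used above.

```python
from pathlib import Path


def inject_script_file(driver, script_path="./stealth.min.js"):
    """Read a JavaScript file and register it to run in every new document.

    Returns the number of characters injected.
    """
    path = Path(script_path)
    if not path.is_file():
        raise FileNotFoundError(f"stealth script not found: {path}")

    # Read explicitly as UTF-8 so the result does not depend on the OS default encoding.
    source = path.read_text(encoding="utf-8")
    if not source.strip():
        raise ValueError(f"stealth script is empty: {path}")

    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": source}
    )
    return len(source)


# Typical use (requires a real browser session):
#   inject_script_file(driver, "./stealth.min.js")
#   driver.get("https://bot.sannysoft.com/")
```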
3. undetected_chromedriver
This is a library that keeps the browser's fingerprint features from being recognized. It can automatically download and configure the driver, then run it.
Project address: https://github.com/ultrafunkamsterdam/undetected-chromedriver
First, install the library:
    pip3 install undetected-chromedriver
Then, with the following few lines of code, most of the browser's fingerprint characteristics are hidden:
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    import time
    import undetected_chromedriver as uc

    chrome_options = Options()
    # chrome_options.add_argument("--headless")

    s = Service(r"chromedriver.exe")
    driver = uc.Chrome(service=s, options=chrome_options)

    driver.get(url='URL')
    # driver.get(url='https://bot.sannysoft.com/')

    driver.save_screenshot('result.png')
    time.sleep(100)
4. Operate an open browser
Selenium can also take over a browser that is already open.
We just need to start the browser from the command line with a remote debugging port:
    import subprocess

    # Use the current browser profile:
    # "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
    # Use a fresh profile; the folder is only created on the first run:
    # "C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="path to any empty folder"

    # Raw strings keep the backslashes literal, and passing a list avoids
    # quoting problems with the space in "Program Files"
    cmd = [
        r"C:\Program Files\Google\Chrome\Application\chrome.exe",
        "--remote-debugging-port=9222",
        r"--user-data-dir=C:\selenum\user_data",
    ]
    # run() blocks until this Chrome window is closed, so start the Selenium script separately
    subprocess.run(cmd)
Then attach Selenium to that browser and drive it directly, so the session behaves like a normally operated browser.
    from selenium.webdriver import Chrome
    from selenium.webdriver.chrome.options import Options

    chrome_options = Options()
    chrome_options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")
    # Note: the chromedriver file is in the current folder here, so it can be called like this
    # On Windows, use ./chromedriver.exe instead
    driver = Chrome(options=chrome_options)

    driver.get('http://exercise.kingname.info/exercise_login_success')
    input('Enter any content to continue')
    driver.get('https://www.kingname.info')
    input('Enter any content to continue')
    driver.get('http://exercise.kingname.info/exercise_login_success')
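Attaching fails if Chrome has not yet opened its debugging port. One way to handle this, sketched below with only the standard library, is to poll the port before creating the driver. The names `build_chrome_command` and `wait_for_port`, and the default timeout values, are assumptions made for illustration.

```python
import socket
import time


def build_chrome_command(chrome_path, port, user_data_dir):
    """Build the argument list for launching Chrome with remote debugging."""
    return [
        chrome_path,
        f"--remote-debugging-port={port}",
        f"--user-data-dir={user_data_dir}",
    ]


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Typical use:
#   subprocess.Popen(build_chrome_command(chrome_path, 9222, profile_dir))
#   if wait_for_port("127.0.0.1", 9222):
#       ...  # create the Chrome driver with debuggerAddress as shown above
```

Passing the command as a list also sidesteps the quoting issues that come with spaces in the Chrome install path.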
4. Common ways to hide Selenium features
Hiding Selenium features is key to automating a browser against sites that detect automation. The three methods below make the browser look more like one operated by a normal user, so it is less likely to be detected and denied access. Section 4.4 then combines them in a demo that collects reviews from Dianping and also sets the proxy IP that a real crawler typically needs.
4.1 Modify navigator.webdriver flag
navigator.webdriver is a browser property that indicates whether the browser is controlled by WebDriver. By default, its value is true if the browser is driven by Selenium and false otherwise. We can run a Chrome DevTools Protocol command through execute_cdp_cmd to change the value to false or undefined and hide this Selenium feature.
4.2 Change user-agent
The user-agent is a string the browser sends to the website to identify the browser type and version. Some websites infer the user's device and operating system from it. If the user-agent looks abnormal, the site may suspect a Selenium-driven browser. We can call the Network.setUserAgentOverride command through execute_cdp_cmd to change the user-agent to any value we want and hide this Selenium characteristic.
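An overridden user-agent that claims a different Chrome version than the one actually running is itself an inconsistency a site could notice. The sketch below applies the override and reports whether the claimed major version matches the real browser; `chrome_major_from_ua` and `override_user_agent` are hypothetical helper names, and reading `browserVersion` from `driver.capabilities` assumes a W3C-compliant Chrome session.

```python
import re


def chrome_major_from_ua(user_agent):
    """Extract the Chrome major version from a user-agent string, or None."""
    match = re.search(r"Chrome/(\d+)\.", user_agent)
    return int(match.group(1)) if match else None


def override_user_agent(driver, user_agent):
    """Apply the override via CDP.

    Returns True if the user-agent's Chrome major version matches the
    version the browser session actually reports, False otherwise.
    """
    driver.execute_cdp_cmd(
        "Network.setUserAgentOverride", {"userAgent": user_agent}
    )
    real_version = str(driver.capabilities.get("browserVersion", ""))
    first_part = real_version.split(".")[0]
    real_major = int(first_part) if first_part.isdigit() else None
    claimed_major = chrome_major_from_ua(user_agent)
    return claimed_major is not None and claimed_major == real_major
```

A False result does not mean the override failed, only that the claimed version and the real one differ.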
4.3 Exclude or turn off some Selenium-related switches
enable-automation and useAutomationExtension are two common Selenium-related settings that affect the behavior and appearance of the browser, such as the "Chrome is being controlled by automated test software" prompt shown in the browser window. enable-automation is a Chrome command-line switch that can be excluded through the excludeSwitches option, while useAutomationExtension is a ChromeDriver experimental option that can be set to False. Adjusting them through Chrome options makes the browser look more like a normal one and hides these Selenium features.
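In isolation, these settings can be applied as in the sketch below. The function name `apply_automation_options` is a hypothetical helper; `options` is expected to be a Selenium ChromeOptions object. Whether useAutomationExtension still has any effect depends on the Chrome and ChromeDriver versions in use, so treat it as an assumption to verify.

```python
def apply_automation_options(options):
    """Apply the automation-related settings to a ChromeOptions-like object."""
    # Drop the switch that triggers the "controlled by automated test software" infobar.
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    # An experimental option rather than a switch, so it is set to False instead of excluded.
    options.add_experimental_option("useAutomationExtension", False)
    # Stops Blink from enabling its AutomationControlled feature.
    options.add_argument("--disable-blink-features=AutomationControlled")
    return options


# Typical use (requires Selenium and Chrome):
#   from selenium import webdriver
#   options = apply_automation_options(webdriver.ChromeOptions())
#   driver = webdriver.Chrome(options=options)
```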
4.4 Code display
    from selenium import webdriver
    from selenium.webdriver.common.proxy import Proxy, ProxyType

    # Yiniu Cloud Crawler Enhanced Edition proxy IP address, port number, user name and password
    proxy_address = 'www.16yun.cn'
    proxy_port = '3100'
    proxy_username = '16YUN'
    proxy_password = '16IP'

    # Set Chrome options: hide Selenium features, set the proxy IP,
    # and exclude or turn off some Selenium-related switches
    options = webdriver.ChromeOptions()
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_argument('--disable-extensions')
    options.add_argument('--disable-gpu')
    options.add_argument('--disable-infobars')
    options.add_argument('--disable-notifications')
    options.add_argument('--disable-popup-blocking')
    options.add_argument('--disable-web-security')
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--no-sandbox')
    options.add_argument('--start-maximized')
    options.add_argument('--user-data-dir=/dev/null')
    options.add_argument('--proxy-server={}'.format(proxy_address + ':' + proxy_port))
    # NOTE: kept from the original; --proxy-auth may be ignored by Chrome,
    # in which case proxy credentials have to be supplied another way
    options.add_argument('--proxy-auth={}:{}'.format(proxy_username, proxy_password))
    # enable-automation is a switch to exclude; useAutomationExtension is an option to set to False
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('useAutomationExtension', False)

    # Initialize the Chrome browser with the options above
    driver = webdriver.Chrome(options=options)

    # Hide the navigator.webdriver flag by making it return undefined
    driver.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
        'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
    })

    # Override the user-agent
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    driver.execute_cdp_cmd("Network.setUserAgentOverride", {"userAgent": user_agent})

    # Visit the review page of a shop on Dianping.com
    url = 'https://www.dianping.com/shop/1234567/review_all'
    driver.get(url)

    # Add additional code here to perform the tasks you want
4.5 Summary
This code launches Chrome with options that hide Selenium features, set the proxy IP together with its username and password, and exclude or turn off some Selenium-related switches. It then uses execute_cdp_cmd to run Chrome DevTools Protocol commands: one makes navigator.webdriver return undefined, and Network.setUserAgentOverride changes the user-agent to the specified string. Finally, it visits a review page on Dianping, where you can add further code to perform the tasks you want.