Preface
Recently I had a small crawling task, so I want to summarize some basic usage of the scrapy framework to deepen my own impression. I am used to running crawlers from single script files, but when the amount of data is very large and the requirements on stability and efficiency are relatively high, scrapy is the more appropriate choice. Scrapy is an asynchronous framework in which requests are non-blocking. The same thing can be implemented in a single-file script, but the code turns out ugly and hard to maintain: a few days later you have forgotten how it works, and a complicated flow is very difficult to debug. I do not use scrapy much myself, but I think it is very good, with mature middleware support, a convenient downloader, and high stability and efficiency. On the other hand, its running process is a bit complicated and hard to understand, and debugging bugs in an asynchronous framework is troublesome.
Initialize scrapy
First, install the scrapy and selenium packages:
pip install scrapy
pip install selenium
Initialize the framework:
scrapy startproject testSpider
Then enter the project directory and create a new spider:
cd testSpider
scrapy genspider myspider example.com
Take a look at the directory structure:
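The generated project typically looks like this (the standard Scrapy layout; minor differences between versions are possible):

```
testSpider/
├── scrapy.cfg
└── testSpider/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── myspider.py
```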
Basic use of selenium
Selenium Preface
Today I will only cover the basic use of selenium; the workflow of the scrapy framework will be summarized later. Why use selenium inside scrapy? Because the interfaces of some target sites are very hard to reproduce through request analysis: they usually carry obfuscated parameters that get the request intercepted. When you run into such anti-crawler measures, you have to analyze the site's JavaScript and work out what each parameter means. That process is complicated and the workload is large; it is essential knowledge for advanced crawler work and requires some JavaScript reverse-engineering skill. For example, Ruishu Information is famous in the industry as a top-tier player in this space: its JavaScript obfuscation technology is used on many financial and government websites. I only understand a little of it myself.
With selenium you can bypass some heavily protected interfaces and still obtain the important information. The usual pattern is to use selenium to simulate logging in through the interface that carries the anti-crawler measures, grab the cookies after login, and then call the post-login interfaces, which have no such protection.
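A minimal sketch of that hand-off (the function name, URLs, and login steps here are placeholders of mine, not part of this article's project):

```python
import scrapy
from selenium import webdriver


def login_and_build_request(login_url: str, api_url: str) -> scrapy.Request:
    """Log in with a real browser, then reuse its cookies in a scrapy request."""
    driver = webdriver.Chrome()
    driver.get(login_url)
    # ... fill in the login form with driver.find_element(...) and submit ...

    # selenium returns a list of dicts like {'name': ..., 'value': ...};
    # scrapy's Request accepts that list directly, or a plain {name: value} dict
    cookies = {c['name']: c['value'] for c in driver.get_cookies()}
    driver.quit()
    return scrapy.Request(api_url, cookies=cookies)
```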
Download Driver
Using selenium requires a browser driver that matches your browser version. My browser is Chrome, so I first checked my Chrome version and downloaded the matching chromedriver. I then put the driver executable into the browser's installation directory and added that directory to the PATH environment variable.
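A quick way to verify that the driver is found through the environment variable is a minimal launch test (my own sanity check, not from the original article):

```python
from selenium import webdriver

# if this opens a browser window and prints the page title,
# chromedriver is installed and reachable through PATH
driver = webdriver.Chrome()
driver.get('https://example.com/')
print(driver.title)
driver.quit()
```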
Key code
The key code lives in testSpider/testSpider/spiders/myspider.py; the generated file currently looks like this:
```python
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
```
As the title of this article suggests, I only need to write code in the file above. If you want to use other crawling approaches, you also need to change the settings in testSpider/testSpider/settings.py. If you are interested, you can refer to my earlier article: Use the Scrapy framework to crawl V2ex to see what programmers are discussing during the Mid-Autumn Festival.
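For reference, these are the kinds of switches in settings.py that usually matter for this sort of crawl (illustrative values of my own, not taken from the original project):

```python
# testSpider/testSpider/settings.py (excerpt, illustrative values)
ROBOTSTXT_OBEY = False   # login-protected APIs are usually disallowed by robots.txt
COOKIES_ENABLED = True   # required so cookies passed to Request are actually sent
DOWNLOAD_DELAY = 1       # throttle requests to be gentle with the target site
```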
Let me show the code directly. I picked the Qiniu Cloud login because it is relatively simple, has few steps, and suits a tutorial. The details are explained in the comments:
```python
import scrapy
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    # Pay attention to the list of domains allowed to be crawled. I stepped into
    # a pit here: after trying for a long time I found that the framework's
    # default parse callback was never called. You must write the full domain of
    # the request URL here, not just the bare top-level domain.
    allowed_domains = ['portal.qiniu.com']
    start_urls = ['http://example.com/']
    user_name = '********@**.com'
    password = '********'
    chrome_options = Options()
    chrome_options.add_argument("--disable-gpu")
    driver = webdriver.Chrome(options=chrome_options)  # initialize the Chrome driver
    driver.implicitly_wait(20)
    headers = {
        'authority': 'portal.qiniu.com',
        'accept': '*/*',
        'accept-language': 'zh-CN,zh;q=0.9',
        'cache-control': 'no-cache',
        'referer': 'https://portal.qiniu.com/certificate/ssl',
        'sec-ch-ua': '".Not/A)Brand";v="99", "Google Chrome";v="103", "Chromium";v="103"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36',
    }  # request headers
    cookie = {}

    def __del__(self):
        self.driver.close()

    def parse(self, response, *args, **kwargs):
        print('Default parse callback, content returned by the interface:')
        print(response.json())

    def start_requests(self):
        self.driver.get(url='https://sso.qiniu.com/')  # open the login page directly
        user_input = self.driver.find_element(By.ID, 'email')  # locate the user name box
        user_input.send_keys(self.user_name)  # type the user name
        password_input = self.driver.find_element(By.ID, 'password')  # locate the password box
        password_input.send_keys(self.password)  # type the password
        self.driver.find_element(By.ID, 'login-button').click()  # log in
        try:
            # wait for the page to jump after login, giving up after 60 seconds
            WebDriverWait(self.driver, 60).until(
                EC.visibility_of_element_located((By.CLASS_NAME, "user-plane-entry")))
        except Exception:
            print('Login timed out, failed')  # waited more than 60 seconds
            self.driver.quit()
            return  # without a login there is nothing to crawl
        self.cookie = self.driver.get_cookies()  # cookies come back as name/value pairs
        print(self.cookie)
        print(self.headers)
        # yield the request into the asynchronous task queue
        yield scrapy.Request(url='https://portal.qiniu.com/api/gaea/billboard/list?status=1',
                             callback=self.parse,
                             cookies=self.cookie,
                             headers=self.headers)
```
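With the spider saved, it is run from the project root in the usual way:

scrapy crawl myspider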
Look at the effect in the log:
The content returned by the interface is printed correctly. If the interface is requested directly, without the login cookies, an error is returned instead.
Summary
Using selenium inside scrapy is a very common scenario, and today's post is only a brief summary. I will record and share the difficulties I encounter later, one by one. Stay tuned.